"Minerva attack can recover private keys from smart cards, cryptographic libraries", says the ZDNet headline. "The Czech team found a problem in the ECDSA and EdDSA algorithms used by the Atmel Toolbox crypto library to sign cryptographic operations on Athena IDProtect cards."

The Minerva research team's web page claims "practical recovery of the long-term private key" from various "implementations of ECDSA/EdDSA in programmable smart cards and cryptographic software libraries". Concretely:

- The team's experiments reportedly recovered an ECDSA key from a FIPS-certified CC-certified "Athena IDProtect card with CPLC data 010b.0352.0005".
- The team located seven other certified devices using the same ECDSA implementation. Presumably the attack also works against those devices. The Golem article on Minerva says "Einmal mehr wirft der Vorfall Fragen dazu auf, wie sinnvoll solche Zertifizierungen sind" [Once again, the incident raises questions about how useful such certifications are].
- The team's measurements make reasonably clear that a similar attack would work against the ECDSA implementations in libgcrypt 1.3.0 through 1.8.4, MatrixSSL through 4.2.1, SunEC/OpenJDK/Oracle JDK 7 through 12, and Crypto++ through 8.2.0. It's possible that the attack would also work against wolfSSL/wolfCrypt through 4.0.0, but the wolfSSL problem "would be very hard to exploit".

Wait a minute. If this was supposed to be an "ECDSA/EdDSA" attack, a shared "problem in the ECDSA and EdDSA algorithms", then why do these examples say just "ECDSA" and "ECDSA" and "ECDSA"?

Here's one theory you might come up with: "You're being selective in your summary. I looked through the Minerva page the day it was originally announced, and it also claimed to break the EdDSA implementation in libgcrypt."

Facts: The team did make this claim, but withdrew the claim two days later, for reasons that I'll explain below. I'm not being selective in my summary.

Here's another theory you might come up with: "Fundamentally, the attack applies equally to ECDSA and EdDSA. The only reason that there are more ECDSA attack reports is that ECDSA is more widely supported."

Facts: I looked at MatrixSSL, JDK, Crypto++, and wolfSSL/wolfCrypt. I noticed EdDSA (specifically Ed25519) implementations in everything except JDK. The Minerva team does not claim to break the EdDSA implementations in libgcrypt, MatrixSSL, Crypto++, and wolfSSL/wolfCrypt.

The Minerva web page also says that
OpenSSL, BouncyCastle, BoringSSL, libtomcrypt, Botan, Microsoft CNG, mbedTLS, Intel IPP-Crypto, and 11 cards
were "tested and found not vulnerable".
I noticed EdDSA implementations in
OpenSSL, BouncyCastle, BoringSSL, libtomcrypt, and Botan.
In total,
the Minerva team is claiming to break ECDSA in 4 or possibly 5 out of 13 crypto libraries supporting ECDSA,
and EdDSA in 0 out of 9 crypto libraries supporting EdDSA.
The bottom line is that the attack *doesn't* apply equally to ECDSA and EdDSA.

I'm not saying that it's impossible to implement EdDSA insecurely.
Of course it's *possible* to implement EdDSA insecurely.
Maybe there are people using insecure EdDSA implementations today.
But the Minerva data points support the theory that
**EdDSA implementations are less likely to be broken than ECDSA implementations**.
This isn't an accident: it's how EdDSA was designed in the first place.

In the rest of this blog post, I'll take a closer look at the information exploited by the Minerva attack, and at the EdDSA implementations mentioned above. Thanks to Tanja Lange for help with this blog post.

**Discrete logarithms.**
In
ElGamal's signature system,
the signer's public key is g^{x} mod p.
Here p is a standard prime,
g is a standard generator mod p,
and x is the signer's long-term signing key,
chosen uniformly at random from {0,1,...,p−2}.

Computing (g^{x})^{(p−1)/2} mod p reveals x mod 2:
it's 1 if x is even, or p−1 if x is odd.
More generally, for any divisor q of p−1,
one can check a guess for x mod q as follows:
x is congruent to (say) 7 modulo q
if and only if (g^{x})^{(p−1)/q} = (g^{7})^{(p−1)/q}.
This isn't helpful when q is large:
any particular guess for x mod q is very unlikely to be correct,
and ruling out one guess still leaves many more possibilities.

The Pohlig–Hellman discrete-logarithm algorithm is faster,
finding x in only about sqrt(q) operations where q is the largest prime dividing p−1,
but let's assume that all prime divisors of (p−1)/2 are large.
There are also discrete-logarithm algorithms that take time subexponential in the number of bits of p,
but let's assume that p is so large that this isn't a threat.
Shor's quantum discrete-logarithm algorithm takes polynomial time,
but I'll assume throughout this blog post that the attacker doesn't have a quantum computer.
It then seems very difficult to learn anything about x mod (p−1)/2 from g^{x} mod p.

**Linear equations in discrete logarithms.**
The first step in signing a message in ElGamal's system is
to choose a "random number k, uniformly between 0 and p−1, such that gcd(k,p−1) = 1."
The gcd condition forces k to be odd.
We're assuming that all prime divisors of (p−1)/2 are large,
so the gcd condition is very unlikely to have any other effect.

ElGamal's signatures reveal the coefficients M,R,S of a linear equation M=xR+kS modulo p−1. This is equivalent to an equation M=xR+kS modulo each of the prime-power divisors of p−1. In particular, M=xR+kS=xR+S modulo 2, revealing x mod 2 if R is odd; but we already knew x mod 2, as explained above.

The rest of the equation is M=xR+kS modulo (p−1)/2,
or equivalently M=xR+kS modulo q for each prime-power divisor q of (p−1)/2.
This doesn't seem helpful for the attacker,
since k mod q could be *almost* anything.
For each prime q dividing (p−1)/2,
the attacker knows that k isn't 0 modulo q,
i.e., that M isn't xR modulo q,
which (assuming R is invertible modulo q)
is the same as ruling out M/R as a possibility for x mod q;
but, again, ruling out one guess for x mod q still leaves many more possibilities.

Another signature from the same signer reveals another equation M_{2}=xR_{2}+k_{2}S_{2}.
ElGamal emphasizes that the signer must choose a new k_{2} for M_{2},
rather than taking k_{2}=k,
since taking k_{2}=k allows the attacker to solve these two linear equations for x.
Repeating k is what
destroyed the security of the Sony PS3
many years later.

Taking a new k_{2} seems to eliminate this problem:
each possible x is compatible with some solution (k,k_{2}) to the two equations,
except for the invertibility issue described above.
More generally, ElGamal concludes that these equations for any number of signatures
are compatible with a huge number of possibilities for x
as long as "each signature uses a different k".

**Schnorr's use of a prime-order subgroup.**
There's a lot to be said about
improvements from ElGamal's system to the Schnorr system and EdDSA.
One of the improvements, simplifying the security analysis,
is to take g as a generator of a subgroup of prime order q
rather than as a generator of the entire group of order p−1.

Each Schnorr signature reveals the coefficients M,R,S of a linear equation M=xR+kS modulo q. Almost certainly S is nonzero modulo q, so let's simplify this linear equation to Y=xX+k modulo q, where Y=M/S and X=R/S. Schnorr actually obtains X and Y in a simpler way, eliminating all divisions and in particular eliminating the need to require gcd{k,q}=1. Schnorr takes k as a random number in the set {1,...,q}.

**How NSA biased the linear equations in DSA.**
NSA's original DSA proposal
uses prime-order subgroups as in Schnorr's system,
revealing an equation Y=xX+k modulo q for each signature.
Other DSA details are closer to ElGamal's system,
in particular using divisions by k modulo q,
so k is specified to be in the set {1,...,q−1}.
The same comments apply to the subsequent NIST standards for DSA and ECDSA.

NSA's proposal included a specific prime q,
namely 0xb20db0b101df0c6624fc1392ba55f77d577481e5,
about 0.7 times 2^{160};
and a specific prime p,
which I won't repeat here,
and which is around 0.8 times 2^{512}.
A subsequent update included another prime q,
namely 0xc773218c737ec8ee993b4f2ded30f48edace915f,
about 0.8 times 2^{160};
and another 512-bit prime p.
All of these primes were generated by hashing
in a way that seems to make it very difficult to target particular choices of p and q.
(As a side note, hashing was also used to generate shared parameters
for various other cryptosystems,
but was avoided in NSA's subsequent
Dual EC DRBG proposal.)
All of these primes are much too small to resist discrete-log attacks,
but the problems I'm about to describe also show up for larger DSA parameters.

How do we generate a uniform random element k of {1,...,q−1}?
If q is, say, 2^{160}+1 then this is easy:
generate 160 random bits, interpret the result as an integer, and add 1.
But 2^{160}+1 isn't prime, and, more to the point, isn't either of the random-looking primes q mentioned above.

The literature has many good answers to this question,
and then it has NSA's answer, which I'll get to in a moment.
One good answer, rejection sampling, is to generate a uniform random 160-bit integer
(let's assume q is between 2^{159} and 2^{160})
and start over if the integer is outside {1,...,q−1}.
Another good answer, with more predictable performance,
is to generate a uniform random 320-bit integer and reduce that integer modulo q.
The resulting distribution isn't *exactly* uniform but it's so close that you'll never be able to tell the difference.
Technically, this doesn't guarantee that k is in {1,...,q−1},
but you'd have to be amazingly unlucky for your 320-bit integer to be a multiple of q.

NSA's answer is to take a much shorter integer,
a 160-bit (not 320-bit) hash of some new secret,
and reduce the hash modulo q.
Presumably the 160-bit hash is hard to distinguish from a uniform random element of
{0,...,2^{160}−1},
but the hash modulo q looks quite different from a uniform random element of {1,...,q−1}.
For example, if q is NSA's 0xb20db0b101df0c6624fc1392ba55f77d577481e5,
then any integer up through 0x4df24f4efe20f399db03ec6d45aa0882a88b7e1a is hit by *two*
elements of {0,...,2^{160}−1},
while any larger integer is hit by only *one* element of {0,...,2^{160}−1}.

**Interlude: Understanding the importance of distributions for security.**
Attack algorithms often work better for some inputs than for others.
For a proper security analysis one has to be clear about the distribution of inputs.
Changing the distribution can increase or decrease security levels,
and can easily make the difference between a broken system and an unbroken system.

Here's a non-DSA example.
How difficult is it, by known attacks,
to find all of the prime factors of an integer N in {1,...,2^{2048}}?
This question is incomplete, because the distribution of inputs N is unspecified.
Here are four possibilities for the distribution:

- Uniform random element of {1,...,2
^{2048}}. Often N has only one large prime factor, and then ECM factors N quite easily. - Narrower: Product of two independent uniform random primes in {1,...,2
^{1024}}. This is harder to factor, but occasionally N is, say, 30 bits smaller than expected, which should make NFS almost twice as fast. - Even narrower: Product of two independent uniform random primes in {2
^{1023},...,2^{1024}}. This eliminates all of the small cases described above, and is the hardest factoring problem on this list. - Even narrower: Product of two 1024-bit Infineon primes. This is much easier to factor.

These examples show how unsafe it can be to assume that different distributions have the same security level. Comparing the first, second, and third cases shows that reducing randomness can improve security. Comparing the second and third cases to the fourth case shows that reducing randomness can damage security.

Note that these examples rely critically on understanding which inputs are broken by various factorization algorithms. Don't believe people who try to replace serious algorithm analysis with some superficial rule that more randomness is always better. See also Section 11.1 of 2008 Koblitz–Koblitz–Menezes for more subtle examples.

**Security analysis of DSA biases, part 1: lattice attacks.**
Back to DSA.
The security of a *single* signature,
and in particular the amount of information revealed by the equation Y=xX+k modulo q,
can't be affected much by NSA's change in the distribution of k.
An attack that works with a particular probability against NSA's uniform random 160-bit k
also has at least half as much probability of working against ElGamal's uniform random nonzero k modulo q.
The point is that NSA produces any particular integer modulo q with probability at most 1/2^{159}
(assuming q is at least 2^{159}),
while ElGamal produces the same integer with probability 1/(q−1),
which is at least 1/2^{160} (and even somewhat larger than this for the choices of q mentioned above).
Whichever NSA choices of k are affected by the attack
have at least half as much chance of appearing in ElGamal's choices of k.
(This is an example of a "divergence argument".)

The same argument also says that NSA can't lose more than 2 bits of security for 2 signatures, more than 3 bits of security for 3 signatures, etc. But this is a useless guarantee for a signer generating, say, 1000 signatures.

ElGamal went beyond a single equation
and considered the whole system of equations revealed by the signer:
Y_{1}=xX_{1}+k_{1} modulo q,
Y_{2}=xX_{2}+k_{2} modulo q,
Y_{3}=xX_{3}+k_{3} modulo q,
etc.
Any particular k from NSA has nearly as much chance of having been generated by ElGamal,
but the whole sequence (k_{1},k_{2},k_{3},...) from NSA is much less likely to be generated by ElGamal.
(The "divergence" grows rapidly with the number of signatures.)
What effect does this gap have on security?

What we know about NSA's (k_{1},k_{2},k_{3},...) is that each k_{j} favors elements of {0,...,2^{160}−1−q}
over elements of {2^{160}−q,...,2^{160}−1}.
Let's consider a simpler and larger bias,
where each k_{j} is guaranteed to be in a smaller range {0,...,2^{b}−1}.
The attacker knows that Y_{j}=xX_{j}+k_{j} modulo q,
i.e., that the vector (Y_{1},Y_{2},Y_{3},...) is the surprisingly short vector (k_{1},k_{2},k_{3},...) plus x(X_{1},X_{2},X_{3},...) plus a multiple of q in each coordinate.
In other words,
(Y_{1},Y_{2},Y_{3},...) is surprisingly close to the public lattice consisting of all multiples of (X_{1},X_{2},X_{3},...) plus all multiples of q in each coordinate.
Finding the lattice vector surprisingly close to (Y_{1},Y_{2},Y_{3},...) immediately reveals x.

Howgrave-Graham and Smart
introduced this lattice view of biased DSA in 1999,
and reported being able to quickly find x from "only 30 signed messages" with 152-bit k_{j},
using a simple algorithm to find the close lattice vector.
Howgrave-Graham and Smart also pointed out that there's nothing special about having the k_{j} range start from 0.
For example,
if one knows that k_{1} is in the interval {37*2^{152},...,38*2^{152}−1},
then one can subtract 37*2^{152} from Y_{1} and from k_{1},
reducing to the case that k_{1} is in the interval {0,...,2^{152}−1}.
Or one can reduce to the case that k_{1} is in the balanced interval
{−2^{151},...,2^{151}−1},
which is even better for the lattice attacks.

"It is important that no bits of the ephemeral keys are leaked for whatever reason",
Howgrave-Graham and Smart wrote.
If you assume that "important" means "important for security against known attacks",
then you can object that Howgrave-Graham and Smart were overstating their results.
They had reported an attack using 8 bits of each k_{j};
this does not imply an attack using, say, 4 bits of each k_{j}.
Followup papers plugged fancier lattice algorithms into the same ideas,
showing
that an attacker knowing 4 bits of each k_{j} could quickly find x,
but you can still object that this doesn't apply to
NSA's leak of roughly 0.6 bits of information about each k_{j}.

On the other hand,
as a cryptosystem designer,
I don't want security reviewers to have to spend time analyzing fancy attacks
that I could have eliminated by tweaking my design.
Security review is the best protection that users have against attacks;
security failures come primarily from having security review spread too thin.
From this perspective,
it *is* important to eliminate the leak of information about k.

**Security analysis of DSA biases, part 2: Bleichenbacher's attack.**
In October 2001, NIST issued
a new DSA standard with a change notice
saying that an "unpublished attack" had broken NSA's biased DSA randomness
using a "workfactor of 2^{64}" and "2^{22} known signatures".
This statement doesn't make clear what the units are for measuring this "workfactor",
i.e., how complicated each of the 2^{64} steps is.
The statement doesn't even make clear whether the attack is for NSA's 512-bit primes
(which were already known to allow feasible discrete-logarithm attacks),
or for 1024-bit primes, or both.

Earlier news
had attributed the attack to Bleichenbacher,
and had quoted "Edward Roback, chief of the Computer Security Division in NIST's Information Technology Laboratory"
as saying that
"those who are using DSA can continue to use it with confidence that DSA signatures done under the present standard will remain secure for many more years".
I'm not aware of any publications explaining why NIST thought
that attackers couldn't reach a "workfactor of 2^{64}".

Bleichenbacher privately reported his attack to an IEEE working group in November 2000, but as far as I know didn't immediately post the slides or any other attack details. A small amount of information about the attack appears in Section 5 of a 2001 report by Vaudenay. Bleichenbacher's slides (TIFF; here's a PDF version) were posted in late 2004. The first serious writeup I saw was a 2013 paper by De Mulder, Hutter, Marson, and Pearson.

Here's a concrete example to illustrate some of the techniques.
Bleichenbacher numerically measures the bias of k as the average over k of the complex number exp(2πik/q).
If k were uniformly distributed modulo q
then this complex number would be equally spaced around the unit circle in the complex plane
and the average would be 0.
However, k isn't uniformly distributed modulo q.
Let's take NSA's original q, around 0.7 times 2^{160};
a short calculation shows that
the bias of k is around 0.05+0.21i.

If k_{1} and k_{2} are independent identically distributed variables
then the bias of k_{1}+k_{2} is the average of exp(2πi(k_{1}+k_{2})/q) = exp(2πik_{1}/q)exp(2πik_{2}/q),
which factors as the bias of k_{1} times the bias of k_{2}.
Similarly, the bias of k_{1}−k_{2} is the bias of k_{1} times the complex conjugate of the bias of k_{2}.

Consider many equations Y_{j}=xX_{j}+k_{j} modulo q from, say, 2^{40} signatures under one key.
Assume that each X_{j} is reduced modulo q to the range {0,1,...,q−1}.
Sort the equations into increasing order of X_{j}.
Consider a difference of two adjacent equations:
Y_{j+1}−Y_{j}=x(X_{j+1}−X_{j})+(k_{j+1}−k_{j}).
Presumably each difference X_{j+1}−X_{j} will be on the scale of q/2^{40};
a more precise analysis would look at the distribution of these differences.
Meanwhile the difference k_{j+1}−k_{j} would have bias (0.048+0.21i)(0.048−0.21i), around 0.047,
if k_{j+1} and k_{j} were independent;
presumably the sorting of X_{j} doesn't noticeably change this bias.

Repeat this sort-and-difference process three times,
reducing each X first to about 120 bits with k bias around 0.047,
then to about 80 bits with k bias around 0.047^{2},
then to about 40 bits with k bias around 0.047^{4}, i.e., around 2^{−17.6}.
Write the resulting equations as Y'_{j}=xX'_{j}+k'_{j},
where each X'_{j} has about 40 bits, and each k'_{j} has bias around 2^{−17.6}.

Say we have a *rough* guess G for x,
and we compute the average A(G) of exp(2πi(Y'_{j}−GX'_{j})/q),
i.e., the average of exp(2πi(x−G)X'_{j}/q) exp(2πik'_{j}/q).
For most choices of G, this is an average of 2^{40} random-looking points on the unit circle,
so presumably it has size around 2^{−20}.
However,
if G is within 2^{115} of x
then the differences (x−G)X'_{j} are bounded by 2^{155}
(this is where the smallness of X'_{j} is important),
so exp(2πi(x−G)X'_{j}/q) is close to 1,
so presumably A(G) is close to the bias of k'_{j}, around 2^{−17.6},
which is noticeably bigger than 2^{−20}.

Now try, e.g., G=0, G=q/2^{45}, G=2q/2^{45}, G=3q/2^{45}, etc.
It's easy to use a fast Fourier transform to compute A(G) simultaneously for all such G
with only about 2^{50} multiplications of complex numbers.
Hopefully the largest A(G) value will come from some G close to x,
revealing the top bits of x,
and then one can repeat to find the remaining bits of x.
On the other hand, maybe 2^{−20} is too much noise compared to 2^{−17.6}.
One can try several of the largest A(G) values,
or reduce the noise by using more equations,
as I'll now explain.

How can we generate more equations without starting from more signatures?
One approach is to go beyond adjacent differences X_{j+1}−X_{j},
also considering X_{j+2}−X_{j} and X_{j+3}−X_{j} and so on up to some size limit.
Presumably there are about 2^{50} differences below, say, q/2^{30},
and then about 2^{50} differences below q/2^{80} at the next level,
and then about 2^{50} differences below q/2^{130} at the next level.
It's tricky to analyze the randomness of lists that include,
e.g., the difference X_{1}−X_{0}
and the difference X_{2}−X_{1}
*and* the difference X_{2}−X_{0}, which is the sum of X_{1}−X_{0} and X_{2}−X_{1};
one should carry out experiments
to see how well the algorithm actually works.
(One should carry out experiments even
for algorithms where someone claims to have a complete analysis.)

Another approach is to start with only 2^{34} signatures and consider all sums X_{j}+X_{i} mod q.
Presumably there will be almost 2^{67} different sums,
and more than 2^{50} cases where two sums are within q/2^{80} of each other,
and more than 2^{50} differences below q/2^{130} at the next level.
As Bleichenbacher notes in his slides,
there are standard techniques to find nearby sums with far less memory than storing and sorting all the sums.
(For each j, build a low-memory generator that outputs X_{j}+X_{i} for all i in increasing order,
and then merge the outputs of these generators.
Bleichenbacher credits 1981 Schroeppel–Shamir.
My sortedsums paper
cites a 1973 Knuth exercise crediting William S. Brown.
It wouldn't be surprising if there are even earlier references.)

Or, starting with fewer signatures, we could consider all sums of four X values,
and directly look for 2^{50} sums within q/2^{130} of each other.
This is essentially the same as looking for 2^{50} sums that collide on the top 130 bits.
If you think it's obvious how quickly such searches can be carried out,
then try to explain why the
2010 Howgrave-Graham–Joux subset-sum algorithm has exponent around 0.337n,
and why the
2011 Becker–Coron–Joux subset-sum algorithm has exponent around 0.291n,
and why the
2019 Esser–May subset-sum algorithm has exponent around 0.255n.

**An even larger cloud of attacks.**
Back in October 2000,
a month before Bleichenbacher's IEEE presentation,
Blum, Kalai, and Wasserman (BKW)
had introduced an algorithm to find a secret vector x of integers modulo 2,
given noisy dot products xX_{1}+k_{1}, xX_{2}+k_{2}, etc.
Each X_{j} is a random public vector of the same length as x,
and each k_{j} is a bit set with some specified probability smaller than 1/2.

The BKW algorithm sorts X_{j} according to their leading bits,
and then computes successive differences X_{j+1}−X_{j},
obtaining noisier dot products x(X_{j+1}−X_{j})+(k_{j+1}−k_{j}).
The BKW algorithm repeats this sort-and-difference process
to clear more and more bits of X,
at the expense of more noise.
Eventually it finds many noisy dot products x(0,0,...,0,1)+k'_{j},
takes the majority vote of these dot products to see the last bit of x, and similarly finds the other bits of x.

The problem considered by BKW, where x is a vector modulo 2 and some k_{i} bits are set,
is called the "learning parity with noise" (LPN) problem.
In the "learning with errors" (LWE) problem,
x is a vector modulo q and each k_{i} is a small integer.
LWE for length-1 vectors, the target of Bleichenbacher's attack,
is called the "hidden-number problem" (HNP),
an example of how scientific terminology
can be selected to minimize the amount of information communicated to the uninitiated.

Followups to the BKW paper adapted the BKW idea to the LWE problem
and to various lattice problems,
and eventually added further speedup ideas that Bleichenbacher had used,
such as fast Fourier transforms ("fast Walsh transforms" in the mod-2 case) and non-adjacent differences.
An analysis in 2015 by
Herold, Kirshanova, and May
showed that one of the latest BKW variants
sometimes had better asymptotics than basis reduction
as an attack against reasonable-looking LWE parameters for lattice-based cryptography;
look at Figure 1 in the paper to get a quick idea of how close the competition is.
Papers are continuing to appear with basis-reduction speedups *and* with speedups for various BKW/Bleichenbacher-type algorithms.
The literature is complicated, and is continuing to become more complicated.

As a cryptosystem designer, I'm much happier with the literature on conservative prime-field ECC. There's a much simpler state-of-the-art attack algorithm that includes all known optimizations. There's a convincing high-precision analysis of the performance of the algorithm. The literature has clear explanations of quantitative obstacles to other attack ideas such as pairings and decomposition, and has no ideas for getting around these obstacles. But NSA's DSA biases spoil the simplicity of the attack picture, forcing the security reviewer to consider a much more complicated collection of worrisome attack strategies.

**Overconfidence in limits on the number of signatures.**
NIST's 2001 change notice for the DSA standard
provided two options for users to "defend" against Bleichenbacher's attack.
The first option was for users to continue using NSA's biased DSA randomness
but sign at most 2^{21} signatures per key,
half of the "2^{22} known signatures" used by the "unpublished attack".

Why did NIST think that a limit of 2^{21} signatures would be safe?
Was there some quantitative analysis saying that reducing 2^{22} signatures to 2^{21} signatures
would increase the attack cost to something clearly infeasible?
And what made NIST confident that a new "unpublished attack" was optimal?

Bleichenbacher needs enough equations to compensate for the noise.
Sorting equations and then subtracting *adjacent* equations, as in the BKW paper,
doesn't expand the number of equations available.
But Bleichenbacher's slides were explicitly generating
far more equations as sums of small non-adjacent subsets of the original equations.
I don't understand how anyone could have been confident regarding the tradeoffs here,
even before seeing this decade's breakthroughs in subset-sum algorithms.

Here are further examples to illustrate the lack of maturity of the analysis of these tradeoffs. In 2012, Damgård and Park proposed some cryptosystems based on the supposed hardness of the following example of LPN. There is a secret n-bit vector x, where n is 768. The attacker sees a limited number of noisy dot products xX+k, where X is public and the noise k is 1 with probability 1.5%. The limit is a small constant times n, where the constant depends on the cryptosystem variant. The original version of the paper stated that BKW needs as input a gigantic number of "samples" xX+k, while, for the paper's cryptosystems, "the adversary only gets O(n) samples, which is the hardest case, so this suggests that relatively small values of n might be sufficient for security". The paper was then revised to account for "recent attacks on LPN that invalidated the conclusions of the previous version". See, e.g., my Ring-LPN attack paper with Lange.

In the LWE context, a 2014 paper by Albrecht, Faugère, Fitzpatrick, and Perret stated that "one of the main remaining obstacles for applying the BKW algorithm to cryptographic constructions based on LWE is that it requires an unbounded number of samples to proceed. Lifting this requirement, if only heuristically, is hence a pressing research question." Followup papers then pointed out that one could use far fewer samples.

**Safe fixes for DSA.**
Instead of trying to pretend that the limits of these attacks are clear,
let's eliminate the applicability of the attacks to DSA,
by fixing the problem at its source.
The problem is that—for NSA's choices of q around, say, 0.7 or 0.8 times 2^{160}—taking
160 bits of randomness modulo q
produces a biased result.

If we modify q, moving it extremely close to 2^{160} or to 2^{159},
then the bias rapidly disappears.
This is also naturally accomplished by ECDSA
*if* we choose a prime p extremely close to 2^{160}
(rather than, say, the Brainpool "random" p)
and choose an elliptic curve with a power-of-2 cofactor such as 1 or 2 or 4 or 8:
q will then be extremely close to 2^{160} or 2^{159} or 2^{158} or 2^{157} respectively.

The bias also rapidly disappears if we take more bits of randomness.
As I mentioned above, if q is between 2^{159} and 2^{160},
then you'll never be able to tell the difference between a uniform random 320-bit integer mod q
and a uniform random element of {0,...,q−1}.
This was NIST's second option for users to protect themselves:
concatenate *two* 160-bit hashes to obtain a 320-bit integer.

Ed25519 has both of these defenses.
It uses Curve25519, with a subgroup of prime order extremely close to 2^{252}.
It specifies k as a deterministic, easily testable SHA-512 hash of a secret and the message being signed.
Of course, Ed25519 also moves from NSA's
dangerously small cryptosystem sizes
up to a safe security level.

**Timing attacks against DSA.**
The reason this isn't the end of the story is that mathematical biases in the construction of the secret k
aren't the only way to leak bits of k.
For example,
a 2011 paper by
Brumley and Tuveri
steals "the private key of a TLS server where the server authenticates with ECDSA signatures".
The attack was demonstrated against OpenSSL using the NIST B-163 curve.
The choice of q here isn't the issue; q is extremely close to a power of 2.

The paper title is "Remote timing attacks are still practical".
OpenSSL computed the kth multiple of an elliptic-curve point
with an algorithm whose time revealed the position of the top bit in k:
the algorithm took the maximum time if k was between 2^{b−1} and 2^{b},
less time if k was between 2^{b−2} and 2^{b−1}, etc.
Sometimes timings would show, e.g., that k is below 2^{b−8},
putting the attacker into the situation of the fast Howgrave-Graham–Smart attack.

This was just one in a long line of papers using timing attacks to break DSA, ECDSA, etc. There are, for example, papers extracting keys more efficiently from timing information, and there are papers collecting such precise timing information that key extraction becomes trivial, and now there's Minerva applying timing attacks to a remarkable number of different targets.

**How to stop timing attacks.**
A programmer who
knows enough about CPU timing
can systematically write software that avoids all data flow from secrets to timings.
For example,
all of the C and assembly implementations from the Ed25519 team are written in this way,
as are the NaCl
and BearSSL crypto libraries.

(One can also randomize computations to try to obscure the information leaked through timings:
e.g., add a random 512-bit multiple of q to k before scalar multiplication.
However, it's very difficult for the implementor, and for reviewers,
to figure out how secure randomization is against timing attacks.
A better argument for randomization is that attackers sometimes have access to more invasive side channels
such as power measurements:
truly constant power seems very difficult to achieve,
and randomization seems to be our best bet.
I advise against randomization as a *replacement* for constant-time software,
but I don't object to randomization as an *extra* line of defense whenever the user can afford it.)

Saying that implementors *can* write constant-time software
doesn't mean that they *do* write constant-time software.
I see three basic reasons for this.
First, there's a huge amount of variable-time software in existing crypto libraries,
and getting rid of all of this software is a correspondingly huge effort.
Second, many popular cryptographic functions produce unacceptable slowdowns
when they're implemented in constant time.
Third, and most importantly, many popular cryptographic functions are much *simpler* to implement in variable time.

I noticed many years ago that the evolution of computer architecture was making it easier and easier
to resolve the tension between security and speed:
there's a synergy between what I want to do to avoid timing leaks
and what chip designers want to do for performance.
I've set various cryptographic speed records using constant-time software.
Sometimes there are still some minor speedups possible with variable-time software,
but the tension is much smaller than it was,
making it hard for people to argue that these speedups should outweigh security—especially
since computation in general is becoming less and less expensive.
I'm not saying that *all* cryptographic designs have such small penalties for constant-time implementations;
having the freedom to throw away NSA's designs has been an important part of this work.

I've also tried to resolve the tension between security and simplicity. Software for X25519, Ed25519, etc. does everything the typical user needs, is much simpler than traditional software aiming at the same security goals, and would gain very little in simplicity from using variable-time algorithms.

But now consider a crypto library that for years has included variable-time software
for NSA's elliptic-curve cryptosystems,
and that wants to keep supporting those cryptosystems for whatever reason,
but that also wants to add support for Ed25519.
Different notions of simplicity then produce different results.
It's simple for the library to copy an existing constant-time Ed25519 implementation;
this implementation might be close to the simplest possible code for Ed25519;
but the library author might think that there are simpler ways to
*simultaneously support Ed25519 and NSA's systems*.

**Case study #1: libgcrypt.**
In libgcrypt,
there's a function
`_gcry_mpi_ec_mul_point`
that computes kP on an elliptic curve E, given k and P and E as inputs.
The main loop inside the function has length
`mpi_get_nbits(k)`,
a simple example of what one must not do in constant-time software.
Actually, the function has different loops for different types of curves,
and a complete description of the loop length is more complicated,
but what matters is that a smaller k runs measurably faster.

The real fix, the constant-time approach,
would start by changing the interface to replace
`mpi_get_nbits(k)`
with a `maxscalarbits` specified by the caller.
But this would require going through dozens of functions that call
`_gcry_mpi_ec_mul_point`
and figuring out the appropriate `maxscalarbits` for each.
This is an example of tension between simplicity and security.
Even a small tension can push implementors to do something insecure.
The tension increases as more and more elliptic-curve protocols are implemented on top of the existing
`_gcry_mpi_ec_mul_point` interface.
It's not surprising that libgcrypt still has the old variable-time interface.

At this point one might think that, well, libgcrypt is leaking the number of bits of k, so the standard lattice attacks apply, whether the protocol is ElGamal or Schnorr or DSA or ECDSA or EdDSA. This is also what Minerva claimed at first. But let's take a closer look at how large k is.

The DSA and ECDSA standards say that k is between 0 and q.
By now there are several different standard methods for generating k.
libgcrypt's `_gcry_dsa_gen_k` uses one of these methods,
namely rejection sampling,
which I described above.

EdDSA uses a different name "r" for k,
and says that "r" is SHA-512 applied to various specified inputs.
This is what libgcrypt implements:
hashing the inputs with
`_gcry_md_hash_buffers`,
then reading the 64-byte result into a multiprecision integer `r`
with
`_gcry_mpi_set_buffer`,
then calling
`_gcry_mpi_ec_mul_point`.

Now `_gcry_mpi_ec_mul_point` leaks whether "r" has 512 bits, 511 bits, 510 bits, etc.
But this is useless for the attacker.
Say "r" has only 500 bits;
the attacker still has no idea what "r" mod q might be.
This is why Minerva
works against libgcrypt's implementation of ECDSA
and doesn't work against libgcrypt's implementation of EdDSA.

The situation would have been different if libgcrypt had inserted an extra line of code
to reduce modulo q before calling
`_gcry_mpi_ec_mul_point`.
But this wouldn't have been as simple,
so it's not such a surprise that this didn't happen.
(The extra reduction work would have produced a speedup,
but copying existing constant-time Ed25519 software would have produced a much larger speedup.)

I described this scenario in a CFRG message five years ago:

The best protection we know is to generate much longer nonces—such as the 512-bit nonces in Ed25519. Then the system isn't broken by a timing attack revealing the nonce length. Setting the high bit of such a long nonce is also perfectly safe.

Of course, an implementor can still get in trouble by (1) reducing these nonces and then (2) leaking the lengths of the reduced nonces through a variable-time scalarmult method. So, as another line of defense, we also choose curves that support simple, fast, constant-time ladders.

libgcrypt is in the scenario of the first paragraph: the libgcrypt timing reveals the number of leading zeros in Ed25519's 512-bit hash, but this doesn't break the system. libgcrypt doesn't reduce the nonces, so it doesn't get into the trouble described in the second paragraph.

There have been many other timing-attack vulnerabilities in libgcrypt,
and clearly there will be more.
We have to throw away variable-time crypto code *without* waiting for attacks to be demonstrated.
For example,
libgcrypt should use the constant-time ladders supported by Curve25519.
But this doesn't mean that Minerva broke libgcrypt's Ed25519 implementation:
libgcrypt was saved by another Ed25519 feature, the double-size hash.

**Case studies #2, #3, #4, #5, #6, #7, and #8.**
A quick skim suggests that
BoringSSL, Botan, Crypto++, libtomcrypt, MatrixSSL, OpenSSL, and wolfSSL/wolfCrypt
all started with upstream Ed25519 implementations that advertise being constant-time,
such as
`ref10` from the Ed25519 team,
`donna` from
Adam Langley and Andrew Moon,
`tweetnacl` from the
TweetNaCl team,
and
`fiat` from the
Fiat Crypto team.

Of course, a quick skim isn't the same as a serious security review. The sheer volume of code is horrifying for the reviewer. Unlike the upstream implementations, these seven crypto libraries don't declare time variability to be a bug (unless the bug is shown to be exploitable and, in the case of OpenSSL, remotely exploitable), so a reviewer who finds time variability isn't automatically rewarded with an acknowledgment of a valid bug report. The libraries have so much existing variable-time code that they're scared to declare time variability to be a bug.

So I'm certainly not guaranteeing that timing attacks will fail against all these Ed25519 implementations.
But, given how much of the code in these Ed25519 implementations is coming from people
who say they were engineering it to be immune to all timing attacks,
it wouldn't be surprising if it *is* immune to all timing attacks.

Does the prevalence of constant-time implementations of a primitive mean that the primitive encourages such implementations? Superficial answer: Yes. New implementations can simply copy the existing constant-time implementations. This is what these seven libraries did, protecting the users. (But, again, one has to verify that the upstream implementations really are constant-time, and that this wasn't screwed up by whatever modifications happened as part of the library integration.)

Deeper answer: Not necessarily. Maybe the primitive is painful to implement in constant time. Maybe the first implementors suffered through doing this, but new implementations will cut corners.

In the case of Ed25519, it's clear from a detailed design comparison that Ed25519 is much friendlier to constant-time implementations than NSA ECC is. This doesn't guarantee constant-time implementations, but it does make them more likely.

**Case study #9.**
Finally,
I looked briefly at BouncyCastle,
which wrote everything from scratch in Java.
BouncyCastle is full of variable-time code.
I don't immediately see why BouncyCastle wasn't broken by Minerva.
BouncyCastle uses scalars smaller than q for scalar multiplication
for ECDSA *and* EdDSA, unlike libgcrypt,
so BouncyCastle *isn't* protected by the double-size hash.

If BouncyCastle's implementations of ECDSA and EdDSA *are* broken by timing attacks
then the tallies turn into
13 crypto libraries supporting ECDSA with timing attacks on 5 or possibly 6,
and
9 crypto libraries supporting EdDSA with timing attacks on just 1.
This still sounds like a big win for EdDSA overall,
but it's interesting to contemplate ways that we could have saved BouncyCastle too.

Why didn't BouncyCastle use a simple constant-time ladder? This is clear from the code: BouncyCastle has many complications that are aiming for speed at the expense of simplicity. In particular, BouncyCastle sacrifices a small amount of code simplicity to reduce k mod q, and sacrifices even more code simplicity to use precomputed tables of various multiples of the fixed base point P. These steps obviously make signatures several times faster.

Precomputed tables are still compatible with constant-time code, as shown in previous Ed25519 implementations. Why didn't BouncyCastle use this approach? This isn't as clear. Variable-time fixed-base scalar multiplication is at best marginally faster than constant-time fixed-base scalar multiplication. Perhaps there haven't been enough tutorials showing how to implement constant-time fixed-base scalar multiplication.

We don't want reviewers to have to analyze the security consequences
of *any* timing leaks,
so, at a lower level,
it's also important to perform field arithmetic in constant time.
BouncyCastle doesn't do this:
it implements prime-field arithmetic on top of a big-integer library
that computes the size of each integer at run time.
But why didn't BouncyCastle copy constant-time arithmetic mod 2^{255}−19
from previous implementations,
as a fast path for Ed25519 separate from the existing big-integer code?

One possible answer is along the following lines:
"A crypto library should support 82 elliptic curves
but, for simplicity, shouldn't separately implement 82 finite fields."
I would say that, for simplicity and security,
a crypto library should *not* support 82 elliptic curves.
I think we've made some progress in convincing the community of this.
There has also been considerable progress in building automatic generators for finite-field code,
although there is still much more work to be done.

[**2019.11.01 correction:**
BouncyCastle
*does* have a separate implementation of
arithmetic
modulo 2^{255}−19,
and higher-level implementations of
X25519
and
Ed25519
on top of this.
BouncyCastle's EdDSA implementation consists of
this Ed25519 implementation and a similar implementation for Ed448.
These implementations are designed to run in constant time
(except for clearly marked `Var` subroutines applied to public data).
Looks like another win for EdDSA.
BouncyCastle has also been making some progress
in removing variable-time code elsewhere:
for example, for ECDSA,
BouncyCastle 1.59 removed some variable-time array lookups,
although it is still using the variable-time big-integer library.
Thanks to Peter Dettman for the information.]

More fundamentally,
why didn't BouncyCastle call existing high-speed code in other languages
for arithmetic mod 2^{255}−19, or for the full Ed25519 operations?
Sure, some people might want to use BouncyCastle on platforms
that don't have this code and that don't make it easy to install this code,
but calling the code whenever it *is* available
would give BouncyCastle much better Ed25519 speeds—and,
as a side effect, would stop timing attacks.

The Ed25519 team could have provided a constant-time Java implementation from the outset.
To me this feels like the wrong direction:
every extra piece of crypto code is an extra problem for reviewers and
shouldn't be created in the first place unless there's a clear argument that it's needed.
A few implementations of a few cryptographic primitives have been computer-verified,
and improved tools will cover much more in the foreseeable future,
but verifying a Java implementation
is surely a lower priority than verifying a faster implementation that can be called from Java *and* from many other languages.

I see cryptography as a basic operating-system service, just like access to network cards. We don't, and we shouldn't, write network-card drivers in many different application languages; we write them in a few network-card-driver languages, and on top of this we provide simple networking abstractions to the designers of other languages, who in turn provide various higher-level networking abstractions to applications. There's much more to be said about the design of network-card-driver languages, and about the design of languages for cryptography, but that's a topic for another blog post.

[2022.01.09 update: Updated links above.]