The cr.yp.to blog



2019.10.24: Why EdDSA held up better than ECDSA against Minerva

"Minerva attack can recover private keys from smart cards, cryptographic libraries", says the ZDNet headline. "The Czech team found a problem in the ECDSA and EdDSA algorithms used by the Atmel Toolbox crypto library to sign cryptographic operations on Athena IDProtect cards."

The Minerva research team's web page claims "practical recovery of the long-term private key" from various "implementations of ECDSA/EdDSA in programmable smart cards and cryptographic software libraries". Concretely:

Wait a minute. If this was supposed to be an "ECDSA/EdDSA" attack, a shared "problem in the ECDSA and EdDSA algorithms", then why do these examples say just "ECDSA" and "ECDSA" and "ECDSA"?

Here's one theory you might come up with: "You're being selective in your summary. I looked through the Minerva page the day it was originally announced, and it also claimed to break the EdDSA implementation in libgcrypt."

Facts: The team did make this claim, but withdrew the claim two days later, for reasons that I'll explain below. I'm not being selective in my summary.

Here's another theory you might come up with: "Fundamentally, the attack applies equally to ECDSA and EdDSA. The only reason that there are more ECDSA attack reports is that ECDSA is more widely supported."

Facts: I looked at MatrixSSL, JDK, Crypto++, and wolfSSL/wolfCrypt. I noticed EdDSA (specifically Ed25519) implementations in everything except JDK. The Minerva team does not claim to break the EdDSA implementations in libgcrypt, MatrixSSL, Crypto++, and wolfSSL/wolfCrypt.

The Minerva web page also says that OpenSSL, BouncyCastle, BoringSSL, libtomcrypt, Botan, Microsoft CNG, mbedTLS, Intel IPP-Crypto, and 11 cards were "tested and found not vulnerable". I noticed EdDSA implementations in OpenSSL, BouncyCastle, BoringSSL, libtomcrypt, and Botan. In total, the Minerva team is claiming to break ECDSA in 4 or possibly 5 out of 13 crypto libraries supporting ECDSA, and EdDSA in 0 out of 9 crypto libraries supporting EdDSA. The bottom line is that the attack doesn't apply equally to ECDSA and EdDSA.

I'm not saying that it's impossible to implement EdDSA insecurely. Of course it's possible to implement EdDSA insecurely. Maybe there are people using insecure EdDSA implementations today. But the Minerva data points support the theory that EdDSA implementations are less likely to be broken than ECDSA implementations. This isn't an accident: it's how EdDSA was designed in the first place.

In the rest of this blog post, I'll take a closer look at the information exploited by the Minerva attack, and at the EdDSA implementations mentioned above. Thanks to Tanja Lange for help with this blog post.

Discrete logarithms. In ElGamal's signature system, the signer's public key is gx mod p. Here p is a standard prime, g is a standard generator mod p, and x is the signer's long-term signing key, chosen uniformly at random from {0,1,...,p−2}.

Computing (gx)(p−1)/2 mod p reveals x mod 2: it's 1 if x is even, or p−1 if x is odd. More generally, for any divisor q of p−1, one can check a guess for x mod q as follows: x is congruent to (say) 7 modulo q if and only if (gx)(p−1)/q = (g7)(p−1)/q. This isn't helpful when q is large: any particular guess for x mod q is very unlikely to be correct, and ruling out one guess still leaves many more possibilities.

The Pohlig–Hellman discrete-logarithm algorithm is faster, finding x in only about sqrt(q) operations where q is the largest prime dividing p−1, but let's assume that all prime divisors of (p−1)/2 are large. There are also discrete-logarithm algorithms that take time subexponential in the number of bits of p, but let's assume that p is so large that this isn't a threat. Shor's quantum discrete-logarithm algorithm takes polynomial time, but I'll assume throughout this blog post that the attacker doesn't have a quantum computer. It then seems very difficult to learn anything about x mod (p−1)/2 from gx mod p.

Linear equations in discrete logarithms. The first step in signing a message in ElGamal's system is to choose a "random number k, uniformly between 0 and p−1, such that gcd(k,p−1) = 1." The gcd condition forces k to be odd. We're assuming that all prime divisors of (p−1)/2 are large, so the gcd condition is very unlikely to have any other effect.

ElGamal's signatures reveal the coefficients M,R,S of a linear equation M=xR+kS modulo p−1. This is equivalent to an equation M=xR+kS modulo each of the prime-power divisors of p−1. In particular, M=xR+kS=xR+S modulo 2, revealing x mod 2 if R is odd; but we already knew x mod 2, as explained above.

The rest of the equation is M=xR+kS modulo (p−1)/2, or equivalently M=xR+kS modulo q for each prime-power divisor q of (p−1)/2. This doesn't seem helpful for the attacker, since k mod q could be almost anything. For each prime q dividing (p−1)/2, the attacker knows that k isn't 0 modulo q, i.e., that M isn't xR modulo q, which (assuming R is invertible modulo q) is the same as ruling out M/R as a possibility for x mod q; but, again, ruling out one guess for x mod q still leaves many more possibilities.

Another signature from the same signer reveals another equation M2=xR2+k2S2. ElGamal emphasizes that the signer must choose a new k2 for M2, rather than taking k2=k, since taking k2=k allows the attacker to solve these two linear equations for x. Repeating k is what destroyed the security of the Sony PS3 many years later.

Taking a new k2 seems to eliminate this problem: each possible x is compatible with some solution (k,k2) to the two equations, except for the invertibility issue described above. More generally, ElGamal concludes that these equations for any number of signatures are compatible with a huge number of possibilities for x as long as "each signature uses a different k".

Schnorr's use of a prime-order subgroup. There's a lot to be said about improvements from ElGamal's system to the Schnorr system and EdDSA. One of the improvements, simplifying the security analysis, is to take g as a generator of a subgroup of prime order q rather than as a generator of the entire group of order p−1.

Each Schnorr signature reveals the coefficients M,R,S of a linear equation M=xR+kS modulo q. Almost certainly S is nonzero modulo q, so let's simplify this linear equation to Y=xX+k modulo q, where Y=M/S and X=R/S. Schnorr actually obtains X and Y in a simpler way, eliminating all divisions and in particular eliminating the need to require gcd{k,q}=1. Schnorr takes k as a random number in the set {1,...,q}.

How NSA biased the linear equations in DSA. NSA's original DSA proposal uses prime-order subgroups as in Schnorr's system, revealing an equation Y=xX+k modulo q for each signature. Other DSA details are closer to ElGamal's system, in particular using divisions by k modulo q, so k is specified to be in the set {1,...,q−1}. The same comments apply to the subsequent NIST standards for DSA and ECDSA.

NSA's proposal included a specific prime q, namely 0xb20db0b101df0c6624fc1392ba55f77d577481e5, about 0.7 times 2160; and a specific prime p, which I won't repeat here, and which is around 0.8 times 2512. A subsequent update included another prime q, namely 0xc773218c737ec8ee993b4f2ded30f48edace915f, about 0.8 times 2160; and another 512-bit prime p. All of these primes were generated by hashing in a way that seems to make it very difficult to target particular choices of p and q. (As a side note, hashing was also used to generate shared parameters for various other cryptosystems, but was avoided in NSA's subsequent Dual EC DRBG proposal.) All of these primes are much too small to resist discrete-log attacks, but the problems I'm about to describe also show up for larger DSA parameters.

How do we generate a uniform random element k of {1,...,q−1}? If q is, say, 2160+1 then this is easy: generate 160 random bits, interpret the result as an integer, and add 1. But 2160+1 isn't prime, and, more to the point, isn't either of the random-looking primes q mentioned above.

The literature has many good answers to this question, and then it has NSA's answer, which I'll get to in a moment. One good answer, rejection sampling, is to generate a uniform random 160-bit integer (let's assume q is between 2159 and 2160) and start over if the integer is outside {1,...,q−1}. Another good answer, with more predictable performance, is to generate a uniform random 320-bit integer and reduce that integer modulo q. The resulting distribution isn't exactly uniform but it's so close that you'll never be able to tell the difference. Technically, this doesn't guarantee that k is in {1,...,q−1}, but you'd have to be amazingly unlucky for your 320-bit integer to be a multiple of q.

NSA's answer is to take a much shorter integer, a 160-bit (not 320-bit) hash of some new secret, and reduce the hash modulo q. Presumably the 160-bit hash is hard to distinguish from a uniform random element of {0,...,2160−1}, but the hash modulo q looks quite different from a uniform random element of {1,...,q−1}. For example, if q is NSA's 0xb20db0b101df0c6624fc1392ba55f77d577481e5, then any integer up through 0x4df24f4efe20f399db03ec6d45aa0882a88b7e1a is hit by two elements of {0,...,2160−1}, while any larger integer is hit by only one element of {0,...,2160−1}.

Interlude: Understanding the importance of distributions for security. Attack algorithms often work better for some inputs than for others. For a proper security analysis one has to be clear about the distribution of inputs. Changing the distribution can increase or decrease security levels, and can easily make the difference between a broken system and an unbroken system.

Here's a non-DSA example. How difficult is it, by known attacks, to find all of the prime factors of an integer N in {1,...,22048}? This question is incomplete, because the distribution of inputs N is unspecified. Here are four possibilities for the distribution:

These examples show how unsafe it can be to assume that different distributions have the same security level. Comparing the first, second, and third cases shows that reducing randomness can improve security. Comparing the second and third cases to the fourth case shows that reducing randomness can damage security.

Note that these examples rely critically on understanding which inputs are broken by various factorization algorithms. Don't believe people who try to replace serious algorithm analysis with some superficial rule that more randomness is always better. See also Section 11.1 of 2008 Koblitz–Koblitz–Menezes for more subtle examples.

Security analysis of DSA biases, part 1: lattice attacks. Back to DSA. The security of a single signature, and in particular the amount of information revealed by the equation Y=xX+k modulo q, can't be affected much by NSA's change in the distribution of k. An attack that works with a particular probability against NSA's uniform random 160-bit k also has at least half as much probability of working against ElGamal's uniform random nonzero k modulo q. The point is that NSA produces any particular integer modulo q with probability at most 1/2159 (assuming q is at least 2159), while ElGamal produces the same integer with probability 1/(q−1), which is at least 1/2160 (and even somewhat larger than this for the choices of q mentioned above). Whichever NSA choices of k are affected by the attack have at least half as much chance of appearing in ElGamal's choices of k. (This is an example of a "divergence argument".)

The same argument also says that NSA can't lose more than 2 bits of security for 2 signatures, more than 3 bits of security for 3 signatures, etc. But this is a useless guarantee for a signer generating, say, 1000 signatures.

ElGamal went beyond a single equation and considered the whole system of equations revealed by the signer: Y1=xX1+k1 modulo q, Y2=xX2+k2 modulo q, Y3=xX3+k3 modulo q, etc. Any particular k from NSA has nearly as much chance of having been generated by ElGamal, but the whole sequence (k1,k2,k3,...) from NSA is much less likely to be generated by ElGamal. (The "divergence" grows rapidly with the number of signatures.) What effect does this gap have on security?

What we know about NSA's (k1,k2,k3,...) is that each kj favors elements of {0,...,2160−1−q} over elements of {2160−q,...,2160−1}. Let's consider a simpler and larger bias, where each kj is guaranteed to be in a smaller range {0,...,2b−1}. The attacker knows that Yj=xXj+kj modulo q, i.e., that the vector (Y1,Y2,Y3,...) is the surprisingly short vector (k1,k2,k3,...) plus x(X1,X2,X3,...) plus a multiple of q in each coordinate. In other words, (Y1,Y2,Y3,...) is surprisingly close to the public lattice consisting of all multiples of (X1,X2,X3,...) plus all multiples of q in each coordinate. Finding the lattice vector surprisingly close to (Y1,Y2,Y3,...) immediately reveals x.

Howgrave-Graham and Smart introduced this lattice view of biased DSA in 1999, and reported being able to quickly find x from "only 30 signed messages" with 152-bit kj, using a simple algorithm to find the close lattice vector. Howgrave-Graham and Smart also pointed out that there's nothing special about having the kj range start from 0. For example, if one knows that k1 is in the interval {37*2152,...,38*2152−1}, then one can subtract 37*2152 from Y1 and from k1, reducing to the case that k1 is in the interval {0,...,2152−1}. Or one can reduce to the case that k1 is in the balanced interval {−2151,...,2151−1}, which is even better for the lattice attacks.

"It is important that no bits of the ephemeral keys are leaked for whatever reason", Howgrave-Graham and Smart wrote. If you assume that "important" means "important for security against known attacks", then you can object that Howgrave-Graham and Smart were overstating their results. They had reported an attack using 8 bits of each kj; this does not imply an attack using, say, 4 bits of each kj. Followup papers plugged fancier lattice algorithms into the same ideas, showing that an attacker knowing 4 bits of each kj could quickly find x, but you can still object that this doesn't apply to NSA's leak of roughly 0.6 bits of information about each kj.

On the other hand, as a cryptosystem designer, I don't want security reviewers to have to spend time analyzing fancy attacks that I could have eliminated by tweaking my design. Security review is the best protection that users have against attacks; security failures come primarily from having security review spread too thin. From this perspective, it is important to eliminate the leak of information about k.

Security analysis of DSA biases, part 2: Bleichenbacher's attack. In October 2001, NIST issued a new DSA standard with a change notice saying that an "unpublished attack" had broken NSA's biased DSA randomness using a "workfactor of 264" and "222 known signatures". This statement doesn't make clear what the units are for measuring this "workfactor", i.e., how complicated each of the 264 steps is. The statement doesn't even make clear whether the attack is for NSA's 512-bit primes (which were already known to allow feasible discrete-logarithm attacks), or for 1024-bit primes, or both.

Earlier news had attributed the attack to Bleichenbacher, and had quoted "Edward Roback, chief of the Computer Security Division in NIST's Information Technology Laboratory" as saying that "those who are using DSA can continue to use it with confidence that DSA signatures done under the present standard will remain secure for many more years". I'm not aware of any publications explaining why NIST thought that attackers couldn't reach a "workfactor of 264".

Bleichenbacher privately reported his attack to an IEEE working group in November 2000, but as far as I know didn't immediately post the slides or any other attack details. A small amount of information about the attack appears in Section 5 of a 2001 report by Vaudenay. Bleichenbacher's slides (TIFF; here's a PDF version) were posted in late 2004. The first serious writeup I saw was a 2013 paper by De Mulder, Hutter, Marson, and Pearson.

Here's a concrete example to illustrate some of the techniques. Bleichenbacher numerically measures the bias of k as the average over k of the complex number exp(2πik/q). If k were uniformly distributed modulo q then this complex number would be equally spaced around the unit circle in the complex plane and the average would be 0. However, k isn't uniformly distributed modulo q. Let's take NSA's original q, around 0.7 times 2160; a short calculation shows that the bias of k is around 0.05+0.21i.

If k1 and k2 are independent identically distributed variables then the bias of k1+k2 is the average of exp(2πi(k1+k2)/q) = exp(2πik1/q)exp(2πik2/q), which factors as the bias of k1 times the bias of k2. Similarly, the bias of k1−k2 is the bias of k1 times the complex conjugate of the bias of k2.

Consider many equations Yj=xXj+kj modulo q from, say, 240 signatures under one key. Assume that each Xj is reduced modulo q to the range {0,1,...,q−1}. Sort the equations into increasing order of Xj. Consider a difference of two adjacent equations: Yj+1−Yj=x(Xj+1−Xj)+(kj+1−kj). Presumably each difference Xj+1−Xj will be on the scale of q/240; a more precise analysis would look at the distribution of these differences. Meanwhile the difference kj+1−kj would have bias (0.048+0.21i)(0.048−0.21i), around 0.047, if kj+1 and kj were independent; presumably the sorting of Xj doesn't noticeably change this bias.

Repeat this sort-and-difference process three times, reducing each X first to about 120 bits with k bias around 0.047, then to about 80 bits with k bias around 0.0472, then to about 40 bits with k bias around 0.0474, i.e., around 2−17.6. Write the resulting equations as Y'j=xX'j+k'j, where each X'j has about 40 bits, and each k'j has bias around 2−17.6.

Say we have a rough guess G for x, and we compute the average A(G) of exp(2πi(Y'j−GX'j)/q), i.e., the average of exp(2πi(x−G)X'j/q) exp(2πik'j/q). For most choices of G, this is an average of 240 random-looking points on the unit circle, so presumably it has size around 2−20. However, if G is within 2115 of x then the differences (x−G)X'j are bounded by 2155 (this is where the smallness of X'j is important), so exp(2πi(x−G)X'j/q) is close to 1, so presumably A(G) is close to the bias of k'j, around 2−17.6, which is noticeably bigger than 2−20.

Now try, e.g., G=0, G=q/245, G=2q/245, G=3q/245, etc. It's easy to use a fast Fourier transform to compute A(G) simultaneously for all such G with only about 250 multiplications of complex numbers. Hopefully the largest A(G) value will come from some G close to x, revealing the top bits of x, and then one can repeat to find the remaining bits of x. On the other hand, maybe 2−20 is too much noise compared to 2−17.6. One can try several of the largest A(G) values, or reduce the noise by using more equations, as I'll now explain.

How can we generate more equations without starting from more signatures? One approach is to go beyond adjacent differences Xj+1−Xj, also considering Xj+2−Xj and Xj+3−Xj and so on up to some size limit. Presumably there are about 250 differences below, say, q/230, and then about 250 differences below q/280 at the next level, and then about 250 differences below q/2130 at the next level. It's tricky to analyze the randomness of lists that include, e.g., the difference X1−X0 and the difference X2−X1 and the difference X2−X0, which is the sum of X1−X0 and X2−X1; one should carry out experiments to see how well the algorithm actually works. (One should carry out experiments even for algorithms where someone claims to have a complete analysis.)

Another approach is to start with only 234 signatures and consider all sums Xj+Xi mod q. Presumably there will be almost 267 different sums, and more than 250 cases where two sums are within q/280 of each other, and more than 250 differences below q/2130 at the next level. As Bleichenbacher notes in his slides, there are standard techniques to find nearby sums with far less memory than storing and sorting all the sums. (For each j, build a low-memory generator that outputs Xj+Xi for all i in increasing order, and then merge the outputs of these generators. Bleichenbacher credits 1981 Schroeppel–Shamir. My sortedsums paper cites a 1973 Knuth exercise crediting William S. Brown. It wouldn't be surprising if there are even earlier references.)

Or, starting with fewer signatures, we could consider all sums of four X values, and directly look for 250 sums within q/2130 of each other. This is essentially the same as looking for 250 sums that collide on the top 130 bits. If you think it's obvious how quickly such searches can be carried out, then try to explain why the 2010 Howgrave-Graham–Joux subset-sum algorithm has exponent around 0.337n, and why the 2011 Becker–Coron–Joux subset-sum algorithm has exponent around 0.291n, and why the 2019 Esser–May subset-sum algorithm has exponent around 0.255n.

An even larger cloud of attacks. Back in October 2000, a month before Bleichenbacher's IEEE presentation, Blum, Kalai, and Wasserman (BKW) had introduced an algorithm to find a secret vector x of integers modulo 2, given noisy dot products xX1+k1, xX2+k2, etc. Each Xj is a random public vector of the same length as x, and each kj is a bit set with some specified probability smaller than 1/2.

The BKW algorithm sorts Xj according to their leading bits, and then computes successive differences Xj+1−Xj, obtaining noisier dot products x(Xj+1−Xj)+(kj+1−kj). The BKW algorithm repeats this sort-and-difference process to clear more and more bits of X, at the expense of more noise. Eventually it finds many noisy dot products x(0,0,...,0,1)+k'j, takes the majority vote of these dot products to see the last bit of x, and similarly finds the other bits of x.

The problem considered by BKW, where x is a vector modulo 2 and some ki bits are set, is called the "learning parity with noise" (LPN) problem. In the "learning with errors" (LWE) problem, x is a vector modulo q and each ki is a small integer. LWE for length-1 vectors, the target of Bleichenbacher's attack, is called the "hidden-number problem" (HNP), an example of how scientific terminology can be selected to minimize the amount of information communicated to the uninitiated.

Followups to the BKW paper adapted the BKW idea to the LWE problem and to various lattice problems, and eventually added further speedup ideas that Bleichenbacher had used, such as fast Fourier transforms ("fast Walsh transforms" in the mod-2 case) and non-adjacent differences. An analysis in 2015 by Herold, Kirshanova, and May showed that one of the latest BKW variants sometimes had better asymptotics than basis reduction as an attack against reasonable-looking LWE parameters for lattice-based cryptography; look at Figure 1 in the paper to get a quick idea of how close the competition is. Papers are continuing to appear with basis-reduction speedups and with speedups for various BKW/Bleichenbacher-type algorithms. The literature is complicated, and is continuing to become more complicated.

As a cryptosystem designer, I'm much happier with the literature on conservative prime-field ECC. There's a much simpler state-of-the-art attack algorithm that includes all known optimizations. There's a convincing high-precision analysis of the performance of the algorithm. The literature has clear explanations of quantitative obstacles to other attack ideas such as pairings and decomposition, and has no ideas for getting around these obstacles. But NSA's DSA biases spoil the simplicity of the attack picture, forcing the security reviewer to consider a much more complicated collection of worrisome attack strategies.

Overconfidence in limits on the number of signatures. NIST's 2001 change notice for the DSA standard provided two options for users to "defend" against Bleichenbacher's attack. The first option was for users to continue using NSA's biased DSA randomness but sign at most 221 signatures per key, half of the "222 known signatures" used by the "unpublished attack".

Why did NIST think that a limit of 221 signatures would be safe? Was there some quantitative analysis saying that reducing 222 signatures to 221 signatures would increase the attack cost to something clearly infeasible? And what made NIST confident that a new "unpublished attack" was optimal?

Bleichenbacher needs enough equations to compensate for the noise. Sorting equations and then subtracting adjacent equations, as in the BKW paper, doesn't expand the number of equations available. But Bleichenbacher's slides were explicitly generating far more equations as sums of small non-adjacent subsets of the original equations. I don't understand how anyone could have been confident regarding the tradeoffs here, even before seeing this decade's breakthroughs in subset-sum algorithms.

Here are further examples to illustrate the lack of maturity of the analysis of these tradeoffs. In 2012, Damgård and Park proposed some cryptosystems based on the supposed hardness of the following example of LPN. There is a secret n-bit vector x, where n is 768. The attacker sees a limited number of noisy dot products xX+k, where X is public and the noise k is 1 with probability 1.5%. The limit is a small constant times n, where the constant depends on the cryptosystem variant. The original version of the paper stated that BKW needs as input a gigantic number of "samples" xX+k, while, for the paper's cryptosystems, "the adversary only gets O(n) samples, which is the hardest case, so this suggests that relatively small values of n might be sufficient for security". The paper was then revised to account for "recent attacks on LPN that invalidated the conclusions of the previous version". See, e.g., my Ring-LPN attack paper with Lange.

In the LWE context, a 2014 paper by Albrecht, Faugère, Fitzpatrick, and Perret stated that "one of the main remaining obstacles for applying the BKW algorithm to cryptographic constructions based on LWE is that it requires an unbounded number of samples to proceed. Lifting this requirement, if only heuristically, is hence a pressing research question." Followup papers then pointed out that one could use far fewer samples.

Safe fixes for DSA. Instead of trying to pretend that the limits of these attacks are clear, let's eliminate the applicability of the attacks to DSA, by fixing the problem at its source. The problem is that—for NSA's choices of q around, say, 0.7 or 0.8 times 2160—taking 160 bits of randomness modulo q produces a biased result.

If we modify q, moving it extremely close to 2160 or to 2159, then the bias rapidly disappears. This is also naturally accomplished by ECDSA if we choose a prime p extremely close to 2160 (rather than, say, the Brainpool "random" p) and choose an elliptic curve with a power-of-2 cofactor such as 1 or 2 or 4 or 8: q will then be extremely close to 2160 or 2159 or 2158 or 2157 respectively.

The bias also rapidly disappears if we take more bits of randomness. As I mentioned above, if q is between 2159 and 2160, then you'll never be able to tell the difference between a uniform random 320-bit integer mod q and a uniform random element of {0,...,q−1}. This was NIST's second option for users to protect themselves: concatenate two 160-bit hashes to obtain a 320-bit integer.

Ed25519 has both of these defenses. It uses Curve25519, with a subgroup of prime order extremely close to 2252. It specifies k as a deterministic, easily testable SHA-512 hash of a secret and the message being signed. Of course, Ed25519 also moves from NSA's dangerously small cryptosystem sizes up to a safe security level.

Timing attacks against DSA. The reason this isn't the end of the story is that mathematical biases in the construction of the secret k aren't the only way to leak bits of k. For example, a 2011 paper by Brumley and Tuveri steals "the private key of a TLS server where the server authenticates with ECDSA signatures". The attack was demonstrated against OpenSSL using the NIST B-163 curve. The choice of q here isn't the issue; q is extremely close to a power of 2.

The paper title is "Remote timing attacks are still practical". OpenSSL computed the kth multiple of an elliptic-curve point with an algorithm whose time revealed the position of the top bit in k: the algorithm took the maximum time if k was between 2b−1 and 2b, less time if k was between 2b−2 and 2b−1, etc. Sometimes timings would show, e.g., that k is below 2b−8, putting the attacker into the situation of the fast Howgrave-Graham–Smart attack.

This was just one in a long line of papers using timing attacks to break DSA, ECDSA, etc. There are, for example, papers extracting keys more efficiently from timing information, and there are papers collecting such precise timing information that key extraction becomes trivial, and now there's Minerva applying timing attacks to a remarkable number of different targets.

How to stop timing attacks. A programmer who knows enough about CPU timing can systematically write software that avoids all data flow from secrets to timings. For example, all of the C and assembly implementations from the Ed25519 team are written in this way, as are the NaCl and BearSSL crypto libraries.

(One can also randomize computations to try to obscure the information leaked through timings: e.g., add a random 512-bit multiple of q to k before scalar multiplication. However, it's very difficult for the implementor, and for reviewers, to figure out how secure randomization is against timing attacks. A better argument for randomization is that attackers sometimes have access to more invasive side channels such as power measurements: truly constant power seems very difficult to achieve, and randomization seems to be our best bet. I advise against randomization as a replacement for constant-time software, but I don't object to randomization as an extra line of defense whenever the user can afford it.)

Saying that implementors can write constant-time software doesn't mean that they do write constant-time software. I see three basic reasons for this. First, there's a huge amount of variable-time software in existing crypto libraries, and getting rid of all of this software is a correspondingly huge effort. Second, many popular cryptographic functions produce unacceptable slowdowns when they're implemented in constant time. Third, and most importantly, many popular cryptographic functions are much simpler to implement in variable time.

I noticed many years ago that the evolution of computer architecture was making it easier and easier to resolve the tension between security and speed: there's a synergy between what I want to do to avoid timing leaks and what chip designers want to do for performance. I've set various cryptographic speed records using constant-time software. Sometimes there are still some minor speedups possible with variable-time software, but the tension is much smaller than it was, making it hard for people to argue that these speedups should outweigh security—especially since computation in general is becoming less and less expensive. I'm not saying that all cryptographic designs have such small penalties for constant-time implementations; having the freedom to throw away NSA's designs has been an important part of this work.

I've also tried to resolve the tension between security and simplicity. Software for X25519, Ed25519, etc. does everything the typical user needs, is much simpler than traditional software aiming at the same security goals, and would gain very little in simplicity from using variable-time algorithms.

But now consider a crypto library that for years has included variable-time software for NSA's elliptic-curve cryptosystems, and that wants to keep supporting those cryptosystems for whatever reason, but that also wants to add support for Ed25519. Different notions of simplicity then produce different results. It's simple for the library to copy an existing constant-time Ed25519 implementation; this implementation might be close to the simplest possible code for Ed25519; but the library author might think that there are simpler ways to simultaneously support Ed25519 and NSA's systems.

Case study #1: libgcrypt. In libgcrypt, there's a function _gcry_mpi_ec_mul_point that computes kP on an elliptic curve E, given k and P and E as inputs. The main loop inside the function has length mpi_get_nbits(k), a simple example of what one must not do in constant-time software. Actually, the function has different loops for different types of curves, and a complete description of the loop length is more complicated, but what matters is that a smaller k runs measurably faster.

The real fix, the constant-time approach, would start by changing the interface to replace mpi_get_nbits(k) with a maxscalarbits specified by the caller. But this would require going through dozens of functions that call _gcry_mpi_ec_mul_point and figuring out the appropriate maxscalarbits for each. This is an example of tension between simplicity and security. Even a small tension can push implementors to do something insecure. The tension increases as more and more elliptic-curve protocols are implemented on top of the existing _gcry_mpi_ec_mul_point interface. It's not surprising that libgcrypt still has the old variable-time interface.

At this point one might think that, well, libgcrypt is leaking the number of bits of k, so the standard lattice attacks apply, whether the protocol is ElGamal or Schnorr or DSA or ECDSA or EdDSA. This is also what Minerva claimed at first. But let's take a closer look at how large k is.

The DSA and ECDSA standards say that k is between 0 and q. By now there are several different standard methods for generating k. libgcrypt's _gcry_dsa_gen_k uses one of these methods, namely rejection sampling, which I described above.

EdDSA uses a different name "r" for k, and says that "r" is SHA-512 applied to various specified inputs. This is what libgcrypt implements: hashing the inputs with _gcry_md_hash_buffers, then reading the 64-byte result into a multiprecision integer r with _gcry_mpi_set_buffer, then calling _gcry_mpi_ec_mul_point.

Now _gcry_mpi_ec_mul_point leaks whether "r" has 512 bits, 511 bits, 510 bits, etc. But this is useless for the attacker. Say "r" has only 500 bits; the attacker still has no idea what "r" mod q might be. This is why Minerva works against libgcrypt's implementation of ECDSA and doesn't work against libgcrypt's implementation of EdDSA.

The situation would have been different if libgcrypt had inserted an extra line of code to reduce modulo q before calling _gcry_mpi_ec_mul_point. But this wouldn't have been as simple, so it's not such a surprise that this didn't happen. (The extra reduction work would have produced a speedup, but copying existing constant-time Ed25519 software would have produced a much larger speedup.)

I described this scenario in a CFRG message five years ago:

The best protection we know is to generate much longer nonces—such as the 512-bit nonces in Ed25519. Then the system isn't broken by a timing attack revealing the nonce length. Setting the high bit of such a long nonce is also perfectly safe.

Of course, an implementor can still get in trouble by (1) reducing these nonces and then (2) leaking the lengths of the reduced nonces through a variable-time scalarmult method. So, as another line of defense, we also choose curves that support simple, fast, constant-time ladders.

libgcrypt is in the scenario of the first paragraph: the libgcrypt timing reveals the number of leading zeros in Ed25519's 512-bit hash, but this doesn't break the system. libgcrypt doesn't reduce the nonces, so it doesn't get into the trouble described in the second paragraph.

There have been many other timing-attack vulnerabilities in libgcrypt, and clearly there will be more. We have to throw away variable-time crypto code without waiting for attacks to be demonstrated. For example, libgcrypt should use the constant-time ladders supported by Curve25519. But this doesn't mean that Minerva broke libgcrypt's Ed25519 implementation: libgcrypt was saved by another Ed25519 feature, the double-size hash.

Case studies #2, #3, #4, #5, #6, #7, and #8. A quick skim suggests that BoringSSL, Botan, Crypto++, libtomcrypt, MatrixSSL, OpenSSL, and wolfSSL/wolfCrypt all started with upstream Ed25519 implementations that advertise being constant-time, such as ref10 from the Ed25519 team, donna from Adam Langley and Andrew Moon, tweetnacl from the TweetNaCl team, and fiat from the Fiat Crypto team.

Of course, a quick skim isn't the same as a serious security review. The sheer volume of code is horrifying for the reviewer. Unlike the upstream implementations, these seven crypto libraries don't declare time variability to be a bug (unless the bug is shown to be exploitable and, in the case of OpenSSL, remotely exploitable), so a reviewer who finds time variability isn't automatically rewarded with an acknowledgment of a valid bug report. The libraries have so much existing variable-time code that they're scared to declare time variability to be a bug.

So I'm certainly not guaranteeing that timing attacks will fail against all these Ed25519 implementations. But, given how much of the code in these Ed25519 implementations is coming from people who say they were engineering it to be immune to all timing attacks, it wouldn't be surprising if it is immune to all timing attacks.

Does the prevalence of constant-time implementations of a primitive mean that the primitive encourages such implementations? Superficial answer: Yes. New implementations can simply copy the existing constant-time implementations. This is what these seven libraries did, protecting the users. (But, again, one has to verify that the upstream implementations really are constant-time, and that this wasn't screwed up by whatever modifications happened as part of the library integration.)

Deeper answer: Not necessarily. Maybe the primitive is painful to implement in constant time. Maybe the first implementors suffered through doing this, but new implementations will cut corners.

In the case of Ed25519, it's clear from a detailed design comparison that Ed25519 is much friendlier to constant-time implementations than NSA ECC is. This doesn't guarantee constant-time implementations, but it does make them more likely.

Case study #9. Finally, I looked briefly at BouncyCastle, which wrote everything from scratch in Java. BouncyCastle is full of variable-time code. I don't immediately see why BouncyCastle wasn't broken by Minerva. BouncyCastle uses scalars smaller than q for scalar multiplication for ECDSA and EdDSA, unlike libgcrypt, so BouncyCastle isn't protected by the double-size hash.

If BouncyCastle's implementations of ECDSA and EdDSA are broken by timing attacks then the tallies turn into 13 crypto libraries supporting ECDSA with timing attacks on 5 or possibly 6, and 9 crypto libraries supporting EdDSA with timing attacks on just 1. This still sounds like a big win for EdDSA overall, but it's interesting to contemplate ways that we could have saved BouncyCastle too.

Why didn't BouncyCastle use a simple constant-time ladder? This is clear from the code: BouncyCastle has many complications that are aiming for speed at the expense of simplicity. In particular, BouncyCastle sacrifices a small amount of code simplicity to reduce k mod q, and sacrifices even more code simplicity to use precomputed tables of various multiples of the fixed base point P. These steps obviously make signatures several times faster.

Precomputed tables are still compatible with constant-time code, as shown in previous Ed25519 implementations. Why didn't BouncyCastle use this approach? This isn't as clear. Variable-time fixed-base scalar multiplication is at best marginally faster than constant-time fixed-base scalar multiplication. Perhaps there haven't been enough tutorials showing how to implement constant-time fixed-base scalar multiplication.

We don't want reviewers to have to analyze the security consequences of any timing leaks, so, at a lower level, it's also important to perform field arithmetic in constant time. BouncyCastle doesn't do this: it implements prime-field arithmetic on top of a big-integer library that computes the size of each integer at run time. But why didn't BouncyCastle copy constant-time arithmetic mod 2255−19 from previous implementations, as a fast path for Ed25519 separate from the existing big-integer code?

One possible answer is along the following lines: "A crypto library should support 82 elliptic curves but, for simplicity, shouldn't separately implement 82 finite fields." I would say that, for simplicity and security, a crypto library should not support 82 elliptic curves. I think we've made some progress in convincing the community of this. There has also been considerable progress in building automatic generators for finite-field code, although there is still much more work to be done.

[2019.11.01 correction: BouncyCastle does have a separate implementation of arithmetic modulo 2255−19, and higher-level implementations of X25519 and Ed25519 on top of this. BouncyCastle's EdDSA implementation consists of this Ed25519 implementation and a similar implementation for Ed448. These implementations are designed to run in constant time (except for clearly marked Var subroutines applied to public data). Looks like another win for EdDSA. BouncyCastle has also been making some progress in removing variable-time code elsewhere: for example, for ECDSA, BouncyCastle 1.59 removed some variable-time array lookups, although it is still using the variable-time big-integer library. Thanks to Peter Dettman for the information.]

More fundamentally, why didn't BouncyCastle call existing high-speed code in other languages for arithmetic mod 2255−19, or for the full Ed25519 operations? Sure, some people might want to use BouncyCastle on platforms that don't have this code and that don't make it easy to install this code, but calling the code whenever it is available would give BouncyCastle much better Ed25519 speeds—and, as a side effect, would stop timing attacks.

The Ed25519 team could have provided a constant-time Java implementation from the outset. To me this feels like the wrong direction: every extra piece of crypto code is an extra problem for reviewers and shouldn't be created in the first place unless there's a clear argument that it's needed. A few implementations of a few cryptographic primitives have been computer-verified, and improved tools will cover much more in the foreseeable future, but verifying a Java implementation is surely a lower priority than verifying a faster implementation that can be called from Java and from many other languages.

I see cryptography as a basic operating-system service, just like access to network cards. We don't, and we shouldn't, write network-card drivers in many different application languages; we write them in a few network-card-driver languages, and on top of this we provide simple networking abstractions to the designers of other languages, who in turn provide various higher-level networking abstractions to applications. There's much more to be said about the design of network-card-driver languages, and about the design of languages for cryptography, but that's a topic for another blog post.


Version: This is version 2019.11.01 of the 20191024-eddsa.html web page.