[In answer to: "Is the NSA aware of any weaknesses in the DES or the DSS? The RSA?"] We are unaware of any weaknesses in the DES or the DSS when properly implemented and used for the purposes for which they both are designed.—National Security Agency, June 1992

Earlier this month a new paper by Naomi Benger, Joop van de Pol, Nigel Smart, and Yuval Yarom hit the news. The paper explains how to recover secret keys from OpenSSL's implementation of ECDSA-secp256k1 using timing information from "as little as 200 signatures"; ECDSA-secp256k1 is the signature system used by Bitcoin. The timing information is collected by an attack process running on the same machine, but the process doesn't need any privileges; I don't see any obstacle to running the attack process in a separate virtual machine. Earlier papers by Yarom and Katrina Falkner and Yarom and Benger had explained how to carry out similarly efficient attacks against various implementations of RSA and binary-field ECDSA.

These attacks are what I call "cache-timing attacks": they exploit data flow

- from secrets to load/store addresses and
- from load/store addresses to attacker-visible differences in timing between different addresses.

For comparison, conventional timing attacks exploit data flow

- from secrets to the program counter (i.e., the instruction address as a function of time) and
- from the program counter to attacker-visible differences in timing between different instruction addresses.

In both cases the second part of the data flow is built into chips, but the first part is built into the software.

Did the software designers have to allow data flow from secrets to addresses?
"Obviously not!" say the theoreticians.
"Everybody knows that *any* computation
using branches and random access to data
can be efficiently simulated
by a computation that accesses only
a predefined public sequence of instructions
and a predefined public sequence of memory locations.
Didn't you take a course in computational complexity theory?
If the software designers had done a better job
then this attack would never have worked."

I have a different view. I blame this attack on the ECDSA designers. Every natural implementation of ECDSA makes heavy use of secret branches and secret array indices. Eliminating these secrets makes the code much more complicated and much slower. (The theoreticians are blind to these problems: their notion of "efficient" uses an oversimplified cost metric.) The ECDSA designers are practically begging the implementors to create variable-time software, so it's not a surprise that the implementors oblige.

Do elliptic-curve signature systems have to be designed this way? No, they don't. There are other elliptic-curve signature systems that are much easier to implement in constant time.

The broader picture is that elliptic-curve signature systems vary considerably in simplicity, speed, and security. ECDSA does a bad job on all three axes. It's unnecessarily difficult to implement correctly. It's full of unnecessary traps that can compromise security. It's unnecessarily slow no matter how it's implemented, and it's even slower if it's implemented securely. These performance problems encourage implementors to downgrade to breakable security levels, or to disable signatures entirely.

In this blog post I'll review the choices made in designing elliptic-curve signature systems, and explain the consequences of those choices. This blog post turned out to be longer than I was expecting at first, because there are so many different things that ECDSA does wrong.

Thanks to Tanja Lange for help with this blog post.

Yes.
A few years ago a team of cryptographers (including me)
designed and implemented
Ed25519,
a state-of-the-art high-security elliptic-curve signature system.
Ed25519
fits signatures into 64 bytes;
fits public keys into 32 bytes;
verifies more than 18000 signatures per second
on a three-year-old Intel laptop (2-core 2.1GHz Core i3-2310M);
signs even more messages per second;
and takes negligible time to generate a new key.
These performance results are from software with
*no* data flow from secrets to branch conditions and
*no* data flow from secrets to load/store addresses.

The numbers here come from
eBACS,
a cryptographic benchmarking site that Lange and I run;
specifically, from the
signature results
for a computer named `bridge`.
The signing and verification benchmarks are for short messages.
For very long messages the bottleneck is cryptographic hashing,
but of course it's relatively easy for hashing to keep up with the network.
All of the benchmarked implementations and benchmarking tools
are publicly available,
so it's easy for anyone to check the numbers.

RSA provides even faster signature verification than ECC. But RSA has much slower signing, much slower key generation, much larger signatures, and much larger public keys. It's hard to find applications where this is a sensible tradeoff, and it's easy to find applications where RSA's poor performance profile has compromised security. RSA also has many implementation traps, and even a state-of-the-art RSA implementation is more worrisome from a security perspective than ECC is, but that's a topic for another blog post.

Once upon a time I thought
that elliptic-curve signature verification
was too slow for some verification-intensive applications.
The actual story is that typical *implementations of ECDSA*
are too slow for some applications;
but, again, ECDSA is very far from a state-of-the-art elliptic-curve signature system.

I have a
survey paper
that explains what you should do if you're considering RSA signatures.
In particular,
you should *not*
use the original system proposed by Ron Rivest, Adi Shamir, and Len Adleman;
there were many important improvements since then,
notably in a
followup paper
by Michael Rabin.
Some of these improvements
(such as small exponents, which are critical for fast signature verification)
are already included in a typical RSA description,
but most of them aren't.
Overall today's use of RSA is even farther from the state of the art
than today's use of ECC.

The history began with a signature system proposed by Taher ElGamal in 1985, and continued with a much better signature system proposed by Claus Schnorr in 1990. ECDSA was later than Schnorr but is essentially ElGamal's system, ignoring most of Schnorr's improvements. EdDSA and Ed25519 are from 2011 and use Schnorr's improvements, together with some new improvements.

The following description starts with ElGamal signatures and then makes a series of small changes to reach Schnorr signatures. [This classification of changes comes from a talk "The security impact of a new cryptographic library" that Lange and I presented to Qualcomm's security group. The description here is more detailed than our slides.]

**The ElGamal signature system: B ^{H(M)} = A^{R} R^{S}.**
ElGamal didn't use elliptic curves;
he used multiplication modulo a prime number.
Saying that (R,S) is an ElGamal signature of a message M under public key A
means that R and S are integers between 0 and p−2
and that B

**Hashing the exponent: B ^{H(M)} = A^{H(R)} R^{S}.**
ElGamal's system still works
if the exponent R is replaced by a hash of R.
This is cheap (hashing is fast)
and is a serious obstacle to any attack strategies
that rely on choosing a particular exponent of A.

**Prime-order subgroups: B ^{H(M)} = A^{H(R)} R^{S}.**
Same equation,
but now S is between 0 and q−1,
where q is a prime divisor of p−1,
and B has order q.
This simplifies the security analysis.
This also saves space
(for example, 2

**Elliptic curves: H(M)B = H(R)A + SR.**
At this point it's also easy to state an elliptic-curve signature system.
ECC replaces multiplication mod p
with addition on an elliptic curve.
A, B, R are now elliptic-curve points,
while H(M), H(R), S are integers.
I'll continue describing multiplicative systems,
but I'll also mention the ECC versions in parallel.

**Two exponents: B ^{H(M)/H(R)} = A R^{S/H(R)}; (H(M)/H(R))B = A + (S/H(R))B.**
It's safe to assume that nobody will ever generate an H(R) divisible by q,
the prime order of the standard base B.
The verifier divides the three exponents H(M),H(R),S
by H(R) modulo q, obtaining exponents H(M)/H(R),1,S/H(R);
the cost of these divisions is outweighed by the speedup in exponentiation.
This change can't have lost any security:
if B

**Relabel: B ^{H(M)/H(R)} = A R^{S}; (H(M)/H(R))B = A + SR.**
Since the verifier is going to use S/H(R),
the signer might as well send S/H(R) rather than sending S,
and we can simplify the description by renaming S/H(R) as S.
This change can't have lost any security:
any forgery (R,S) against this system
implies a forgery (R,S H(R)) against the previous system.

**Merge the hashes: B ^{H(R,M)} = A R^{S}; H(R,M)B = A + SR.**
Both the signer and the verifier are using H(M)/H(R) as a black-box function of (R,M).
There's no reason for the division structure;
we can simply hash R and M together,
creating an obstacle for any attacks
that were relying on the separation between H(M) and H(R).

In particular, imagine that the attacker manages to compute an innocent-looking message M and a dangerous message M' with H(M) = H(M'), and manages to convince the signer to sign M. In the previous system, the attacker has now succeeded at a forgery: the same (R,S) is a valid signature of M'. This doesn't work in the new system: H(R,M) won't be the same as H(R,M'). In other words, this change has made the system resilient to collisions.

Of course, the attacker could have taken a legitimate signature (R,S) of M and tried to find M' with H(R,M) = H(R,M'), but this attack requires "second preimages" of H rather than collisions. The number of targets M is now limited by the number of legitimate signatures performed, whereas previously the number of targets was limited only by the attacker's computation.

Standard practice is to throw away cryptographic hash functions for which collisions have been found (e.g., MD5), or for which collisions can be found by a feasible computation (e.g., SHA-1). If collisions are never found then collision resilience has no obvious benefit. But it's still comforting to have this extra level of protection. Sometimes extra protections have extra costs, raising difficult questions of which protections are most valuable given the user's performance budget, but in this case the extra protection didn't cost anything: in fact, it made the system simpler and faster.

**Eliminate divisions: B ^{S} = R A^{H(R,M)}; SB = R + H(R,M)A.**
In all of the previous systems,
the signer's computation of S
involved divisions modulo q.
Fast "extended Euclid" divisions
are a standard feature of big-number libraries,
but these divisions take

It's normal to simplify and accelerate modular arithmetic
by choosing "nice" primes p, such as 2^{255}−19 or 2^{521}−1 for ECC.
I'll explain later how to perform fast constant-time multiplications
modulo nice primes.
But q doesn't naturally end up being a nice prime.
The number of points on an elliptic curve mod p
is always within 2 sqrt(p) of p+1 by Hasse's theorem,
and multiplying modulo this number isn't quite as slow
as multiplying modulo a general number,
but it's still not fast,
especially with the modifications needed to make it run in constant time.
One ECC paper proposed taking very special elliptic curves where q=p,
allowing reuse of the mod-p functions as mod-q functions,
but three groups of researchers quickly showed that taking q=p
provides
very low security.

Simply juggling where the exponents are placed,
and in particular using B^{S},
produces a signature system in which neither the signer nor the verifier
performs any divisions modulo q.
The signer performs just *one* reduction modulo q to compute S,
so there's no performance problem from using the simplest constant-time reduction methods.

**Compress R to H(R,M): H(B ^{S} A^{h},M) = −h; H(SB+hA,M) = −h.**
Instead of sending (R,S) as a signature,
the signer can send (−H(R,M),S) as a signature.
Given (h,S),
the verifier (or anyone else) recomputes R as B

If H(R,M) is shorter than R then this saves space in signatures. On the other hand, elliptic-curve points are typically sent as 256 bits, and then a 256-bit hash doesn't save any space. Sending R instead of H(R,M) also has the advantage of allowing batch signature verification, which is typically twice as fast as verifying one signature at a time. That's why Ed25519 doesn't use this compression.

**Use half-size H output.**
Given that we don't need collision resistance (see above),
do we really need a 256-bit hash?
Schnorr proposed using a half-size hash.
Together with the previous compression idea
this saves 25% of the space in signatures:
instead of a 4n-bit signature (2n-bit curve point R and 2n-bit integer S)
there's a 3n-bit signature (n-bit hash and 2n-bit integer S).

Schnorr's paper has been cited more than 2000 times, according to Google Scholar, but I'm often amazed at how few people actually seem to have read the paper. For example, twenty years later Greg Neven, Nigel Smart, and Bogdan Warinschi published a paper "Hash function requirements for Schnorr signatures" saying that Schnorr signatures don't need a collision-resistant hash function and can be reduced from 4n bits to 3n bits. As another example, David Naccache and Jacques Stern have a paper (and patent) explaining how to recover n message bits from a 4n-bit signature, reducing the effective signature overhead to 3n bits. As another example, 2n-bit pairing-based "short signatures" are usually claimed to be half the size of Schnorr signatures.

I don't mean to say that I recommend n-bit hashes: in particular, I don't think there has been adequate analysis of their security against multiple-target attacks. Ed25519 goes in the opposite direction, using a 512-bit hash function to "alleviate concerns regarding hash-function security". Remember that Ed25519 sends R rather than H(R,M), so using a longer hash doesn't cost any space: Ed25519 signatures are 64 bytes, independent of the length of the H output.

**Hash the public key: B ^{S} = R A^{H(R,A,M)}; SB = R + H(R,A,M)A.**
This is an extra modification in Ed25519
that doesn't appear in Schnorr signatures.
The idea is to
"alleviate concerns that several public keys could be attacked simultaneously":
an attacker computing many different hash values
now has to commit each hash value to a single public key being attacked.

**DSA/ECDSA, a return to the dark ages: B ^{H(M)/S} A^{R/S} = R; (H(M)/S)B + (x(R)/S)A = R.**
DSA is the ElGamal signature scheme with the following three changes:
use prime-order subgroups;
replace A with A

DSA was "invented" by NSA's David Kravitz, according to a patent application filed secretly in July 1991. It was proposed as a standard by NIST the next month. NIST didn't admit NSA's role until after a lawsuit was filed by Computer Professionals for Social Responsibility. NIST memos state that the "reasons for the selection" of DSA are summarized in an NSA document; as far as I know, that document is still classified Top Secret.

NIST received many public objections to DSA. (As NIST put it: "the number of negative comments was significantly larger than normally received for a proposed Federal Information Processing Standard".) For example, here are some of Rivest's comments:

It is my belief that the NIST proposal represents an attempt to install weak cryptography as a national standard, and that NIST is doing so in order to please the NSA and federal law enforcement agencies. ... A U.S. standard, even if weak and flawed, may be widely used overseas, making NSA's job easier.

Technical topics of the objections included
DSA's obviously breakable 2^{60} security level
(DSA was limited to 512-bit moduli);
the lack of an accompanying encryption mechanism;
DSA's poor performance;
DSA's unnecessary computation of an inverse "each time a message is to be signed";
and DSA's requirement of a cryptographically strong random number for each signature
(Rivest wrote "the poor user is given enough rope with which to hang himself").
I'll say more later about the random-number part.

NIST made one change, namely allowing 1024-bit moduli, and then issued DSA as a standard in 1994. Later NIST extended the standard to ECDSA, allowing 15 different elliptic curves that had been chosen by Jerry Solinas at NSA.

There are two different answers to this question: the modern answer used in Ed25519, and the implementation nightmare used in ECDSA.

**Edwards curves.**
The modern answer is only slightly more difficult than clock addition.
Mathematically,
the clock is a circle x^{2}+y^{2}=1.
Here are some examples of points on the clock:

- 12:00 is the point (0,1).
- 6:00 is (0,−1).
- 3:00 is (1,0).
- 2:00 is (sqrt(3/4),1/2).
- 5:00 is (1/2,−sqrt(3/4)).
- 7:00 is (−1/2,−sqrt(3/4)).

The sum of two clock points (x_{1},y_{1}) and (x_{2},y_{2})
is (x_{1}y_{2}+y_{1}x_{2},y_{1}y_{2}−x_{1}x_{2}).
You can check, for example, that the sum of 2:00 and 5:00 with this formula is 7:00.

"Elliptic curve" is not the same thing as "ellipse",
and in particular the circle isn't an elliptic curve,
but you can obtain an elliptic curve by slightly deforming a circle.
On the right is a graph of x^{2}+y^{2}=1−30x^{2}y^{2};
that's an elliptic curve.
More generally,
modern ECC uses a **complete Edwards curve** x^{2}+y^{2}=1+dx^{2}y^{2}
where the curve constant d is a non-square.
The sum of (x_{1},y_{1}) and (x_{2},y_{2}) on this curve
is ((x_{1}y_{2}+y_{1}x_{2})/(1+dx_{1}x_{2}y_{1}y_{2}),(y_{1}y_{2}−x_{1}x_{2})/(1−dx_{1}x_{2}y_{1}y_{2})).
The denominators are always nonzero.

**Weierstrass curves.**
ECDSA uses a different shape of elliptic curve,
a **short Weierstrass curve** y^{2}=x^{3}−3x+b.
The curve points aren't simply the solutions (x,y) to this equation;
there's also another curve point, ∞.
The standard addition formulas
have *six different cases*:

- If x
_{1}is not x_{2}: (x_{1},y_{1})+(x_{2},y_{2}) = (x_{3},y_{3}) where x_{3}= λ^{2}−x_{1}−x_{2}, y_{3}= λ(x_{1}−x_{3})−y_{1}, λ=(y_{2}−y_{1})/(x_{2}−x_{1}). - If y
_{1}is not 0: (x_{1},y_{1})+(x_{1},y_{1}) = (x_{3},y_{3}) where x_{3}= λ^{2}−x_{1}−x_{1}, y_{3}= λ(x_{1}−x_{3})−y_{1}, λ=(3x_{1}^{2}−3)/2y_{1}. - (x
_{1},y_{1})+(x_{1},−y_{1}) = ∞. - (x
_{1},y_{1}) + ∞ = (x_{1},y_{1}). - ∞ + (x
_{2},y_{2}) = (x_{2},y_{2}). - ∞ + ∞ = ∞.

This is a mess to implement and test (for example, Watson Ladd pointed out a week ago that RFC 6090, which is coauthored by NSA, gets it wrong in a way that can easily cause security problems), and I don't even want to try to explain how to make it run in constant time. There have been some efforts to merge the first two cases (at the expense of speed), but the merged formulas still have failure cases, and there have been fancy attacks showing how attackers can learn secret information by triggering these failure cases. It's much better to have a single formula that always works.

From the perspective of discrete-logarithm attacks (computing the secret key from the public key), Edwards curves can be viewed as Weierstrass curves, and the standard security analysis (Pollard's rho method, transfers, etc.) works in exactly the same way for both types of elliptic curves. But if you take a broader attack view (does the implementation always work? can the attacker trigger failure cases? is the implementation constant time?) then Edwards curves are much safer than Weierstrass curves. As a bonus, Edwards curves are also considerably faster than Weierstrass curves.

**History.**
Most of the ECDSA flaws that I'm describing in this blog post
were already known to be fixable when ECDSA was standardized.
The case distinctions for addition are an exception.
Before 2007, everyone thought that addition on an elliptic curve
*needed* case distinctions.
Here's some of the relevant history:

- 1985:
Victor Miller
introduced ECC,
defining elliptic curves as short Weierstrass curves y
^{2}=x^{3}+ax+b. - 1986: Chudnovsky and Chudnovsky pointed out a speedup for a=−3, and analyzed the performance of various other elliptic-curve shapes. Every addition formula in the paper has exceptional cases.
- 1987: Neal Koblitz independently introduced ECC, again defining elliptic curves as short Weierstrass curves. (The paper was submitted in 1985.)
- 1987: Montgomery introduced what are now called "Montgomery curves" and the "Montgomery ladder" formulas for single-scalar multiplication. Single-scalar multiplication is more restrictive than general addition, although it's good enough for ECDH.
- 1995: Bosma and Lenstra proved that "The smallest cardinality of a complete system of addition laws on E equals two". This means that a single addition law can't be "complete". Nobody realized, at the time, that the definition of "complete" in the paper was misleading. (Note for mathematicians: The definition is geometric, not arithmetic; it considers only algebraically closed base fields.)
- 2000: IEEE P1363 standardized y
^{2}=x^{3}−3x+b, claiming (with a citation to Chudnovsky and Chudnovsky) that this curve shape provides "the fastest arithmetic on elliptic curves". - 2000: NIST standardized ECDSA using y
^{2}=x^{3}−3x+b. - 2005: I introduced Curve25519, setting speed records for ECDH software, and proved that the Montgomery ladder has no exceptional cases on Curve25519.
- 2007: Harold Edwards published what are now called "Edwards curves".
- 2007:
Lange and I
proved that the Edwards addition law
for x
^{2}+y^{2}=1+dx^{2}y^{2}is complete if d is not a square. Specifically, if d is a non-square in a field k then the Edwards addition law for the curve E defined by x^{2}+y^{2}=1+dx^{2}y^{2}has no exceptional cases in E(k)×E(k).

Curve25519 turned out to be expressible as a complete Edwards curve. This is the curve used in the Ed25519 signature system.

The arithmetic operations used in Curve25519
are additions, subtractions, and multiplications modulo the prime number p=2^{255}−19.
Multiplications are the main bottleneck in the computation.
Divisions are delayed until the end of the computation
(i.e., intermediate results are stored as fractions);
the final division a/b is computed as ab^{p−2},
and b^{p−2} is computed as a sequence of slightly more than 255 multiplications.
Similar comments apply to Ed25519.

**Constant-time Curve25519.**
Building constant-time software for arithmetic
is very much like building unrolled hardware circuits for arithmetic:
you allocate a constant number of bits for each integer,
and always perform arithmetic on all of the bits.
For example, when you add two integers a,b between 0 and 2^{255},
each stored as 255 bits,
you always set aside 256 bits for the result a+b.
When you multiply two integers a,b between 0 and 2^{256},
you always set aside 512 bits for the result ab.

Between multiplications it's important to reduce the number of bits
by subtracting an appropriate multiple of p.
Specifically,
starting from a big integer c,
mask c to obtain the bottom 255 bits r,
shift c right by 255 bits to obtain the remaining bits q,
and then compute 19q+r;
this is the same as c−pq.
If c is between (e.g.) 0 and 2^{600} then 19q+r fits into 350 bits,
and a second iteration of the same procedure produces 256 bits.
This isn't guaranteed to be between 0 and p−1,
but it's small enough for the next multiplication.

At the end of the computation it's not enough to produce a 256-bit integer x:
it's important to *completely* reduce modulo p,
producing a result between 0 and p−1.
One safe way to do this is with two iterations
of a constant-time conditional subtraction.
A conditional subtraction replaces x with x−(1−s)p,
where s is the (twos-complement) sign bit in x−p.
This can be optimized further,
but since it's done only at the end of the computation
the exact performance doesn't matter.

**Constant-time NIST P-256.**
In principle one can adapt the ideas of the previous three paragraphs
to NIST P-256,
the usual choice of curve for ECDSA,
producing constant-time arithmetic modulo the NIST P-256 prime.
The problem is that NIST P-256
has what Adam Langley calls a
"nasty prime formation",
with many intermediate terms:
it's 2^{256}−2^{224}+2^{192}+2^{96}−1.
This makes every reduction modulo this prime
considerably more complicated and considerably slower
than reduction modulo a two-term prime.

As I mentioned, the NIST curves were chosen by Jerry Solinas at NSA.
Solinas published a paper
"Generalized Mersenne numbers"
in 1999
claiming that his primes were
"particularly well suited for machine implementation".
The ECDSA standard also has various comments on the efficiency of these primes.
For the NIST P-256 prime,
the ECDSA standard specifies the following procedure
to reduce modulo p an integer "A less than p^{2}":

- Write A as
(A
_{15}, A_{14}, A_{13}, A_{12}, A_{11}, A_{10}, A_{9}, A_{8}, A_{7}, A_{6}, A_{5}, A_{4}, A_{3}, A_{2}, A_{1}, A_{0}), meaning A_{15}2^{480}+ A_{14}2^{448}+ A_{13}2^{416}+ A_{12}2^{384}+ A_{11}2^{352}+ A_{10}2^{320}+ A_{9}2^{288}+ A_{8}2^{256}+ A_{7}2^{224}+ A_{6}2^{192}+ A_{5}2^{160}+ A_{4}2^{128}+ A_{3}2^{96}+ A_{2}2^{64}+ A_{1}2^{32}+ A_{0}2^{0}. - Compute T = (A
_{7}, A_{6}, A_{5}, A_{4}, A_{3}, A_{2}, A_{1}, A_{0}). - Compute S
_{1}= (A_{15}, A_{14}, A_{13}, A_{12}, A_{11}, 0, 0, 0). - Compute S
_{2}= (0, A_{15}, A_{14}, A_{13}, A_{12}, 0, 0, 0). - Compute S
_{3}= (A_{15}, A_{14}, 0, 0, 0, A_{10}, A_{9}, A_{8}). - Compute S
_{4}= (A_{8}, A_{13}, A_{15}, A_{14}, A_{13}, A_{11}, A_{10}, A_{9}). - Compute D
_{1}= (A_{10}, A_{8}, 0, 0, 0, A_{13}, A_{12}, A_{11}). - Compute D
_{2}= (A_{11}, A_{9}, 0, 0, A_{15}, A_{14}, A_{13}, A_{12}). - Compute D
_{3}= (A_{12}, 0, A_{10}, A_{9}, A_{8}, A_{15}, A_{14}, A_{13}). - Compute D
_{4}= (A_{13}, 0, A_{11}, A_{10}, A_{9}, 0, A_{15}, A_{14}). - Compute T+2S
_{1}+2S_{2}+S_{3}+S_{4}−D_{1}−D_{2}−D_{3}−D_{4}. - Reduce modulo p "by adding or subtracting a few copies" of p.

How is the implementor supposed to handle "adding or subtracting a few copies"?
A loop would take variable time, presumably leaking secret information.
A pile of 6 constant-time conditional operations will work correctly
(conditional addition of 4p,
conditional addition of 2p,
conditional addition of p,
conditional subtraction of 4p,
conditional subtraction of 2p,
conditional subtraction of p)
but is also quite slow.
An implementor who removes these steps,
delaying complete reduction until the end of the computation,
then has to handle products that are not "less than p^{2}",
and the standard doesn't give any hint of how to adapt its reduction procedure
to handle such inputs.

**Radix flexibility.**
The standard NIST P-256 reduction procedure becomes even more painful
if integers aren't represented in radix 2^{32} (or 2^{16} or 2^{8}):
the word shuffling required for T,S_{1},S_{2},S_{3},S_{4},D_{1},D_{2},D_{3},D_{4}
then turns into a long sequence of bit manipulations.
The reason this is important is that radix 2^{32} isn't the best way
to carry out big-integer arithmetic on most CPUs.
Even on CPUs where the largest multiplier size is exactly 32 bits,
it's almost always better to use a radix smaller than 2^{32},
so that carries can be delayed.

For comparison,
the prime 2^{255}−19 is friendly to radix 2^{32} (simply reduce modulo 2^{256}−38)
but is also friendly to many other choices of radix.
Curve25519 is a frequent topic of ECC-optimization papers
(not just my own papers!),
and those papers have used many different choices of radix
to take advantage of different multipliers on different platforms:

- My original paper for Pentium M etc.:
radix 2
^{21.25}using double-precision floating-point, and radix 2^{25.5}using extended-precision floating-point. - 2007
Gaudry–Thomé
for Core 2 and Athlon 64:
radix 2
^{64}using 64-bit integers. - 2009
Costigan–Schwabe
for Cell:
radix 2
^{12.75}using 16-bit integers. - 2011
Bernstein–Duif–Lange–Schwabe–Yang
for Nehalem and Westmere:
radix 2
^{64}and 2^{51}using 64-bit integers. - 2012
Bernstein–Schwabe for Cortex A8:
radix 2
^{25.5}using 32-bit integers. - 2014
Langley–Floodyberry
(no paper but software online) for various Intel chips:
radix 2
^{25.5}using 32-bit integers, and radix 2^{51}using 64-bit integers. - 2014
Mahé–Chauvet
for various GPUs:
radix 2
^{25.5}using 32-bit integers, and radix 2^{51}using 64-bit integers.

A simple 32-bit implementation of Curve25519 (or Ed25519)
is already considerably faster than NIST P-256 (or ECDSA using NIST P-256),
and tuning the radix provides an extra speed boost,
thanks to the simple shape of 2^{255}−19.

**Other standard primes.**
To be fair I should mention that there's one standard NIST curve
using a nice prime, namely 2^{521}−1;
but the sheer size of this prime makes it much slower than NIST P-256.
The other NIST primes are
2^{192}−2^{64}−1 (substandard security level),
2^{224}−2^{96}+1,
and 2^{384}−2^{128}−2^{96}+2^{32}−1.
The secp256k1 (Bitcoin) prime is 2^{256}−2^{32}−977.
The French (ANSSI FRP256v1) and German (Brainpool) standard primes
have many more terms and are much slower.

One central source of curve specifications is safecurves.cr.yp.to. This is an ECC security-evaluation site that Lange and I run, covering 20 different curves so far. That site doesn't evaluate performance, but many curves are included in the central eBACS benchmarking site that I mentioned earlier.

Signing starts by generating a secret random integer k and then computing R = kB. I'll discuss the randomization later. The main bottleneck in signing is the computation of kB from k, a "fixed-base-point scalar multiplication".

One easy way to compute kB
is to split k into (say) 4-bit pieces.
For example,
k has 256 bits in Ed25519,
so it can be written as
k_{0} + 2^{4} k_{1} + 2^{8} k_{2} + 2^{12} k_{3} + ... + 2^{252} k_{63}
where each k_{i} is in {0,1,2,...,15}.
Then kB is the sum of 64 points, namely
k_{0} B, 2^{4} k_{1} B, 2^{8} k_{2} B, 2^{12} k_{3} B, ..., 2^{252} k_{63} B.
The term k_{0} B is looked up in a table containing
the 16 points 0B,1B,2B,...,15B;
the term 2^{4} k_{1} B is looked up in a table containing
the 16 points 2^{4} 0B,2^{4} 1B,2^{4} 2B,...,2^{4} 15B;
same for all the other terms.
There are many possible tradeoffs
between the total table size and the number of additions,
and there are slightly more complicated methods
that improve the tradeoffs.

Wait a minute:
for security we have to avoid secret array indices.
So how do we look up k_{0} B in a table?
The index k_{0} is secret!

What the Ed25519 software actually does is read *all* of the table entries
and perform arithmetic to extract the correct table entry.
The basic idea is simple (and fundamentally the same as what RAM hardware does):
if b is a bit then x[b] is the same as x[0]+b(x[0]−x[1]).
You can replace the addition, multiplication, and subtraction here
with XOR, AND, and XOR respectively.
(On some processors there are "conditional move"/"select" instructions
that perform the same computation,
although you have to be careful to check whether those instructions
actually take constant time.)
If b and c are bits then x[2*c+b] is the same as y[0]+c(y[0]−y[1]),
where y[0] is x[0]+b(x[0]−x[1])
and y[1] is x[2]+b(x[2]−x[3]).

**Scalar multiplication for ECDSA.**
This method of computing kB in constant time might seem quite general
and trivially applicable to ECDSA.
The big problem is that ∞ is an exceptional case
for the Weierstrass addition formulas,
and the point 0B appearing in each table is exactly ∞.

There are several ways to work around this problem
without huge extra performance penalties.
For example,
one can change the range of each k_{i} to exclude 0,
although expanding k as a sum of k_{i} then requires some thought.
As another example,
one can use arithmetic to replace P+0 with P
in the case that k_{i} is 0,
so that a bogus computation of P+0 doesn't cause problems,
although handling the possibility of an *initial* 0 then requires some thought.

The ANSI X9.62 ECDSA standard doesn't consider these issues.
(The NIST ECDSA standard also doesn't consider these issues:
it simply specifies the NSA choices of elliptic curves
and cites ANSI X9.62 for the remaining details of ECDSA.)
The standard instead states a scalar-multiplication method commonly known as "NAF";
this method doesn't involve any additions of 0
but *doesn't take constant time*.
The OpenSSL software broken by Benger, van de Pol, Smart, and Yarom
uses a faster scalar-multiplication method commonly known as "wNAF",
which also doesn't take constant time.

I haven't actually bought a copy of the X9.62 standard; I'm looking at Appendix D.3.2 in a free online draft of the standard. Maybe the standard was changed after this draft, but back then people weren't really paying attention to timing attacks, so I would be surprised if the final standard explains a constant-time scalar-multiplication algorithm, while it's completely unsurprising to see that the OpenSSL ECDSA implementation uses a non-constant-time scalar-multiplication algorithm.

It's of course possible that a constant-time scalar-multiplication algorithm will appear in a future ECDSA standard. But this will be yet another way that ECDSA is more complicated, and slower, than a properly designed elliptic-curve signature system.

ElGamal's signing procedure required the signer to choose a uniform random R independently for each message, by choosing the discrete logarithm k of R as a uniform random secret integer modulo q. ElGamal pointed out that a sufficiently severe screwup, such as publishing k or using the same k for two messages, would reveal the signer's long-term secret key. This is exactly how the Sony PlayStation 3 ECDSA signing key was discovered in 2010.

Often k is called a "nonce";
a **nonce**, by definition, is a number used once.
However,
this understates the security requirements for k.
Followup papers showed that less severe deviations from randomness,
such as always choosing the top 4 bits of k to be 0
or merely revealing the top 4 bits of each k,
would also reveal the signer's secret key.
Recent papers on these attack techniques include
De Mulder–Hutter–Marson–Pearson
and the new Benger–van de Pol–Smart–Yarom paper.

**The costs of RNGs.**
An appendix to the 1994 DSA standard
specified a typical RNG for generating k,
starting by hashing a "secret initial value"
and then updating this initial value for the next k.
This is not a bad simulation of true randomness,
but it begs the question of where the "secret initial value" comes from.
It's also **stateful**:
the RNG state *must* be updated for each signature.
Multiple signers can't simply clone a single state.
One has to be careful with crashes, VM rewinding, etc.

Fortunately, the costs of getting an RNG right can be shared across applications. It's safe to have a single centralized RNG in an operating system (or, even better, a hypervisor) providing random numbers to all applications; if the RNG is secure then the outputs are indistinguishable from the outputs of separate application-specific secure RNGs. We need at least one secure RNG in any case to generate long-term secret keys, so we can reuse that RNG to generate k.

**Avoiding RNGs in signatures.**
It would be better if the signature system
didn't need new randomness for each signature in the first place.
The RNG would be used much less frequently,
and could afford to be designed more conservatively.
We wouldn't have to worry about the RNG taking entropy from
malicious sources.
We could move a long-term secret key
into a highly protected hardware device for signing,
without having to pay for an RNG in the hardware device
and without having to worry about the quality of that RNG.
We could have *test vectors* for signing functions,
because those functions wouldn't have a hidden RNG input.

This is exactly what was accomplished by a very simple modification to DSA suggested by George Barwood in February 1997: namely, "a prng is used to generate k, keyed with d (the private key), and seeded with e (the message digest)." Essentially the same idea was suggested by John Wigley in March 1997: "Why not derive the random number from the message to be signed therefore removing the need for a hardware rng (smartcards) or a complex and potentially fallible software attempt. ... 1)Hash the message to be signed as usual and use that as the message in the signature algorithm. 2)hash the private key using the chaining variable state left by the previous hash, and use that as the random number."

The same idea was reinvented by M'Raïhi, Naccache, Pointcheval, and Vaudenay, who managed to stretch it out into a multiple-page conference paper. Naccache, M'Raïhi, and Levy-dit-Vehel had actually filed a patent application on the idea in May 1997; they abandoned the patent application in 2003. The idea was yet again reinvented by Katz and Wang and published as part of a 2003 paper, although the main point of that paper was something else so the idea wasn't stretched out.

**Maximum security.**
Ed25519 actually generates k by hashing the secret key together with the entire message.
Very fast "PRFs" used to authenticate messages, such as VMAC, are safe here,
but Ed25519 simply uses SHA-512
since it's quite rare for this to be a noticeable bottleneck.
Using an RNG wouldn't break interoperability,
but the safest option is the default.
Using Ed25519 to sign H(m), rather than m, would mean hashing m only once
(at the expense of collision resilience),
but again the safest option is the default.

Ed25519 has two secret keys, one used for the hashing to produce k and one used as the discrete logarithm of the public key, to prevent any weird interactions between the hashing and the rest of the signature system. Both of these keys are actually derived from a single 256-bit master secret key, but if the key-derivation function is secure then this is indistinguishable from having two independent secret keys. This derivation also means that quite severe deviations from randomness in the master secret key, such as having the first 64 bits all zero, still won't allow any easy attacks.

**More possibilities for the hash input.**
I've heard many stories from DSA/ECDSA implementors
who generate k as a hash of various inputs.
All of the inputs I've heard about fall into four categories:

- P: a string from some PRNG.
- X: the DSA secret key, or some better-separated secret.
- N: the number of messages signed.
- M: the message, or a collision-resistant hash of the message.

It's important for k to be unguessable. M and N are guessable, so it's critical to use at least one of P, X. There's an easy argument that using X is sufficient here: if X were guessable then the whole system would be broken no matter how k is generated. Note that the same argument doesn't apply to P.

It's also important for k to be different for each message. X doesn't vary from one message to another, so it's critical to use at least one of P, N, M. Using M is obviously sufficient here. Using N is also obviously sufficient. Using P might be sufficient but requires a serious PRNG audit.

It's also helpful for signing to be stateless.
This means that it's *bad* to use N or P.

All of these criteria are satisfied by simply hashing (X,M), as Barwood and Wigley proposed. (P,X,M), (X,N,M), and (P,X,N,M) lose statelessness, and I have no idea why they're supposed to be better than (X,M). There's a very small speed argument for omitting M in favor of N, producing (X,N) or (P,X,N), but these lose statelessness and are certainly more difficult to get right. I'm not aware of any argument for omitting X.

[2022.01.09 update: Updated links above.]