The cr.yp.to blog



2017.07.19: Benchmarking post-quantum cryptography

SUPERCOP, the System for Unified Performance Evaluation Related to Cryptographic Operations and Primitives, is an open benchmarking package that measures (currently) 2202 implementations of 602 cryptographic primitives.

For example, the crypto_hash/sha256 directory in SUPERCOP contains five implementations of the standard SHA-256 hash function: crypto_hash/sha256/ref (a reference implementation designed to be easy to read), crypto_hash/sha256/cryptopp (a fast C++ implementation using the Crypto++ library), crypto_hash/sha256/openssl, crypto_hash/sha256/sphlib, and crypto_hash/sha256/sphlib-small. Each computer automatically selects the fastest SHA-256 implementation for the final speed reports, the same way that a user prioritizing performance will select the fastest implementation. Including a slow implementation in the benchmarks doesn't cause any harm, except for a slight waste of benchmarking time.

There are nearly 300 people listed in "designers" files in primitives included in SUPERCOP, and nearly 150 people listed in "implementors" files in implementations; see, e.g., the online list of hash functions and implementations. Code is sometimes shared between implementations of primitives in the same family (for example, crypto_stream/chacha12 is almost identical to crypto_stream/chacha20), but there are a huge number of different families. One of the main reasons that we've successfully reached this scale is that we've made it very easy to add new primitives and new implementations to SUPERCOP. To add, e.g., a new hash function, you implement just one C function:

   #include "crypto_hash.h"

   int crypto_hash(
     unsigned char *out,
     const unsigned char *in,unsigned long long inlen
   )
   {
     ...
     return 0;
   }

You also create a separate file api.h with one line

   #define CRYPTO_BYTES 32

saying how big the hash output is. That's it.

Most cryptographic operations are more complicated than hashing, but the SUPERCOP interface consistently keeps the fuss to a minimum. Signatures, for example, use exactly three functions. First,

   int crypto_sign_keypair(
     unsigned char *pk,
     unsigned char *sk
   )

generates a public key and secret key (and returns 0). Second,

   int crypto_sign(
     unsigned char *sm,unsigned long long *smlen,
     const unsigned char *m,unsigned long long mlen,
     const unsigned char *sk
   )

uses a secret key sk to sign a message m containing mlen bytes, creating a signed message sm containing *smlen bytes (and again returns 0). Third,

   int crypto_sign_open(
     unsigned char *m,unsigned long long *mlen,
     const unsigned char *sm,unsigned long long smlen,
     const unsigned char *pk
   )

uses a public key pk to verify a signed message sm, creating a verified message m (and returns 0, or −1 if verification failed). Finally, there's a file api.h containing, e.g.,

   #define CRYPTO_SECRETKEYBYTES 64
   #define CRYPTO_PUBLICKEYBYTES 32
   #define CRYPTO_BYTES 64

to say that sk has 64 bytes, pk has 32 bytes, and sm is at most 64 bytes longer than m.

It's not a coincidence that this is exactly the same easy-to-use API provided by the increasingly popular NaCl (pronounced "salt") family of production cryptographic libraries: notably the original NaCl, libsodium, and TweetNaCl ("tweet-salt"). Obviously production libraries are under more constraints than benchmarks—

—but none of these constraints require a different API. See my paper "The security impact of a new cryptographic library" with Lange and Schwabe for further analysis of the choices made in NaCl.

The NIST post-quantum competition

NIST has announced a Post-Quantum Cryptography Project to develop standards for post-quantum cryptography. Submissions are due by the end of November 2017. (I should note that NIST doesn't refer to this competition as a "competition"; NIST seems to think that in a "competition" there must be only one winner, the one algorithm that manages to chop off the heads of all the others.)

A submission that turns out to be much less secure than originally claimed will almost certainly be kicked out of the competition: people will be scared of a security story that's unstable, or that's actually stable but poorly understood. But suppose many submissions survive security analysis. How will NIST decide what to standardize?

One of the most obvious ways to answer this question is to ask which submissions offer the best tradeoffs between performance and (apparent) security: i.e., which submissions offer the best security levels under various performance constraints. But this requires understanding how submissions perform, which brings me back to benchmarking.

The most important topics in this competition are public-key signatures and public-key encryption. I've happy to announce that SUPERCOP is now ready for post-quantum submissions of both types.

Case study: crypto_kem/rsa2048

The new crypto_kem directory in SUPERCOP is for key-encapsulation mechanisms (KEMs). KEMs are a simple, robust way to handle public-key encryption, also highlighted in my previous blog post.

What makes KEMs easier than other types of public-key encryption is that they don't take a user-specified message as input. Instead they generate and transport a secret session key. For example, Shoup's "RSA-KEM" (originally named "Simple RSA") simply chooses a random number m modulo the public key pq, and sends the ciphertext m3 mod pq to transport the session key SHA-256(m). This session key can then be used to encrypt and authenticate any number of messages to the receiver (or from the receiver back to the sender).

crypto_kem/rsa2048 is RSA-KEM using a 2048-bit public key pq. There are two implementations: crypto_kem/rsa2048/gmp using the C interface to the GMP bigint library (which is automatically compiled as part of SUPERCOP), and crypto_kem/rsa2048/gmpxx using the C++ interface to the same library. The C++ implementation has the advantage of being somewhat easier to read; the C implementation has the advantage of being callable from C programs. The implementations have essentially the same speed.

There are four sizes in api.h. CRYPTO_PUBLICKEYBYTES is 256 (i.e., 2048 bits). CRYPTO_CIPHERTEXTBYTES is also 256. CRYPTO_BYTES is 32 (i.e., 256 bits), the size of the session key. CRYPTO_SECRETKEYBYTES is 384; this is used for two 1024-bit primes p and q, and the 1024-bit inverse of p modulo q.

The C++ implementation defines crypto_kem_keypair in keypair.cpp, crypto_kem_enc in enc.cpp, and crypto_kem_dec in dec.cpp. It also has two subroutines, gmpxx_export.cpp and gmpxx_import.cpp, for conversions between GMP bigints (mpz_class) and little-endian strings. The C implementation is structured similarly.

To hash the random number m, enc.cpp and dec.cpp simply call the crypto_hash_sha256 function defined elsewhere in SUPERCOP:

   #include "crypto_hash_sha256.h"
   ...
   crypto_hash_sha256(k,mstr,NBYTES);

Notice that this is including crypto_hash_sha256.h and calling crypto_hash_sha256, whereas the SHA-256 implementation included crypto_hash.h and defined crypto_hash. The extended name crypto_hash_sha256 allows several hash functions to coexist inside the SUPERCOP library. Another automatic name change is that the

   #define CRYPTO_BYTES 32

from api.h in the SHA-256 implementation turns into a definition of crypto_hash_BYTES in crypto_hash.h in that implementation, and a definition of crypto_hash_sha256_BYTES in crypto_hash_sha256.h.

Does the NIST API allow submitters to call crypto_hash_sha256? The answer isn't clear. The alternative is to demand separate SHA-256 implementations from every submitter who wants to use SHA-256; I see this as a pointless waste of time. The benchmarking framework can and does provide SHA-256, and someone who wants to write another benchmarking framework should also provide SHA-256 instead of asking a bunch of submitters to provide it. Recommendation to NIST: Adjust the API to clearly allow submissions to call crypto_hash_sha256.

Does the NIST API allow submitters to call GMP? Again the answer isn't clear, even though there will certainly be submissions that will benefit from GMP. NIST has made a generic proposal to allow each submission package to include a script to compile and install external libraries, but

Recommendation to NIST: Adjust the API to clearly allow submissions to call GMP.

Finally, there are actually two different security goals for KEMs, a stronger "active" goal ("IND-CCA2") and a weaker "passive" goal. There's a file crypto_kem/rsa2048/active indicating that crypto_kem/rsa2048 is aiming for the active goal. Of course, we'll benchmark submissions to SUPERCOP whether or not they have such security declarations.

Converting KEMs to traditional public-key cryptosystems

In my previous blog post, I suggested

defining some standard conversions that NIST will apply automatically: e.g., converting a CCA2-secure KEM into CCA2-secure PKE by composition with AES-256-GCM, and converting the other way by encrypting a random 256-bit key. NIST won't want to listen to pointless arguments such as "yes we know we're worse than this PKE but it wasn't submitted to the KEM category" from KEM submitters, and won't want to have to wade through artificially bloated PKE+KEM submissions that are really just one design but want to compete in every category.

PKE is the traditional notion of a public-key cryptosystem, where the ciphertext communicates a user-selected plaintext.

NIST has issued a vague statement that it will "apply standard conversion techniques to convert between schemes if necessary", but it hasn't made an explicit list of which conversions will be applied. For example, it's totally unclear to me whether NIST will automatically treat KEM submissions as PKE submissions, and, if so, what conversion it will apply. Recommendation to NIST: Make an explicit list of automatic conversions that NIST is guaranteed to apply.

Here's an illustrative example of a conversion that can be applied (by submitters or by NIST). There's now a crypto_encrypt/rsa2048 that puts together crypto_kem/rsa2048 and crypto_aead/aes256gcmv1 in a straightforward way, using the session key transported by RSA-KEM as the key for AES-256-GCM to encrypt and authenticate the plaintext. The PKE ciphertext is the KEM ciphertext followed by the AES-256-GCM ciphertext. (Maybe I should again emphasize that SUPERCOP is generally unaudited; even something that sounds this easy can be screwed up.)

SUPERCOP compiles crypto_kem before crypto_encrypt. Any crypto_encrypt function can call any crypto_kem_X function, provided that there's a crypto_kem/X/used file to tell SUPERCOP to include crypto_kem_X in the library. The crypto_kem/X/Y implementation has to follow standard library-namespacing rules, keeping all non-static symbols within the crypto_kem_X_Y_* namespace. This is done automatically by crypto_kem.h for the crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec functions.

I've considered including a crypto_kem2 compiled after crypto_encrypt, to simplify code reuse in the other direction, but this has been subsumed by a broader plan to allow arbitrary directed acyclic graphs of dependencies. As a workaround in the meantime, submitters can use symbolic links to share software across directories.

There was already a crypto_encrypt/ronald2048 using RSA-2048 in a more complicated way, including "message recovery" to save space: if the plaintext is long then the ronald2048 ciphertext is just 75 bytes longer. As this example illustrates, building PKE directly can have an efficiency advantage compared to building PKE via KEM. Typically the session key transported by a KEM is reused to encrypt many packets, making this advantage unnoticeable, but sometimes it's important to squeeze as much as possible into the first packet.

The ronald2048 implementation is a wrapper that I wrote almost a decade ago on top of OpenSSL. (This assumes that the operating system has OpenSSL installed. Including OpenSSL in SUPERCOP, and specifically dealing with the OpenSSL build system, turned out to be too much of a hassle, although maybe this decision should be revisited after various improvements to OpenSSL.) GMP seems about twice as fast as OpenSSL on Haswell, suggesting that ronald2048 should be reimplemented on top of GMP. GMP now includes various sec functions that are supposed to run in constant time, although I haven't tried using these yet.

Frequently asked questions

Where does SUPERCOP discussion happen? Send an empty message to ebats-subscribe at list.cr.yp.to to join the mailing list.

How do I obtain random bytes in my software? Call randombytes wherever the specification asks for random bytes. randombytes is a strong random-number generator (RNG) provided by SUPERCOP. Don't use rand or random: they aren't cryptographically strong. If possible, don't use the OpenSSL RNG: SUPERCOP's automatic testing relies on all randomness coming from randombytes.

Later in this blog post I'll give examples of how randombytes is used. My next blog post will dive into the internal details of the "fast-key-erasure RNG" now used by SUPERCOP, but you can simply call randombytes without worrying about these details.

Do I have to submit both a reference implementation and an "optimized" implementation? No. If in doubt, submit just a reference implementation: an implementation that's designed to be as easy as possible to read, without concern for speed. There are many good examples (and some not-so-good examples) in various ref directories in SUPERCOP.

Of course you can also submit one or more "optimized" implementations, but your best bet for making things run really fast is to attract the attention of an optimization expert. The expert will start by asking how much arithmetic your code is doing and whether some of this arithmetic can be removed; this is generally much easier to understand from a reference implementation than from an "optimized" implementation.

Can my software use gcc features beyond C89 ("ANSI C")? Yes. Whatever compiles and runs on our benchmarking machines is fine. This is a benchmarking project, not a language-lawyering project.

Practically all of the benchmarking machines have gcc installed, and the stragglers have clang, which seems to try hard to support all of the usual uses of gcc. SUPERCOP automatically tries many different compiler options; see okcompilers/c for the list, and okcompilers/cpp for the corresponding C++ list. If your code desperately needs another compiler option, please speak up on the mailing list to explain the issue. Most, although not all, of the machines are little-endian. All of the machines have two's-complement arithmetic and 8-bit "char".

Can the optimized software use vector instructions? Assembly? Yes, of course. Most cryptographic speed records use vector instructions and/or assembly. Recommendation to NIST: Clearly state that this is allowed.

Can the optimized software use multiple cores? No. Running on multiple cores might make sense for some big operations (such as RSA key generation) but creates various measurement issues that SUPERCOP hasn't learned how to handle.

Randomness and known-answer tests

Let's go back to crypto_kem/rsa2048. To generate a random m in the first place, enc.cpp calls a randombytes function provided by SUPERCOP:

   #include "randombytes.h"
   ...
   unsigned char mstr[NBYTES];
   ...
   randombytes(mstr,NBYTES);

Similarly, keypair.cpp has a randomprime function that calls randombytes in a loop. Each time through the loop generates a random integer and then checks whether the integer is prime.

SUPERCOP has two implementations of randombytes. Both of them are built from strong high-speed stream ciphers. In previous versions of SUPERCOP, randombytes was strong but much slower. The new speed doesn't really matter for RSA, and in general there have been very few complaints about the time for randombytes (which has always been included in SUPERCOP's measurements); but randomness is a much larger part of the overall computation for some post-quantum primitives, making RNG speed an important issue. If those primitives end up being deployed then production cryptographic libraries will also start paying more attention to the speed of randomness generation.

The big difference between the two randombytes implementations is that one, fastrandombytes, is seeded from the kernel's random-number generator and used by SUPERCOP for benchmarks, while the other, knownrandombytes, is deterministically seeded and used by SUPERCOP to automatically generate known-answer tests. Of course, making this automatic strategy work means prohibiting all use of randomness from other sources. That's why I said that you shouldn't use, e.g., the random-number generator built into OpenSSL.

Sometimes a simple reference implementation will have a loop of processing random numbers, say 1024 times calling randombytes(...,4) and processing the result. An optimized implementation might instead call randombytes(...,4096) and use vector instructions to process the results. Will the resulting bytes from randombytes be identical for known-answer tests? SUPERCOP now guarantees that the answer is yes. Recommendation to NIST: Make the same guarantee in the NIST API.

If an optimization involves a heavier change in randomness generation (e.g., switching a lattice-based system from one "sampler" to another), then the optimization is a different primitive, with different results in known-answer tests and potentially with a different security level, so it has to be submitted to SUPERCOP under a different name. But merely changing how the randombytes stream is divided into packets doesn't require splitting primitives.

SUPERCOP now reports the number of bytes obtained from randombytes, and the number of calls to randombytes, allowing reasonably accurate extrapolations of the impact of changes in the performance of randombytes.

How NIST currently handles randomness and known-answer tests

The NIST API puts three unfortunate restrictions on how implementations use randombytes.

First: NIST requires implementations to provide not merely keygenerate and encapsulate and decapsulate (i.e., keypair and enc and dec), but also a deterministic keygenerate_KAT and a deterministic encapsulate_KAT. These deterministic functions take as extra input a limited-size seed (24, 32, or 40 bytes, depending on the security level), and are required to deterministically expand this seed into whatever number of bytes they actually need.

The rsa2048 example illustrates how annoying this would be. Consider, for example, the rejection-sampling loop in encryption, used to generate a uniform random integer m below the public key n:

   do {
     randombytes(mstr,NBYTES);
     m = gmpxx_import(mstr,NBYTES);
   } while (m >= n);

This randombytes call has to be replaced: mstr has to be produced in some specified way from the seed provided as encapsulate_KAT input. There has to be some sort of state initialized from the seed, modified each time through the loop, and used to generate mstr.

The situation is even worse for key generation: randombytes is naturally called inside randomprime, so the state needs to be passed from the key-generation function to randomprime. This means more arguments to the randomprime function (including a return mechanism for the updated state); or, alternatively, putting the state into a global variable, which is okay for benchmarking but pointlessly separates benchmarking from production.

Why is this randomness-state-management effort being imposed upon every implementor instead of being done once inside the implementation of randombytes? Why should public-key people have to worry about randomness-generation details?

Second: NIST requires the deterministic expansion to use a NIST-approved DRBG (unless the submitter can explain why a NIST-approved DRBG wouldn't be "suitable"). This makes the submitter's job even more complicated: NIST's DRBGs are not particularly easy to understand or implement.

For comparison, if submitters simply call randombytes whenever they need randomness, then using a NIST-approved DRBG to generate randomness boils down to one person implementing the DRBG and providing it as randombytes. This is much better software engineering than having every submitter reimplement the DRBG.

Third: Are you thinking that it will at least be possible to submit keygenerate and encapsulate functions that simply call randombytes, so that readers can skip the bloated keygenerate_KAT and encapsulate_KAT?

Nope. NIST requires the non-deterministic keygenerate and encapsulate to limit their use of randombytes, generating only the same limited-size seed used by the KAT calls. So a simple implementation isn't possible: the submitter can't do better than having keygenerate and encapsulate call keygenerate_KAT and encapsulate_KAT respectively.

(You can and should provide simpler keypair and enc and dec functions for SUPERCOP, using randombytes for randomness.)

Recommendation to NIST: Change the NIST API to allow randombytes to be called as often as desired. Explicitly say that randombytes should be used whenever the specification calls for randomness. Rename keygenerate and encapsulate and decapsulate as keypair and enc and dec (but don't do this if you're still insisting on incompatibilities in the randomness semantics!). Scrap the extra KAT calls: they're a big step backwards from SUPERCOP's automatic known-answer tests.


Version: This is version 2017.07.19 of the 20170719-pqbench.html web page.