cr.yp.to: 2017.10.17: Quantum algorithms to find collisions

Newer (Access-K): 2017.11.05: Reconstructing ROCA: A case study of how quickly an attack can be developed from a limited disclosure. #infineon #roca #rsa

Older (Access-J): 2017.07.23: Fast-key-erasure random-number generators: An effort to clean up several messes simultaneously. #rng #forwardsecrecy #urandom #cascade #hmac #rekeying #proofs

Table of contents (Access-I for index page)

2025.04.23: McEliece standardization: Looking at what's happening, and analyzing rationales. #nist #iso #deployment #performance #security

2025.01.18: As expensive as a plane flight: Looking at some claims that quantum computers won't work. #quantum #energy #variables #errors #rsa #secrecy

2024.10.28: The sins of the 90s: Questioning a puzzling claim about mass surveillance. #attackers #governments #corporations #surveillance #cryptowars

2024.08.03: Clang vs. Clang: You're making Clang angry. You wouldn't like Clang when it's angry. #compilers #optimization #bugs #timing #security #codescans

2024.06.12: Bibliography keys: It's as easy as [1], [2], [3]. #bibliographies #citations #bibtex #votemanipulation #paperwriting

2024.01.02: Double encryption: Analyzing the NSA/GCHQ arguments against hybrids. #nsa #quantification #risks #complexity #costs

2023.11.25: Another way to botch the security analysis of Kyber-512: Responding to a recent blog post. #nist #uncertainty #errorbars #quantification

2023.10.23: Reducing "gate" counts for Kyber-512: Two algorithm analyses, from first principles, contradicting NIST's calculation. #xor #popcount #gates #memory #clumping

2023.10.03: The inability to count correctly: Debunking NIST's calculation of the Kyber-512 security level. #nist #addition #multiplication #ntru #kyber #fiasco

2023.06.09: Turbo Boost: How to perpetuate security problems. #overclocking #performancehype #power #timing #hertzbleed #riskmanagement #environment

2022.08.05: NSA, NIST, and post-quantum cryptography: Announcing my second lawsuit against the U.S. government. #nsa #nist #des #dsa #dualec #sigintenablingproject #nistpqc #foia

2022.01.29: Plagiarism as a patent amplifier: Understanding the delayed rollout of post-quantum cryptography. #pqcrypto #patents #ntru #lpr #ding #peikert #newhope

2020.12.06: Optimizing for the wrong metric, part 1: Microsoft Word: Review of "An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development" by Knauff and Nejasmic. #latex #word #efficiency #metrics

2019.10.24: Why EdDSA held up better than ECDSA against Minerva: Cryptosystem designers successfully predicting, and protecting against, implementation failures. #ecdsa #eddsa #hnp #lwe #bleichenbacher #bkw

2019.04.30: An introduction to vectorization: Understanding one of the most important changes in the high-speed-software ecosystem. #vectorization #sse #avx #avx512 #antivectors

2017.11.05: Reconstructing ROCA: A case study of how quickly an attack can be developed from a limited disclosure. #infineon #roca #rsa

2017.10.17: Quantum algorithms to find collisions: Analysis of several algorithms for the collision problem, and for the related multi-target preimage problem. #collision #preimage #pqcrypto

2017.07.23: Fast-key-erasure random-number generators: An effort to clean up several messes simultaneously. #rng #forwardsecrecy #urandom #cascade #hmac #rekeying #proofs

2017.07.19: Benchmarking post-quantum cryptography: News regarding the SUPERCOP benchmarking system, and more recommendations to NIST. #benchmarking #supercop #nist #pqcrypto

2016.10.30: Some challenges in post-quantum standardization: My comments to NIST on the first draft of their call for submissions. #standardization #nist #pqcrypto

2016.06.07: The death of due process: A few notes on technology-fueled normalization of lynch mobs targeting both the accuser and the accused. #ethics #crime #punishment

2016.05.16: Security fraud in Europe's "Quantum Manifesto": How quantum cryptographers are stealing a quarter of a billion Euros from the European Commission. #qkd #quantumcrypto #quantummanifesto

2016.03.15: Thomas Jefferson and Apple versus the FBI: Can the government censor how-to books? What if some of the readers are criminals? What if the books can be understood by a computer? An introduction to freedom of speech for software publishers. #censorship #firstamendment #instructions #software #encryption

2015.11.20: Break a dozen secret keys, get a million more for free: Batch attacks are often much more cost-effective than single-target attacks. #batching #economics #keysizes #aes #ecc #rsa #dh #logjam

2015.03.14: The death of optimizing compilers: Abstract of my tutorial at ETAPS 2015. #etaps #compilers #cpuevolution #hotspots #optimization #domainspecific #returnofthejedi

2015.02.18: Follow-You Printing: How Equitrac's marketing department misrepresents and interferes with your work. #equitrac #followyouprinting #dilbert #officespaceprinter

2014.06.02: The Saber cluster: How we built a cluster capable of computing 3000000000000000000000 multiplications per year for just 50000 EUR. #nvidia #linux #howto

2014.05.17: Some small suggestions for the Intel instruction set: Low-cost changes to CPU architecture would make cryptography much safer and much faster. #constanttimecommitment #vmul53 #vcarry #pipelinedocumentation

2014.04.11: NIST's cryptographic standardization process: The first step towards improvement is to admit previous failures. #standardization #nist #des #dsa #dualec #nsa

2014.03.23: How to design an elliptic-curve signature system: There are many choices of elliptic-curve signature systems. The standard choice, ECDSA, is reasonable if you don't care about simplicity, speed, and security. #signatures #ecc #elgamal #schnorr #ecdsa #eddsa #ed25519

2014.02.13: A subfield-logarithm attack against ideal lattices: Computational algebraic number theory tackles lattice-based cryptography.

2014.02.05: Entropy Attacks! The conventional wisdom says that hash outputs can't be controlled; the conventional wisdom is simply wrong.

2017.10.17: Quantum algorithms to find collisions: Analysis of several algorithms for the collision problem, and for the related multi-target preimage problem. #collision #preimage #pqcrypto

The title sounds scary, doesn't it? Collision-resistance is one of the primary features promised by cryptographic hash functions such as SHA-256 and SHA3-256. The CNS algorithm is generic: it isn't specific to any particular hash function, so it applies to SHA-256 and SHA3-256 and everything else that comes to mind. Some signature systems are designed to be "collision-resilient", meaning that they aren't broken by collisions in the underlying hash function; but collision-finding is a basic subroutine inside many more attack algorithms. An efficient generic algorithm for collision search would have huge ramifications.

The reality, however, is that the CNS algorithm is not an efficient way to find collisions. In fact, it's so inefficient that it's beaten by the standard non-quantum method of finding collisions, namely "parallel rho" collision search from 1997 van Oorschot–Wiener. Let's take an in-depth look at what's going on here.

Hardware cost and time: a preliminary example

The goal is to find collisions in an n-bit hash function H, i.e., a hash function producing n-bit output. This means finding two different inputs, say x and y, such that H(x) = H(y).

The basic strategy for finding collisions, if you don't know anything special about H, is to write down many hash outputs for different inputs and hope that two of the outputs are the same. The famous "birthday paradox" says that this becomes likely to occur when the number of hash outputs grows to the scale of 2^n/2. (You can also try a smaller computation, with far fewer than 2^n/2 hash outputs, but then the success chance drops quadratically. In this blog post I'll focus on computations with a large success chance.)

You can find this collision by storing the output-input pairs in an associative array. Here's a Python script that tries 2²¹ outputs for a 40-bit hash function and has a good chance of finding a collision:

Either approach uses many megabytes of storage. Obviously storing 2^n/2 hash outputs involves massive hardware costs when n is reasonably large. But let's imagine that this is feasible. What's less obvious is how much time the computation takes.

A single access to a huge associative array takes a long time. Think about a square chip large enough to hold 2^n/2 hash outputs: simply communicating from one side of the chip to the other takes time proportional to 2^n/4. Packing data more densely into a cube reduces the distance from 2^n/4 to 2^n/6, but raises unsolved problems of getting enough energy into a cubical computing device, and getting the resulting heat out. I'll focus on conventional two-dimensional architectures in this blog post.

If one access takes time 2^n/4, do 2^n/2 accesses take time 2^3n/4? Not necessarily: the accesses can be parallelized. There are algorithms that sort 2^n/2 items on a square chip in total time proportional to 2^n/4: there are 2^n/2 miniature cores, each carrying out (say) 8⋅2^n/4 compare-exchange steps with adjacent cores. Schimmler's 4-way merge sort is an easy-to-understand algorithm of this type. More complicated algorithms reduce the constant 8.

Of course the initial 2^n/2 hash computations can also be parallelized: simply run 2^n/2 separate hash-computation cores. This won't be a bottleneck as long as the time for one hash computation is below the 2^n/4 scale for memory access.

Or run, say, 2^n/2/64 hash-computation cores, each 64 times, distributing the 64 hash outputs to 64 nearby memory units. This still won't be a bottleneck as long as the time for 64 hash computations is below the 2^n/4 scale for memory access.

To summarize, the hardware costs are on the scale of 2^n/2, or more precisely 2^n/2/64 hash-computation cores plus 2^n/2 compare-exchange cores. The time is on the scale of 2^n/4, or more precisely 64 hash computations plus 8⋅2^n/4 compare-exchange steps. The attacker is likely to save a constant factor by looking more closely at the details, including the hash-function details, but in this blog post I'll focus on the exponents: exponent n/2 for hardware cost and exponent n/4 for time.

The parallel rho method

The parallel rho method takes much less hardware and less time. Specifically, it reduces the time to 2^n/6, with hardware costs on the scale of 2^n/3. More generally, the parallel rho method offers a huge range of tradeoffs, such as time 2^n/4 using 2^n/4 hardware, or time 2^n/3 using 2^n/6 hardware.

The rho method uses a small hashing core that starts from some random input x and computes a series of hashes as quickly as possible: H(x), H²(x), H³(x), etc. Here H²(x) means H(H(x)); H³(x) means H(H(H(x))); etc. An important feature of the rho method is that most hash outputs aren't saved; we'll see later how to retroactively recognize collisions.

The core stops when it finds a distinguished point. A distinguished point, by definition, is a string that starts with d zero bits. A random string has probability 1/2^d of being distinguished, so presumably the core will find a distinguished point within about 2^d hash calls on average.

What if the function H avoids, or tends to avoid, distinguished points? To avoid bad cases, the attacker can replace H(x) with E_k(H(x)), where E is the attacker's favorite block cipher, and k is a secret key chosen at the beginning of the attack. A collision E_k(H(x)) = E_k(H(y)) is equivalent to a collision H(x) = H(y). I'll skip E_k for the rest of my description: real hash functions such as SHA-256 are experimentally observed to hit distinguished points at the expected rate.

What if the core enters a cycle before it reaches a distinguished point? For example, what happens if the core sees five different non-distinguished points x,H(x),H²(x),H³(x),H⁴(x) and then H⁵(x) turns out to be the same as H²(x)? Let's put a time limit on the computation, say 2^d+1 hash calls, so the core won't cycle forever. (It's possible, but not worthwhile, to go to more effort to find the collision between inputs H⁴(x) and H(x) in this example.)

Now build p of these small cores running in parallel, starting with random inputs x₁,x₂,...,x_p. Each core continues for at most 2^d+1 hash calls. Most of the cores will find distinguished points. The total number of hash outputs encountered by all of the cores together is proportional to 2^dp.

As before, I'll assume that the total number of hash outputs is on the scale of 2^n/2. This means that 2^dp is on the scale of 2^n/2. For example, one can take d as n/6 and p around 2^n/3; or d as n/4 and p around 2^n/4; or d as n/3 and p around 2^n/6. Most cores won't enter cycles as long as d is noticeably smaller than n/2.

Here's a Python script that simulates this process for p = 2⁵ cores, each running for at most 2¹⁷ steps, using the same 40-bit minihash shown above:

This script defines distinguished points differently: namely, strings where the first two bytes are XX. What matters is that a random string has a specified probability, in this case 1/2¹⁶, of being distinguished. The script also takes sequential inputs x₁,x₂,...,x_p rather than independent random inputs.

What happens if one non-distinguished point, say H⁷⁷⁵⁸⁴(x₇), is different from another, say H⁹²⁷⁹⁷(x₂₀), but the next hash value H⁷⁷⁵⁸⁵(x₇) collides with H⁹²⁷⁹⁸(x₂₀)? The hash value after that, H⁷⁷⁵⁸⁶(x₇), will equal H⁹²⁷⁹⁹(x₂₀), and so on. Cores 7 and 20 will end up at the same distinguished point H^77584+s(x₇) = H^92797+s(x₂₀) for some positive integer s, assuming that both cores run long enough to reach this distinguished point. If you see that this happened, and if you're also given records of x₇ and x₂₀ and the counters 77584+s,92797+s, then you can quickly find the collision by computing the difference 15213 of the counters and then checking H⁰(x₇) against H¹⁵²¹³(x₂₀), checking H¹(x₇) against H¹⁵²¹⁴(x₂₀), etc. Here's a Python script that does this, starting from the array h of distinguished points produced earlier:

In other words, to find collisions of hash values, we simply need to find collisions of distinguished points. This is useful because there are far fewer distinguished points than hash values.

The parallel rho method stores the distinguished points in an associative array, or simply sorts them, as in this last Python script. Parallel sorting takes time proportional to p^1/2 on a square chip. This becomes a bottleneck compared to the 2^d+1 hash computations as p grows past 2^n/3, but it isn't an issue for smaller values of p.

Oversimplification: counting operations

Textbooks on algorithm design generally don't teach students to measure hardware cost and time. Instead they teach students to count the number of operations. (Even worse, the number of operations is typically mislabeled "time".)

These metrics aren't completely unrelated: a chip of area A running for time T can't carry out more than AT operations. But a single operation can involve a huge chip for a significant amount of time. Students are taught, for example, that randomly accessing an array of size A is a single operation, but in fact this operation involves a chip of area A for time A^1/2.

The parallel rho method is vastly more efficient than simply storing 2^n/2 hash outputs. It's hard to see this advantage if one merely counts operations.

Students often learn to count memory usage as a secondary metric, and then they can see that the rho method reduces memory. But they still don't learn that counting operations is a poor predictor of time. They think, for example, that sorting an array is extremely efficient; they don't learn that sorting many hash outputs is much more expensive than computing those hash outputs, with the cost ratio growing as a positive power of the array size.

People who focus on minimizing the number of operations often end up with computations that are unnecessarily large and unnecessarily slow. Consider, for example, the problem of enumerating all primes up to 2¹⁰⁰. Textbook sieving makes an array of size 2¹⁰⁰, crosses off every second number, crosses off every third number, etc. This uses very few operations, but better-optimized algorithms run faster using much less hardware.

Counting operations is simple, and there is value in simplicity. When I'm analyzing and optimizing algorithms, I usually start by counting operations and making an effort to understand the minimum possible number of operations. But I know that some operations are actually much more expensive than others.

Quantum computation

The remaining algorithms that I'll discuss here are quantum algorithms. These algorithms are built from the basic operations that serious quantum-computer engineers are trying to build, specifically "Hadamard gates" and "Toffoli gates" applied to large numbers of "qubits".

The quantum-computer engineers haven't given us these operations yet. There's a company named D-Wave selling "quantum computers"; but my understanding of the scientific consensus is that the current D-Wave computers can be much more cost-effectively simulated by traditional computers and are therefore useless. On the other hand, D-Wave is

Google says that it is building a 49-qubit "quantum computer" that, if it works, will provide "quantum supremacy". What this means is that we don't know any way for a traditional computer to accurately simulate the output of the device as cost-effectively as running the device. But, except for the marketing buzzwords, there's nothing new about devices meeting this notion of "quantum supremacy". Anyone can

More to the point, Google's chip still won't be able to run any of the quantum algorithms discussed below.

I don't mean to dismiss the risk of a coming quantum apocalypse. I see many signs of progress in quantum computing. I've recently bet $2048 that the RSA-2048 challenge will be publicly factored by a quantum computer by 2033.

This doesn't mean, however, that quantum computers are going to magically speed up all computations. Big quantum speedups, such as Shor's speedup for integer factorization, have been found for only a few special types of computations. Most quantum speedups are only moderate, such as Grover's method to speed up brute-force searches, which I'll discuss in a moment. Many more computations don't seem to be sped up at all; this is particularly common for computations involving large amounts of data.

Grover's method

People frequently say that quantum computers will force us to move from AES-128 to AES-256. The reason is Grover's method.

Suppose a protocol reveals the AES-128 encryption of a known plaintext block under a secret key. Many protocols do this, and of course AES-128 should remain secure in this known-plaintext scenario. How quickly can an attacker find the key? The obvious attack is to search through all 2¹²⁸ keys; on average this will succeed after about 2¹²⁷ keys. This is too expensive to be a concern today.

Grover's method sounds much more worrisome: it searches through all 2¹²⁸ keys using just 2⁶⁴ quantum evaluations of AES. But this operation count is only the beginning of the analysis. Given current quantum-computer designs, it is reasonable to predict that a quantum evaluation of AES is going to cost considerably more hardware and time than a non-quantum evaluation of AES. Furthermore, Grover's method performs its 2⁶⁴ iterations serially. If one quantum evaluation of AES takes a microsecond, and the attacker wants results within a decade, then Grover's method is limited to 2⁴⁸ iterations, so it searches only 2⁹⁶ keys. Searching all 2¹²⁸ keys requires 2³² parallel copies of this computation.

Breaking AES-128 with Grover's method is much less efficient than breaking RSA-2048 with Shor's algorithm. It could be so inefficient as to be infeasible. Even worse, for reasonable time limits, the speedup from Grover's method could be wiped out by the overhead of quantum operations compared to non-quantum operations. This means that Grover's method could be even more expensive than non-quantum attacks against AES-128. In other words, even if Shor's algorithm is successful, Grover's method could turn out to be useless. At the moment NIST sounds quite skeptical about Grover's method:

All of the quantum algorithms that I'm going to discuss are based on generalizations and applications of Grover's method. If Grover's method turns out to be useless then these algorithms will be useless too. But I'm going to be more optimistic for the attacker: I'm going to assume that quantum overheads can be reduced far enough to make Grover's method useful.

The Brassard–Høyer–Tapp algorithm

Grover's paper was posted in 1996. The next year, Brassard, Høyer, and Tapp announced that they could find collisions in an n-bit hash function using only about 2^n/3 quantum evaluations of the function. I'll write "BHT" for Brassard, Høyer, and Tapp.

This 2^n/3 sounds like a solid improvement compared to the non-quantum 2^n/2. The exponent is dropping by a factor 1.5: not as large as Grover's factor 2, but still important for deciding how big n needs to be for security. This is why SIDH uses 768-bit fields instead of 512-bit fields, for example. But looking at hardware cost and time tells a completely different story.

Imagine that there is some magical way to quickly carry out comparisons to a giant precomputed list of targets H(x₁),H(x₂),...,H(x_T). The series of 2^n/3 quantum evaluations of f still includes a series of 2^n/3 quantum evaluations of H, taking time on the scale of 2^n/3. The hardware cost of this algorithm is also on the scale of 2^n/3.

For comparison, parallel rho finds a collision in the time for 2^n/4 non-quantum evaluations of H, using just 2^n/4 cores. For reasonable functions H, this is much less time and much less hardware than the BHT algorithm. (The picture could change if H cores are gigantic, for example if H is a "password hashing" function, but then the overheads in Grover's method will also be gigantic.)

In short, the BHT algorithm is a step backwards from the non-quantum parallel rho algorithm. I pointed out this problem in a 2009 paper. I also analyzed some variants of the BHT algorithm but couldn't find any variants better than parallel rho:

I'm not aware of any dispute about any part of my algorithm analysis. NIST, in its call for post-quantum submissions, commented that the best algorithm known to find an AES-128 key takes "2¹⁷⁰/MAXDEPTH quantum gates or 2¹⁴³ classical gates", while the best algorithm known to find a SHA3-256 collision takes "2¹⁴⁶ classical gates" with no quantum speedup.

Applying the BHT algorithm to occasional hash values

This brings me to the new CNS paper, "An efficient quantum collision search algorithm and implications on symmetric cryptography". Let's look at what this paper does.

The CNS paper has a parameter r. An example highlighted in the paper is r = n/5. Define H_r(x) as follows: search through r-bit strings s to find strings H(x,s) ending with r zero bits; output a random such string, assuming at least one exists. Using Grover's method one can compute H_r(x) using approximately 2^r/2 operations. The paper's strategy to find a collision in H is to find a collision in H_r.

Let's try a sanity check at this point. Instead of applying parallel rho to an n-bit hash function, do we gain anything by applying parallel rho to an (n-r)-bit hash function that takes 2^r/2 times as long to compute? As before, let's ignore quantum overhead; and let's assume that we can work around the complications of H_r being non-deterministic and often undefined.

Say we can afford 2^n/4 hardware. Applying rho to the n-bit hash function takes time 2^n/4. Applying rho to an equally fast (n-r)-bit hash function would take time only 2^n/4-r/2, saving a factor 2^r/2. But we actually have a function that's 2^r/2 times slower, so the savings disappears. There's no advantage here.

Back to the paper. The paper applies the BHT algorithm to H_r: i.e., it precomputes H_r(x₁),...,H_r(x_T) and then uses Grover's method to search for y colliding with one of x₁,...,x_T under H_r. (T is "2^t-r" in the paper's notation.)

If BHT can't beat parallel rho, and parallel rho applied to H_r can't beat parallel rho applied to H, then how can BHT applied to H_r beat parallel rho applied to H? (As mentioned above, BHT doesn't look so bad if H cores are gigantic, but replacing H with H_r doesn't move in this direction.)

Let's go through the analysis of time and hardware cost. Presumably y collides with x_i under H_r with probability about 1/2^n-r, since r bits of H_r are already forced to be 0; and thus collides with at least one of x₁,...,x_T with probability about T/2^n-r. Grover's method thus involves 2^(n-r)/2/T^1/2 iterations (or "2^(n-t)/2" in the paper's notation).

Each iteration involves, among other things, an evaluation of H_r, which is 2^r/2 iterations of H. Grover's method thus takes time at least 2^n/2/T^1/2. Meanwhile the hardware cost is on the scale of T. At this point r has disappeared: for example, the case T = 2^n/3 takes time at least 2^n/3 with hardware cost 2^n/3, just like the original BHT case that I described above. This is again beaten by parallel rho.

CNS false advertising, part 1

The CNS paper claims that if "time-space product (including classical space) ... is the quantity of interest then we can take s = n/5 in our quantum parallel algorithm and we will obtain a time-space product of Õ(2^12n/25) ... which again beats the best classical algorithms with this benchmarking." Jean-Philippe Aumasson tweeted that this 12n/25 was interesting.

However, the 12n/25 claim is simply wrong. The time-space product actually has exponent 13n/25, which is worse than the n/2 from parallel rho. Let's look at the details.

As background, the best BHT variant considered in my 2009 paper (page 8: "quantum analogue of the more sophisticated size-M machine discussed above") was to compute H(y₁),...,H(y_T) in parallel and check for any collisions with H(x₁),...,H(x_T). This variant also doesn't beat parallel rho, although it does match parallel rho if communication costs are ignored.

The CNS paper, not crediting my paper, similarly considers "2^s" cores computing H_r(y₁),H_r(y₂),... in parallel. Instead of one core using 2^(n-r)/2/T^1/2 iterations, there are 2^s cores each using 2^(n-r-s)/2/T^1/2 iterations. Each iteration takes time 2^r/2 as before, so each core uses time 2^(n-s)/2/T^1/2.

Rather than sorting, the paper applies each of H_r(x₁),...,H_r(x_T) successively to each of the quantum cores. This takes time T, asymptotically slower than the T^1/2 for sorting, although it has the virtue of not requiring quantum communication between the quantum cores. The paper takes T specifically as 2^r/2 to balance these T steps against the cost of an H_r computation. Each core thus uses time 2^(n-s)/2-r/4.

A reasonable way to handle the communication here would be to put H_r(x₁),...,H_r(x_T) into a giant "shift register", a specialized T-core device that gradually rotates its input. If the first 2^s of these cores are attached to the 2^s quantum cores, then each of the quantum cores sees each of H_r(x₁),...,H_r(x_T) within exactly T steps. I'm assuming here that 2^s is at most T.

The initial computation of H_r(x₁),...,H_r(x_T) takes 2^r/2 T = 2^r steps, split as 2^r-s steps on each of the 2^s quantum cores. To balance this against the subsequent time 2^(n-s)/2-r/4, the paper takes r = 2n/5 + 2s/5. The time is now 2^2n/5-3s/5. The assumption that 2^s is at most T translates into s being at most n/4.

The CNS paper takes s = n/5. The time is then 2^7n/25. There are 2^n/5 quantum cores, plus T = 2^6n/25 cores in the shift register. The "time-space product (including classical space)" is the product of 2^7n/25 and 2^6n/25, namely 2^13n/25.

How does the CNS paper claim 2^12n/25? Details of the calculation aren't provided. Presumably the authors counted only the 2^n/5 quantum cores, and forgot to count the 2^6n/25 non-quantum cores storing 2^6n/25 hash outputs, even though they said they were (properly) "including classical space". Oops.

CNS false advertising, part 2

It's common for a paper saying "Algorithm A is beaten by Algorithm B" to be followed by a paper saying "Algorithm B is beaten by Algorithm C". The paper introducing Algorithm C typically tries to argue that Algorithm C is surprising. Occasionally there's an easy argument for this: namely, someone wrote "I conjecture that B is the best algorithm". Typically this conjecture happens because the author writing "Algorithm A is beaten by Algorithm B" thought through avenues of further improvements, and decided that Algorithm B was one of the rare algorithms that couldn't be improved.

Part of the CNS advertising is that its collision algorithm has lower cost than parallel rho. This is false advertising, as I explained above. Another part of the CNS advertising is that parallel rho was conjectured—even more extreme, claimed—to be optimal. This is also false advertising, as I'll explain now.

The first public version of the CNS paper (3 Sep 2017 on eprint) included the following fabrications: first, that in my 2009 paper I "claimed that quantum computers would not make any improvement over on-purpose hardware for collision search", second, that I "remarked that all post-quantum algorithms for collision search had an effective cost of at least 2^n/2".

In fact, my paper clearly and repeatedly states that it is analyzing the costs of known algorithms. Here's the quote again:

I didn't make any claims regarding "all post-quantum algorithms for collision search"; I explicitly referred to all known quantum algorithms for collision search. I made one narrow conjecture regarding a specific class of collision algorithms that I had thought about ("There are several obvious ways to combine quantum search with the rho method, but I have not found any such combinations that improve performance, and I conjecture that—in a suitable generic model—no such improvements are possible"); I avoided making broader conjectures.

After seeing the CNS paper, I privately asked CNS for an erratum. CNS removed the fabrications from a subsequent update of their paper. However, removing a false statement isn't the same as admitting that it's false. There's still no erratum.

This might not seem important as long as nobody has managed to beat 2^n/2. But that's not the point. I devote a tremendous amount of effort to carefully assessing cryptographic risks and other security risks. I have very high standards for security claims, security conjectures, and security recommendations. It was wrong for the CNS paper to misrepresent my paper as making a security claim that my paper does not, in fact, make.

Interlude: multi-target preimages

The collision problem can be viewed as a simplification of another problem that shows up very frequently, namely the multi-target preimage problem.

In the multi-target preimage problem, one is given H(x₁),...,H(x_T), and the goal is to find some y such that H(y) is in the set {H(x₁),...,H(x_T)}. A multi-target preimage attack immediately implies a collision attack (assuming H is reasonably far from being injective), so the multi-target preimage problem is at least as difficult as the collision problem, perhaps more difficult. At the end of this blog post I'll say more about the gap between these problems.

Part of the security analysis in the SPHINCS paper considers the post-quantum security of multi-target preimages. "Bernstein [8] concludes that all known post-quantum collision-finding algorithms cost at least 2^n/2, implying that the post-quantum cost of multi-target preimage attacks is also 2^n/2", the paper states. "For example, for n = 256 and T = 2⁵⁶ the best post-quantum attacks use only 2¹⁰⁰ queries but still cost 2¹²⁸."

The cost metric used here is again price-performance ratio, i.e., the product of time and hardware cost. Obviously time and hardware cost can each be reduced below 2^n/2: the parallel rho method does this for collision search, and parallel Grover does this for preimage search. But nobody has published an algorithm where the product is below 2^n/2.

CNS false advertising, part 3

The CNS paper advertises multi-target preimage search in very much the same way that it advertises collision search: it falsely attributes conjectures to previous papers that did not in fact state those conjectures; it then falsely claims to disprove those conjectures.

Specifically, the CNS paper considers the same 2⁵⁶-target 256-bit preimage-search example used in the SPHINCS paper. The CNS paper states "they claim that the best quantum algorithm has a cost of 2¹²⁸". This omits a critical piece of context, the word "known" from the SPHINCS paper. Someone reading the SPHINCS paper finds an undisputed analysis of known algorithms; someone reading the summary in the CNS paper incorrectly thinks that the SPHINCS paper was making a claim regarding all algorithms.

The CNS paper then says that "it is possible to attack their example with a time complexity of ... 2^119.6". What exactly is this supposed to be contradicting? Cost in the SPHINCS paper is not time: it is the product of time and hardware cost. The SPHINCS paper does not make the silly claim that time alone is at least 2^n/2: as noted above, parallel rho beats time 2^n/2 for collision search, and parallel Grover beats time 2^n/2 for preimage search.

The CNS attack does not beat cost 2¹²⁸ for n=256: it uses 2^119.6 quantum operations times megabytes of memory. The CNS attack beats time 2¹²⁸ (if the quantum overhead is low enough), but this is not new.

CNS vagueness

Beyond the clear errors described above, the CNS paper has various claims of novelty whose meaning is unclear, making the claims difficult to evaluate.

The abstract claims that the paper has "the first proof of an actual quantum time speedup for collision search". What do CNS think they mean by "actual time"?

The CNS paper claims that its parallel algorithms "are the first ones to significantly improve on the best classical algorithms" for collision search, and provide "better performance than classical algorithms for a wide range [of] parameters." In precisely what metric do they think they have an improvement compared to non-quantum collision-finding algorithms?

Of course one can point to, e.g., the number of H queries as a metric where the CNS algorithm is faster than non-quantum collision-finding algorithms. But then the claim to be "first" is wrong: the BHT algorithm already does better in this metric. The CNS paper also makes this claim for multi-target preimage search, but how can they say this when non-quantum preimage algorithms are already beaten by Grover's method?

The CNS paper also claims to "solve an open problem and contradict a conjecture on the complexity of quantum collision and multi-target preimage search ... We ask the reader to keep in mind that these results seemed particularly surprising as it was conjectured that quantum algorithms for solving these problems wouldn't be more efficient than classical ones." What exactly is the "open problem" that the authors claim to be solving? Where is the "conjecture" that they claim to be contradicting? And, again, why are they ignoring Grover's method in the case of preimage search?

Multi-target preimage search, revisited

Let me close by returning to the gap between the complexities of collision search and multi-target preimage search.

In pre-quantum cryptography, multi-target preimage search is quite worrisome. It's the main reason that I tell people to upgrade from 128-bit ciphers to 256-bit ciphers. But multi-target preimage search is certainly not as fast as collision search.

In post-quantum cryptography, single-target preimage search suddenly drops down to 2^n/2 operations. Multi-target preimage search is therefore also 2^n/2 operations: simply focus on the first target. This is just as few operations as collision search, if we ignore quantum overhead.

The post-quantum situation becomes more subtle if we look at the whole tradeoff curve between hardware cost and time. For example, if the time limit is 2^n/3:

This leaves a range of possibilities between 2^n/6 and 2^n/3 for the hardware cost of multi-target preimage search.

Banegas and I published a paper on this topic at SAC 2017. There are various epsilon-power overheads in the paper that need to be analyzed in more detail, but the asymptotic exponents are clear: compared to single-target preimage search, we reduce the hardware cost by a factor T^1/2-ε for T targets. (This breaks down when T is beyond the hardware size, but let's focus on the important case of a well-equipped attacker who has more hardware than targets.) Equivalently, we save a factor T^1/4-ε in time (with a different ε) for the same hardware cost.

If long-distance communication were free then we would do even better, reducing the hardware cost by a factor close to T. The idea here, as in my 2009 paper, is to make T guesses y₁,...,y_T in parallel, and then check whether any of H(y₁),...,H(y_T) match any of H(x₁),...,H(x_T), by sorting the whole list. This has chance T²/2ⁿ of success, and has hardware cost on the scale of T. Free communication means that parallel sorting takes very little time, so with time limit 2^n/3 we can afford 2^n/3 Grover iterations, checking 2^2n/3 possibilities for the vector (y₁,...,y_T) and thus increasing the success chance from T²/2ⁿ to T²/2^n/3. Running the same thing 2^n/3/T² times in parallel has a good chance of success, and hardware cost on the scale of 2^n/3/T.

With realistic communication costs, sorting slows down to T^1/2. Losing a factor T^1/2 in the number of Grover iterations would reduce the success chance to T/2^n/3, not beating single-target preimage search for the same amount of hardware. To do better, we combine Grover's method with the rho method: we compute roughly T^1/2 iterations of H on each of y₁,...,y_T,x₁,...,x_T, stopping at distinguished points, and sort the results. The paper explains how we do all of this reversibly, as required for Grover's method. In the end we save a factor T^1/2-ε in hardware cost for the same time limit.

The CNS paper instead compares H_r(y) to H(x₁),...,H(x_T). Since H_r(y) starts with r zero bits, there's no point in checking H(x_i) if H(x_i) doesn't start with r zero bits. CNS thus begin by paring the list H(x₁),...,H(x_T) down to a reduced list of approximately T/2^r targets that do start with r zero bits. Each guess for y thus has chance about (T/2^r)/2^n-r = T/2ⁿ of success. Grover's method uses 2^(n-s)/2/T^1/2 iterations on each of 2^s parallel cores.

The CNS paper takes r as (2/3) log₂ T so that serially checking T/2^r = T^1/3 targets is balanced against the cost of an H_r computation. Each core thus takes time T^1/3 2^(n-s)/2/T^1/2 = 2^(n-s)/2/T^1/6. To bring the time down to, say, 2^n/3, one must take 2^s as 2^n/3/T^1/3, and then the hardware cost is 2^n/3/T^1/3. This saves a factor T^1/3 in hardware cost for the same time limit. (As before, this breaks down when T becomes very large.) My result with Banegas is better, at least asymptotically: T^1/2-ε is more of a savings than T^1/3.

Banegas and I put our paper online the week of SAC 2017. The CNS paper was posted a few weeks later, but the authors say that it was written independently, and there's no reason to question this. (The SAC and Asiacrypt submission deadlines were months earlier.) The CNS paper could and should have simply reported its straightforward T^1/3 improvement for multi-target preimage search, without the false advertising discussed above. But then the paper wouldn't have been accepted to Asiacrypt.

The cr.yp.to blog