# The cr.yp.to blog

## 2017.10.17: Quantum algorithms to find collisions

"An efficient quantum collision search algorithm and implications on symmetric cryptography" is a new paper from André Chailloux, Marı́a Naya-Plasencia, and André Schrottenloher. I'll write "CNS" for the list of authors. The CNS paper was accepted to Asiacrypt 2017; Asiacrypt is one of the three yearly flagship conferences of the International Association for Cryptologic Research. The paper appeared online last month.

The title sounds scary, doesn't it? Collision-resistance is one of the primary features promised by cryptographic hash functions such as SHA-256 and SHA3-256. The CNS algorithm is generic: it isn't specific to any particular hash function, so it applies to SHA-256 and SHA3-256 and everything else that comes to mind. Some signature systems are designed to be "collision-resilient", meaning that they aren't broken by collisions in the underlying hash function; but collision-finding is a basic subroutine inside many more attack algorithms. An efficient generic algorithm for collision search would have huge ramifications.

The reality, however, is that the CNS algorithm is not an efficient way to find collisions. In fact, it's so inefficient that it's beaten by the standard non-quantum method of finding collisions, namely "parallel rho" collision search from 1997 van Oorschot–Wiener. Let's take an in-depth look at what's going on here.

### Hardware cost and time: a preliminary example

The goal is to find collisions in an n-bit hash function H, i.e., a hash function producing n-bit output. This means finding two different inputs, say x and y, such that H(x) = H(y).

The basic strategy for finding collisions, if you don't know anything special about H, is to write down many hash outputs for different inputs and hope that two of the outputs are the same. The famous "birthday paradox" says that this becomes likely to occur when the number of hash outputs grows to the scale of 2n/2. (You can also try a smaller computation, with far fewer than 2n/2 hash outputs, but then the success chance drops quadratically. In this blog post I'll focus on computations with a large success chance.)

You can find this collision by storing the output-input pairs in an associative array. Here's a Python script that tries 221 outputs for a 40-bit hash function and has a good chance of finding a collision:

```     import hashlib
def minihash(seed): # 40-bit hash
h = hashlib.sha256()
h.update(seed)
return h.digest()[:5]

import random
def randomhash():
r = hex(random.randrange(2**100))
return (minihash(r),r)

d = dict()
for i in range(2**21):
h,r = randomhash()
if h in d:
print r,d[h]
break
d[h] = r
```

Alternatively, you can sort the output-input pairs:

```     h = [randomhash() for i in range(2**21)]
h.sort()
for i in range(1,len(h)):
if h[i] == h[i-1]:
print h[i],h[i-1]
```

Either approach uses many megabytes of storage. Obviously storing 2n/2 hash outputs involves massive hardware costs when n is reasonably large. But let's imagine that this is feasible. What's less obvious is how much time the computation takes.

A single access to a huge associative array takes a long time. Think about a square chip large enough to hold 2n/2 hash outputs: simply communicating from one side of the chip to the other takes time proportional to 2n/4. Packing data more densely into a cube reduces the distance from 2n/4 to 2n/6, but raises unsolved problems of getting enough energy into a cubical computing device, and getting the resulting heat out. I'll focus on conventional two-dimensional architectures in this blog post.

If one access takes time 2n/4, do 2n/2 accesses take time 23n/4? Not necessarily: the accesses can be parallelized. There are algorithms that sort 2n/2 items on a square chip in total time proportional to 2n/4: there are 2n/2 miniature cores, each carrying out (say) 8⋅2n/4 compare-exchange steps with adjacent cores. Schimmler's 4-way merge sort is an easy-to-understand algorithm of this type. More complicated algorithms reduce the constant 8.

Of course the initial 2n/2 hash computations can also be parallelized: simply run 2n/2 separate hash-computation cores. This won't be a bottleneck as long as the time for one hash computation is below the 2n/4 scale for memory access.

Or run, say, 2n/2/64 hash-computation cores, each 64 times, distributing the 64 hash outputs to 64 nearby memory units. This still won't be a bottleneck as long as the time for 64 hash computations is below the 2n/4 scale for memory access.

To summarize, the hardware costs are on the scale of 2n/2, or more precisely 2n/2/64 hash-computation cores plus 2n/2 compare-exchange cores. The time is on the scale of 2n/4, or more precisely 64 hash computations plus 8⋅2n/4 compare-exchange steps. The attacker is likely to save a constant factor by looking more closely at the details, including the hash-function details, but in this blog post I'll focus on the exponents: exponent n/2 for hardware cost and exponent n/4 for time.

### The parallel rho method

The parallel rho method takes much less hardware and less time. Specifically, it reduces the time to 2n/6, with hardware costs on the scale of 2n/3. More generally, the parallel rho method offers a huge range of tradeoffs, such as time 2n/4 using 2n/4 hardware, or time 2n/3 using 2n/6 hardware.

The rho method uses a small hashing core that starts from some random input x and computes a series of hashes as quickly as possible: H(x), H2(x), H3(x), etc. Here H2(x) means H(H(x)); H3(x) means H(H(H(x))); etc. An important feature of the rho method is that most hash outputs aren't saved; we'll see later how to retroactively recognize collisions.

The core stops when it finds a distinguished point. A distinguished point, by definition, is a string that starts with d zero bits. A random string has probability 1/2d of being distinguished, so presumably the core will find a distinguished point within about 2d hash calls on average.

What if the function H avoids, or tends to avoid, distinguished points? To avoid bad cases, the attacker can replace H(x) with Ek(H(x)), where E is the attacker's favorite block cipher, and k is a secret key chosen at the beginning of the attack. A collision Ek(H(x)) = Ek(H(y)) is equivalent to a collision H(x) = H(y). I'll skip Ek for the rest of my description: real hash functions such as SHA-256 are experimentally observed to hit distinguished points at the expected rate.

What if the core enters a cycle before it reaches a distinguished point? For example, what happens if the core sees five different non-distinguished points x,H(x),H2(x),H3(x),H4(x) and then H5(x) turns out to be the same as H2(x)? Let's put a time limit on the computation, say 2d+1 hash calls, so the core won't cycle forever. (It's possible, but not worthwhile, to go to more effort to find the collision between inputs H4(x) and H(x) in this example.)

Now build p of these small cores running in parallel, starting with random inputs x1,x2,...,xp. Each core continues for at most 2d+1 hash calls. Most of the cores will find distinguished points. The total number of hash outputs encountered by all of the cores together is proportional to 2dp.

As before, I'll assume that the total number of hash outputs is on the scale of 2n/2. This means that 2dp is on the scale of 2n/2. For example, one can take d as n/6 and p around 2n/3; or d as n/4 and p around 2n/4; or d as n/3 and p around 2n/6. Most cores won't enter cycles as long as d is noticeably smaller than n/2.

Here's a Python script that simulates this process for p = 25 cores, each running for at most 217 steps, using the same 40-bit minihash shown above:

```     start = random.randrange(2**100)
h = []
for core in range(2**5):
# all cores can run in parallel
seed = hex(start + core)
r,j = seed,0
while j < 2**17:
r,j = minihash(r),j+1
if r[:2] == 'XX':
h += [(r,seed,j)]
break
```

This script defines distinguished points differently: namely, strings where the first two bytes are XX. What matters is that a random string has a specified probability, in this case 1/216, of being distinguished. The script also takes sequential inputs x1,x2,...,xp rather than independent random inputs.

What happens if one non-distinguished point, say H77584(x7), is different from another, say H92797(x20), but the next hash value H77585(x7) collides with H92798(x20)? The hash value after that, H77586(x7), will equal H92799(x20), and so on. Cores 7 and 20 will end up at the same distinguished point H77584+s(x7) = H92797+s(x20) for some positive integer s, assuming that both cores run long enough to reach this distinguished point. If you see that this happened, and if you're also given records of x7 and x20 and the counters 77584+s,92797+s, then you can quickly find the collision by computing the difference 15213 of the counters and then checking H0(x7) against H15213(x20), checking H1(x7) against H15214(x20), etc. Here's a Python script that does this, starting from the array h of distinguished points produced earlier:

```     import binascii
def hexstr(s):
return binascii.hexlify(s)

h.sort()
for i in range(1,len(h)):
if h[i] == h[i-1]:
r0,j0 = h[i],-h[i]
r1,j1 = h[i-1],-h[i-1]
for i in range(max(j0,j1),0):
while j0 < i:
r0,j0 = minihash(r0),j0+1
while j1 < i:
r1,j1 = minihash(r1),j1+1
if minihash(r0) == minihash(r1):
print hexstr(r0),hexstr(r1)
break
```

In other words, to find collisions of hash values, we simply need to find collisions of distinguished points. This is useful because there are far fewer distinguished points than hash values.

The parallel rho method stores the distinguished points in an associative array, or simply sorts them, as in this last Python script. Parallel sorting takes time proportional to p1/2 on a square chip. This becomes a bottleneck compared to the 2d+1 hash computations as p grows past 2n/3, but it isn't an issue for smaller values of p.

### Oversimplification: counting operations

Textbooks on algorithm design generally don't teach students to measure hardware cost and time. Instead they teach students to count the number of operations. (Even worse, the number of operations is typically mislabeled "time".)

These metrics aren't completely unrelated: a chip of area A running for time T can't carry out more than AT operations. But a single operation can involve a huge chip for a significant amount of time. Students are taught, for example, that randomly accessing an array of size A is a single operation, but in fact this operation involves a chip of area A for time A1/2.

The parallel rho method is vastly more efficient than simply storing 2n/2 hash outputs. It's hard to see this advantage if one merely counts operations.

Students often learn to count memory usage as a secondary metric, and then they can see that the rho method reduces memory. But they still don't learn that counting operations is a poor predictor of time. They think, for example, that sorting an array is extremely efficient; they don't learn that sorting many hash outputs is much more expensive than computing those hash outputs, with the cost ratio growing as a positive power of the array size.

People who focus on minimizing the number of operations often end up with computations that are unnecessarily large and unnecessarily slow. Consider, for example, the problem of enumerating all primes up to 2100. Textbook sieving makes an array of size 2100, crosses off every second number, crosses off every third number, etc. This uses very few operations, but better-optimized algorithms run faster using much less hardware.

Counting operations is simple, and there is value in simplicity. When I'm analyzing and optimizing algorithms, I usually start by counting operations and making an effort to understand the minimum possible number of operations. But I know that some operations are actually much more expensive than others.

### Quantum computation

The remaining algorithms that I'll discuss here are quantum algorithms. These algorithms are built from the basic operations that serious quantum-computer engineers are trying to build, specifically "Hadamard gates" and "Toffoli gates" applied to large numbers of "qubits".

The quantum-computer engineers haven't given us these operations yet. There's a company named D-Wave selling "quantum computers"; but my understanding of the scientific consensus is that the current D-Wave computers can be much more cost-effectively simulated by traditional computers and are therefore useless. On the other hand, D-Wave is

• collecting venture capital,
• successfully selling some of its current useless machines,
• collecting engineering expertise that could be relevant to useful machines someday, and
• not getting into any trouble for deceiving people.

Doesn't look like the worst investment!

Google says that it is building a 49-qubit "quantum computer" that, if it works, will provide "quantum supremacy". What this means is that we don't know any way for a traditional computer to accurately simulate the output of the device as cost-effectively as running the device. But, except for the marketing buzzwords, there's nothing new about devices meeting this notion of "quantum supremacy". Anyone can

• declare pretty much any arrangement of atoms to be a "quantum computer",
• point out that we don't know fast algorithms to accurately simulate our observations of this "computer", and
• declare "quantum supremacy".

More to the point, Google's chip still won't be able to run any of the quantum algorithms discussed below.

I don't mean to dismiss the risk of a coming quantum apocalypse. I see many signs of progress in quantum computing. I've recently bet \$2048 that the RSA-2048 challenge will be publicly factored by a quantum computer by 2033.

This doesn't mean, however, that quantum computers are going to magically speed up all computations. Big quantum speedups, such as Shor's speedup for integer factorization, have been found for only a few special types of computations. Most quantum speedups are only moderate, such as Grover's method to speed up brute-force searches, which I'll discuss in a moment. Many more computations don't seem to be sped up at all; this is particularly common for computations involving large amounts of data.

### Grover's method

People frequently say that quantum computers will force us to move from AES-128 to AES-256. The reason is Grover's method.

Suppose a protocol reveals the AES-128 encryption of a known plaintext block under a secret key. Many protocols do this, and of course AES-128 should remain secure in this known-plaintext scenario. How quickly can an attacker find the key? The obvious attack is to search through all 2128 keys; on average this will succeed after about 2127 keys. This is too expensive to be a concern today.

Grover's method sounds much more worrisome: it searches through all 2128 keys using just 264 quantum evaluations of AES. But this operation count is only the beginning of the analysis. Given current quantum-computer designs, it is reasonable to predict that a quantum evaluation of AES is going to cost considerably more hardware and time than a non-quantum evaluation of AES. Furthermore, Grover's method performs its 264 iterations serially. If one quantum evaluation of AES takes a microsecond, and the attacker wants results within a decade, then Grover's method is limited to 248 iterations, so it searches only 296 keys. Searching all 2128 keys requires 232 parallel copies of this computation.

Breaking AES-128 with Grover's method is much less efficient than breaking RSA-2048 with Shor's algorithm. It could be so inefficient as to be infeasible. Even worse, for reasonable time limits, the speedup from Grover's method could be wiped out by the overhead of quantum operations compared to non-quantum operations. This means that Grover's method could be even more expensive than non-quantum attacks against AES-128. In other words, even if Shor's algorithm is successful, Grover's method could turn out to be useless. At the moment NIST sounds quite skeptical about Grover's method:

When all of these considerations are taken into account, it becomes quite likely that variants of Grover's algorithm will provide no advantage to an adversary wishing to perform a cryptanalytic attack that can be completed in a matter of years, or even decades.

All of the quantum algorithms that I'm going to discuss are based on generalizations and applications of Grover's method. If Grover's method turns out to be useless then these algorithms will be useless too. But I'm going to be more optimistic for the attacker: I'm going to assume that quantum overheads can be reduced far enough to make Grover's method useful.

### The Brassard–Høyer–Tapp algorithm

Grover's paper was posted in 1996. The next year, Brassard, Høyer, and Tapp announced that they could find collisions in an n-bit hash function using only about 2n/3 quantum evaluations of the function. I'll write "BHT" for Brassard, Høyer, and Tapp.

This 2n/3 sounds like a solid improvement compared to the non-quantum 2n/2. The exponent is dropping by a factor 1.5: not as large as Grover's factor 2, but still important for deciding how big n needs to be for security. This is why SIDH uses 768-bit fields instead of 512-bit fields, for example. But looking at hardware cost and time tells a completely different story.

The BHT algorithm works as follows:

• Compute T = 2n/3 targets: hash values H(x1),H(x2),...,H(xT) where x1,x2,...,xT are chosen randomly. Assume for simplicity that these don't collide; they aren't likely to.
• Define f(y) as 1 if y collides with at least one of x1,x2,...,xT under H. In other words, define f(y) as 1 if H(y) is in the set {H(x1),H(x2),...,H(xT)} while y is not in the set {x1,x2,...,xT}. Define f(y) as 0 otherwise.
• Notice that computing f involves only one evaluation of H, since H(x1),H(x2),...,H(xT) are already known at this point. Notice also that a preimage of 1 under f, an input y such that f(y) = 1, reveals a collision in H.
• Use Grover's method to find a preimage of 1 under this function f. Within 2n/3 steps, Grover's method searches 22n/3 possible preimages; this is likely to be enough.

Imagine that there is some magical way to quickly carry out comparisons to a giant precomputed list of targets H(x1),H(x2),...,H(xT). The series of 2n/3 quantum evaluations of f still includes a series of 2n/3 quantum evaluations of H, taking time on the scale of 2n/3. The hardware cost of this algorithm is also on the scale of 2n/3.

For comparison, parallel rho finds a collision in the time for 2n/4 non-quantum evaluations of H, using just 2n/4 cores. For reasonable functions H, this is much less time and much less hardware than the BHT algorithm. (The picture could change if H cores are gigantic, for example if H is a "password hashing" function, but then the overheads in Grover's method will also be gigantic.)

In short, the BHT algorithm is a step backwards from the non-quantum parallel rho algorithm. I pointed out this problem in a 2009 paper. I also analyzed some variants of the BHT algorithm but couldn't find any variants better than parallel rho:

The point of this paper is that all known quantum algorithms to find collisions in hash functions are less cost-effective than traditional cryptanalytic hardware, even under optimistic assumptions regarding the speed of quantum computers. ... Within the space of known quantum collision algorithms, the most cost-effective algorithms are tantamount to non-quantum algorithms, and it is clear that non-quantum algorithms should be implemented with standard bits rather than with qubits.

I'm not aware of any dispute about any part of my algorithm analysis. NIST, in its call for post-quantum submissions, commented that the best algorithm known to find an AES-128 key takes "2170/MAXDEPTH quantum gates or 2143 classical gates", while the best algorithm known to find a SHA3-256 collision takes "2146 classical gates" with no quantum speedup.

### Applying the BHT algorithm to occasional hash values

This brings me to the new CNS paper, "An efficient quantum collision search algorithm and implications on symmetric cryptography". Let's look at what this paper does.

The CNS paper has a parameter r. An example highlighted in the paper is r = n/5. Define Hr(x) as follows: search through r-bit strings s to find strings H(x,s) ending with r zero bits; output a random such string, assuming at least one exists. Using Grover's method one can compute Hr(x) using approximately 2r/2 operations. The paper's strategy to find a collision in H is to find a collision in Hr.

Let's try a sanity check at this point. Instead of applying parallel rho to an n-bit hash function, do we gain anything by applying parallel rho to an (n-r)-bit hash function that takes 2r/2 times as long to compute? As before, let's ignore quantum overhead; and let's assume that we can work around the complications of Hr being non-deterministic and often undefined.

Say we can afford 2n/4 hardware. Applying rho to the n-bit hash function takes time 2n/4. Applying rho to an equally fast (n-r)-bit hash function would take time only 2n/4-r/2, saving a factor 2r/2. But we actually have a function that's 2r/2 times slower, so the savings disappears. There's no advantage here.

Back to the paper. The paper applies the BHT algorithm to Hr: i.e., it precomputes Hr(x1),...,Hr(xT) and then uses Grover's method to search for y colliding with one of x1,...,xT under Hr. (T is "2t-r" in the paper's notation.)

If BHT can't beat parallel rho, and parallel rho applied to Hr can't beat parallel rho applied to H, then how can BHT applied to Hr beat parallel rho applied to H? (As mentioned above, BHT doesn't look so bad if H cores are gigantic, but replacing H with Hr doesn't move in this direction.)

Let's go through the analysis of time and hardware cost. Presumably y collides with xi under Hr with probability about 1/2n-r, since r bits of Hr are already forced to be 0; and thus collides with at least one of x1,...,xT with probability about T/2n-r. Grover's method thus involves 2(n-r)/2/T1/2 iterations (or "2(n-t)/2" in the paper's notation).

Each iteration involves, among other things, an evaluation of Hr, which is 2r/2 iterations of H. Grover's method thus takes time at least 2n/2/T1/2. Meanwhile the hardware cost is on the scale of T. At this point r has disappeared: for example, the case T = 2n/3 takes time at least 2n/3 with hardware cost 2n/3, just like the original BHT case that I described above. This is again beaten by parallel rho.

### CNS false advertising, part 1

The CNS paper claims that if "time-space product (including classical space) ... is the quantity of interest then we can take s = n/5 in our quantum parallel algorithm and we will obtain a time-space product of Õ(212n/25) ... which again beats the best classical algorithms with this benchmarking." Jean-Philippe Aumasson tweeted that this 12n/25 was interesting.

However, the 12n/25 claim is simply wrong. The time-space product actually has exponent 13n/25, which is worse than the n/2 from parallel rho. Let's look at the details.

As background, the best BHT variant considered in my 2009 paper (page 8: "quantum analogue of the more sophisticated size-M machine discussed above") was to compute H(y1),...,H(yT) in parallel and check for any collisions with H(x1),...,H(xT). This variant also doesn't beat parallel rho, although it does match parallel rho if communication costs are ignored.

The CNS paper, not crediting my paper, similarly considers "2s" cores computing Hr(y1),Hr(y2),... in parallel. Instead of one core using 2(n-r)/2/T1/2 iterations, there are 2s cores each using 2(n-r-s)/2/T1/2 iterations. Each iteration takes time 2r/2 as before, so each core uses time 2(n-s)/2/T1/2.

Rather than sorting, the paper applies each of Hr(x1),...,Hr(xT) successively to each of the quantum cores. This takes time T, asymptotically slower than the T1/2 for sorting, although it has the virtue of not requiring quantum communication between the quantum cores. The paper takes T specifically as 2r/2 to balance these T steps against the cost of an Hr computation. Each core thus uses time 2(n-s)/2-r/4.

A reasonable way to handle the communication here would be to put Hr(x1),...,Hr(xT) into a giant "shift register", a specialized T-core device that gradually rotates its input. If the first 2s of these cores are attached to the 2s quantum cores, then each of the quantum cores sees each of Hr(x1),...,Hr(xT) within exactly T steps. I'm assuming here that 2s is at most T.

The initial computation of Hr(x1),...,Hr(xT) takes 2r/2 T = 2r steps, split as 2r-s steps on each of the 2s quantum cores. To balance this against the subsequent time 2(n-s)/2-r/4, the paper takes r = 2n/5 + 2s/5. The time is now 22n/5-3s/5. The assumption that 2s is at most T translates into s being at most n/4.

The CNS paper takes s = n/5. The time is then 27n/25. There are 2n/5 quantum cores, plus T = 26n/25 cores in the shift register. The "time-space product (including classical space)" is the product of 27n/25 and 26n/25, namely 213n/25.

How does the CNS paper claim 212n/25? Details of the calculation aren't provided. Presumably the authors counted only the 2n/5 quantum cores, and forgot to count the 26n/25 non-quantum cores storing 26n/25 hash outputs, even though they said they were (properly) "including classical space". Oops.

### CNS false advertising, part 2

It's common for a paper saying "Algorithm A is beaten by Algorithm B" to be followed by a paper saying "Algorithm B is beaten by Algorithm C". The paper introducing Algorithm C typically tries to argue that Algorithm C is surprising. Occasionally there's an easy argument for this: namely, someone wrote "I conjecture that B is the best algorithm". Typically this conjecture happens because the author writing "Algorithm A is beaten by Algorithm B" thought through avenues of further improvements, and decided that Algorithm B was one of the rare algorithms that couldn't be improved.

Part of the CNS advertising is that its collision algorithm has lower cost than parallel rho. This is false advertising, as I explained above. Another part of the CNS advertising is that parallel rho was conjectured—even more extreme, claimed—to be optimal. This is also false advertising, as I'll explain now.

The first public version of the CNS paper (3 Sep 2017 on eprint) included the following fabrications: first, that in my 2009 paper I "claimed that quantum computers would not make any improvement over on-purpose hardware for collision search", second, that I "remarked that all post-quantum algorithms for collision search had an effective cost of at least 2n/2".

In fact, my paper clearly and repeatedly states that it is analyzing the costs of known algorithms. Here's the quote again:

The point of this paper is that all known quantum algorithms to find collisions in hash functions are less cost-effective than traditional cryptanalytic hardware, even under optimistic assumptions regarding the speed of quantum computers. ... Within the space of known quantum collision algorithms, the most cost-effective algorithms are tantamount to non-quantum algorithms, and it is clear that non-quantum algorithms should be implemented with standard bits rather than with qubits.

I didn't make any claims regarding "all post-quantum algorithms for collision search"; I explicitly referred to all known quantum algorithms for collision search. I made one narrow conjecture regarding a specific class of collision algorithms that I had thought about ("There are several obvious ways to combine quantum search with the rho method, but I have not found any such combinations that improve performance, and I conjecture that—in a suitable generic model—no such improvements are possible"); I avoided making broader conjectures.

After seeing the CNS paper, I privately asked CNS for an erratum. CNS removed the fabrications from a subsequent update of their paper. However, removing a false statement isn't the same as admitting that it's false. There's still no erratum.

This might not seem important as long as nobody has managed to beat 2n/2. But that's not the point. I devote a tremendous amount of effort to carefully assessing cryptographic risks and other security risks. I have very high standards for security claims, security conjectures, and security recommendations. It was wrong for the CNS paper to misrepresent my paper as making a security claim that my paper does not, in fact, make.

### Interlude: multi-target preimages

The collision problem can be viewed as a simplification of another problem that shows up very frequently, namely the multi-target preimage problem.

In the multi-target preimage problem, one is given H(x1),...,H(xT), and the goal is to find some y such that H(y) is in the set {H(x1),...,H(xT)}. A multi-target preimage attack immediately implies a collision attack (assuming H is reasonably far from being injective), so the multi-target preimage problem is at least as difficult as the collision problem, perhaps more difficult. At the end of this blog post I'll say more about the gap between these problems.

Part of the security analysis in the SPHINCS paper considers the post-quantum security of multi-target preimages. "Bernstein  concludes that all known post-quantum collision-finding algorithms cost at least 2n/2, implying that the post-quantum cost of multi-target preimage attacks is also 2n/2", the paper states. "For example, for n = 256 and T = 256 the best post-quantum attacks use only 2100 queries but still cost 2128."

The cost metric used here is again price-performance ratio, i.e., the product of time and hardware cost. Obviously time and hardware cost can each be reduced below 2n/2: the parallel rho method does this for collision search, and parallel Grover does this for preimage search. But nobody has published an algorithm where the product is below 2n/2.

### CNS false advertising, part 3

The CNS paper advertises multi-target preimage search in very much the same way that it advertises collision search: it falsely attributes conjectures to previous papers that did not in fact state those conjectures; it then falsely claims to disprove those conjectures.

Specifically, the CNS paper considers the same 256-target 256-bit preimage-search example used in the SPHINCS paper. The CNS paper states "they claim that the best quantum algorithm has a cost of 2128". This omits a critical piece of context, the word "known" from the SPHINCS paper. Someone reading the SPHINCS paper finds an undisputed analysis of known algorithms; someone reading the summary in the CNS paper incorrectly thinks that the SPHINCS paper was making a claim regarding all algorithms.

The CNS paper then says that "it is possible to attack their example with a time complexity of ... 2119.6". What exactly is this supposed to be contradicting? Cost in the SPHINCS paper is not time: it is the product of time and hardware cost. The SPHINCS paper does not make the silly claim that time alone is at least 2n/2: as noted above, parallel rho beats time 2n/2 for collision search, and parallel Grover beats time 2n/2 for preimage search.

The CNS attack does not beat cost 2128 for n=256: it uses 2119.6 quantum operations times megabytes of memory. The CNS attack beats time 2128 (if the quantum overhead is low enough), but this is not new.

### CNS vagueness

Beyond the clear errors described above, the CNS paper has various claims of novelty whose meaning is unclear, making the claims difficult to evaluate.

The abstract claims that the paper has "the first proof of an actual quantum time speedup for collision search". What do CNS think they mean by "actual time"?

The CNS paper claims that its parallel algorithms "are the first ones to significantly improve on the best classical algorithms" for collision search, and provide "better performance than classical algorithms for a wide range [of] parameters." In precisely what metric do they think they have an improvement compared to non-quantum collision-finding algorithms?

Of course one can point to, e.g., the number of H queries as a metric where the CNS algorithm is faster than non-quantum collision-finding algorithms. But then the claim to be "first" is wrong: the BHT algorithm already does better in this metric. The CNS paper also makes this claim for multi-target preimage search, but how can they say this when non-quantum preimage algorithms are already beaten by Grover's method?

The CNS paper also claims to "solve an open problem and contradict a conjecture on the complexity of quantum collision and multi-target preimage search ... We ask the reader to keep in mind that these results seemed particularly surprising as it was conjectured that quantum algorithms for solving these problems wouldn't be more efficient than classical ones." What exactly is the "open problem" that the authors claim to be solving? Where is the "conjecture" that they claim to be contradicting? And, again, why are they ignoring Grover's method in the case of preimage search?

### Multi-target preimage search, revisited

Let me close by returning to the gap between the complexities of collision search and multi-target preimage search.

In pre-quantum cryptography, multi-target preimage search is quite worrisome. It's the main reason that I tell people to upgrade from 128-bit ciphers to 256-bit ciphers. But multi-target preimage search is certainly not as fast as collision search.

In post-quantum cryptography, single-target preimage search suddenly drops down to 2n/2 operations. Multi-target preimage search is therefore also 2n/2 operations: simply focus on the first target. This is just as few operations as collision search, if we ignore quantum overhead.

The post-quantum situation becomes more subtle if we look at the whole tradeoff curve between hardware cost and time. For example, if the time limit is 2n/3:

• The best collision-search algorithm known uses 2n/6 hardware. (This is what my 2009 paper says, and it's still true today.)
• Multi-target preimage search is at least as hard as collision search; i.e., at least 2n/6 hardware by all known algorithms.
• Multi-target preimage search is no harder than single-target preimage search, which uses 2n/3 hardware by Grover's method.

This leaves a range of possibilities between 2n/6 and 2n/3 for the hardware cost of multi-target preimage search.

Banegas and I published a paper on this topic at SAC 2017. There are various epsilon-power overheads in the paper that need to be analyzed in more detail, but the asymptotic exponents are clear: compared to single-target preimage search, we reduce the hardware cost by a factor T1/2-ε for T targets. (This breaks down when T is beyond the hardware size, but let's focus on the important case of a well-equipped attacker who has more hardware than targets.) Equivalently, we save a factor T1/4-ε in time (with a different ε) for the same hardware cost.

If long-distance communication were free then we would do even better, reducing the hardware cost by a factor close to T. The idea here, as in my 2009 paper, is to make T guesses y1,...,yT in parallel, and then check whether any of H(y1),...,H(yT) match any of H(x1),...,H(xT), by sorting the whole list. This has chance T2/2n of success, and has hardware cost on the scale of T. Free communication means that parallel sorting takes very little time, so with time limit 2n/3 we can afford 2n/3 Grover iterations, checking 22n/3 possibilities for the vector (y1,...,yT) and thus increasing the success chance from T2/2n to T2/2n/3. Running the same thing 2n/3/T2 times in parallel has a good chance of success, and hardware cost on the scale of 2n/3/T.

With realistic communication costs, sorting slows down to T1/2. Losing a factor T1/2 in the number of Grover iterations would reduce the success chance to T/2n/3, not beating single-target preimage search for the same amount of hardware. To do better, we combine Grover's method with the rho method: we compute roughly T1/2 iterations of H on each of y1,...,yT,x1,...,xT, stopping at distinguished points, and sort the results. The paper explains how we do all of this reversibly, as required for Grover's method. In the end we save a factor T1/2-ε in hardware cost for the same time limit.

The CNS paper instead compares Hr(y) to H(x1),...,H(xT). Since Hr(y) starts with r zero bits, there's no point in checking H(xi) if H(xi) doesn't start with r zero bits. CNS thus begin by paring the list H(x1),...,H(xT) down to a reduced list of approximately T/2r targets that do start with r zero bits. Each guess for y thus has chance about (T/2r)/2n-r = T/2n of success. Grover's method uses 2(n-s)/2/T1/2 iterations on each of 2s parallel cores.

The CNS paper takes r as (2/3) log2 T so that serially checking T/2r = T1/3 targets is balanced against the cost of an Hr computation. Each core thus takes time T1/3 2(n-s)/2/T1/2 = 2(n-s)/2/T1/6. To bring the time down to, say, 2n/3, one must take 2s as 2n/3/T1/3, and then the hardware cost is 2n/3/T1/3. This saves a factor T1/3 in hardware cost for the same time limit. (As before, this breaks down when T becomes very large.) My result with Banegas is better, at least asymptotically: T1/2-ε is more of a savings than T1/3.

Banegas and I put our paper online the week of SAC 2017. The CNS paper was posted a few weeks later, but the authors say that it was written independently, and there's no reason to question this. (The SAC and Asiacrypt submission deadlines were months earlier.) The CNS paper could and should have simply reported its straightforward T1/3 improvement for multi-target preimage search, without the false advertising discussed above. But then the paper wouldn't have been accepted to Asiacrypt.

Version: This is version 2017.10.17 of the 20171017-collisions.html web page.