"An efficient quantum collision search algorithm and implications on symmetric cryptography" is a new paper from André Chailloux, Marı́a Naya-Plasencia, and André Schrottenloher. I'll write "CNS" for the list of authors. The CNS paper was accepted to Asiacrypt 2017; Asiacrypt is one of the three yearly flagship conferences of the International Association for Cryptologic Research. The paper appeared online last month.
The title sounds scary, doesn't it? Collision-resistance is one of the primary features promised by cryptographic hash functions such as SHA-256 and SHA3-256. The CNS algorithm is generic: it isn't specific to any particular hash function, so it applies to SHA-256 and SHA3-256 and everything else that comes to mind. Some signature systems are designed to be "collision-resilient", meaning that they aren't broken by collisions in the underlying hash function; but collision-finding is a basic subroutine inside many more attack algorithms. An efficient generic algorithm for collision search would have huge ramifications.
The reality, however, is that the CNS algorithm is not an efficient way to find collisions. In fact, it's so inefficient that it's beaten by the standard non-quantum method of finding collisions, namely "parallel rho" collision search from 1997 van Oorschot–Wiener. Let's take an in-depth look at what's going on here.
The goal is to find collisions in an n-bit hash function H, i.e., a hash function producing n-bit output. This means finding two different inputs, say x and y, such that H(x) = H(y).
The basic strategy for finding collisions, if you don't know anything special about H, is to write down many hash outputs for different inputs and hope that two of the outputs are the same. The famous "birthday paradox" says that this becomes likely to occur when the number of hash outputs grows to the scale of 2^{n/2}. (You can also try a smaller computation, with far fewer than 2^{n/2} hash outputs, but then the success chance drops quadratically. In this blog post I'll focus on computations with a large success chance.)
You can find this collision by storing the output-input pairs in an associative array. Here's a Python script that tries 2^{21} outputs for a 40-bit hash function and has a good chance of finding a collision:
import hashlib def minihash(seed): # 40-bit hash h = hashlib.sha256() h.update(seed) return h.digest()[:5] import random def randomhash(): r = hex(random.randrange(2**100)) return (minihash(r),r) d = dict() for i in range(2**21): h,r = randomhash() if h in d: print r,d[h] break d[h] = r
Alternatively, you can sort the output-input pairs:
h = [randomhash() for i in range(2**21)] h.sort() for i in range(1,len(h)): if h[i][0] == h[i-1][0]: print h[i][1],h[i-1][1]
Either approach uses many megabytes of storage. Obviously storing 2^{n/2} hash outputs involves massive hardware costs when n is reasonably large. But let's imagine that this is feasible. What's less obvious is how much time the computation takes.
A single access to a huge associative array takes a long time. Think about a square chip large enough to hold 2^{n/2} hash outputs: simply communicating from one side of the chip to the other takes time proportional to 2^{n/4}. Packing data more densely into a cube reduces the distance from 2^{n/4} to 2^{n/6}, but raises unsolved problems of getting enough energy into a cubical computing device, and getting the resulting heat out. I'll focus on conventional two-dimensional architectures in this blog post.
If one access takes time 2^{n/4}, do 2^{n/2} accesses take time 2^{3n/4}? Not necessarily: the accesses can be parallelized. There are algorithms that sort 2^{n/2} items on a square chip in total time proportional to 2^{n/4}: there are 2^{n/2} miniature cores, each carrying out (say) 8⋅2^{n/4} compare-exchange steps with adjacent cores. Schimmler's 4-way merge sort is an easy-to-understand algorithm of this type. More complicated algorithms reduce the constant 8.
Of course the initial 2^{n/2} hash computations can also be parallelized: simply run 2^{n/2} separate hash-computation cores. This won't be a bottleneck as long as the time for one hash computation is below the 2^{n/4} scale for memory access.
Or run, say, 2^{n/2}/64 hash-computation cores, each 64 times, distributing the 64 hash outputs to 64 nearby memory units. This still won't be a bottleneck as long as the time for 64 hash computations is below the 2^{n/4} scale for memory access.
To summarize, the hardware costs are on the scale of 2^{n/2}, or more precisely 2^{n/2}/64 hash-computation cores plus 2^{n/2} compare-exchange cores. The time is on the scale of 2^{n/4}, or more precisely 64 hash computations plus 8⋅2^{n/4} compare-exchange steps. The attacker is likely to save a constant factor by looking more closely at the details, including the hash-function details, but in this blog post I'll focus on the exponents: exponent n/2 for hardware cost and exponent n/4 for time.
The parallel rho method takes much less hardware and less time. Specifically, it reduces the time to 2^{n/6}, with hardware costs on the scale of 2^{n/3}. More generally, the parallel rho method offers a huge range of tradeoffs, such as time 2^{n/4} using 2^{n/4} hardware, or time 2^{n/3} using 2^{n/6} hardware.
The rho method uses a small hashing core that starts from some random input x and computes a series of hashes as quickly as possible: H(x), H^{2}(x), H^{3}(x), etc. Here H^{2}(x) means H(H(x)); H^{3}(x) means H(H(H(x))); etc. An important feature of the rho method is that most hash outputs aren't saved; we'll see later how to retroactively recognize collisions.
The core stops when it finds a distinguished point. A distinguished point, by definition, is a string that starts with d zero bits. A random string has probability 1/2^{d} of being distinguished, so presumably the core will find a distinguished point within about 2^{d} hash calls on average.
What if the function H avoids, or tends to avoid, distinguished points? To avoid bad cases, the attacker can replace H(x) with E_{k}(H(x)), where E is the attacker's favorite block cipher, and k is a secret key chosen at the beginning of the attack. A collision E_{k}(H(x)) = E_{k}(H(y)) is equivalent to a collision H(x) = H(y). I'll skip E_{k} for the rest of my description: real hash functions such as SHA-256 are experimentally observed to hit distinguished points at the expected rate.
What if the core enters a cycle before it reaches a distinguished point? For example, what happens if the core sees five different non-distinguished points x,H(x),H^{2}(x),H^{3}(x),H^{4}(x) and then H^{5}(x) turns out to be the same as H^{2}(x)? Let's put a time limit on the computation, say 2^{d+1} hash calls, so the core won't cycle forever. (It's possible, but not worthwhile, to go to more effort to find the collision between inputs H^{4}(x) and H(x) in this example.)
Now build p of these small cores running in parallel, starting with random inputs x_{1},x_{2},...,x_{p}. Each core continues for at most 2^{d+1} hash calls. Most of the cores will find distinguished points. The total number of hash outputs encountered by all of the cores together is proportional to 2^{d}p.
As before, I'll assume that the total number of hash outputs is on the scale of 2^{n/2}. This means that 2^{d}p is on the scale of 2^{n/2}. For example, one can take d as n/6 and p around 2^{n/3}; or d as n/4 and p around 2^{n/4}; or d as n/3 and p around 2^{n/6}. Most cores won't enter cycles as long as d is noticeably smaller than n/2.
Here's a Python script that simulates this process for p = 2^{5} cores, each running for at most 2^{17} steps, using the same 40-bit minihash shown above:
start = random.randrange(2**100) h = [] for core in range(2**5): # all cores can run in parallel seed = hex(start + core) r,j = seed,0 while j < 2**17: r,j = minihash(r),j+1 if r[:2] == 'XX': h += [(r,seed,j)] break
This script defines distinguished points differently: namely, strings where the first two bytes are XX. What matters is that a random string has a specified probability, in this case 1/2^{16}, of being distinguished. The script also takes sequential inputs x_{1},x_{2},...,x_{p} rather than independent random inputs.
What happens if one non-distinguished point, say H^{77584}(x_{7}), is different from another, say H^{92797}(x_{20}), but the next hash value H^{77585}(x_{7}) collides with H^{92798}(x_{20})? The hash value after that, H^{77586}(x_{7}), will equal H^{92799}(x_{20}), and so on. Cores 7 and 20 will end up at the same distinguished point H^{77584+s}(x_{7}) = H^{92797+s}(x_{20}) for some positive integer s, assuming that both cores run long enough to reach this distinguished point. If you see that this happened, and if you're also given records of x_{7} and x_{20} and the counters 77584+s,92797+s, then you can quickly find the collision by computing the difference 15213 of the counters and then checking H^{0}(x_{7}) against H^{15213}(x_{20}), checking H^{1}(x_{7}) against H^{15214}(x_{20}), etc. Here's a Python script that does this, starting from the array h of distinguished points produced earlier:
import binascii def hexstr(s): return binascii.hexlify(s) h.sort() for i in range(1,len(h)): if h[i][0] == h[i-1][0]: r0,j0 = h[i][1],-h[i][2] r1,j1 = h[i-1][1],-h[i-1][2] for i in range(max(j0,j1),0): while j0 < i: r0,j0 = minihash(r0),j0+1 while j1 < i: r1,j1 = minihash(r1),j1+1 if minihash(r0) == minihash(r1): print hexstr(r0),hexstr(r1) break
In other words, to find collisions of hash values, we simply need to find collisions of distinguished points. This is useful because there are far fewer distinguished points than hash values.
The parallel rho method stores the distinguished points in an associative array, or simply sorts them, as in this last Python script. Parallel sorting takes time proportional to p^{1/2} on a square chip. This becomes a bottleneck compared to the 2^{d+1} hash computations as p grows past 2^{n/3}, but it isn't an issue for smaller values of p.
Textbooks on algorithm design generally don't teach students to measure hardware cost and time. Instead they teach students to count the number of operations. (Even worse, the number of operations is typically mislabeled "time".)
These metrics aren't completely unrelated: a chip of area A running for time T can't carry out more than AT operations. But a single operation can involve a huge chip for a significant amount of time. Students are taught, for example, that randomly accessing an array of size A is a single operation, but in fact this operation involves a chip of area A for time A^{1/2}.
The parallel rho method is vastly more efficient than simply storing 2^{n/2} hash outputs. It's hard to see this advantage if one merely counts operations.
Students often learn to count memory usage as a secondary metric, and then they can see that the rho method reduces memory. But they still don't learn that counting operations is a poor predictor of time. They think, for example, that sorting an array is extremely efficient; they don't learn that sorting many hash outputs is much more expensive than computing those hash outputs, with the cost ratio growing as a positive power of the array size.
People who focus on minimizing the number of operations often end up with computations that are unnecessarily large and unnecessarily slow. Consider, for example, the problem of enumerating all primes up to 2^{100}. Textbook sieving makes an array of size 2^{100}, crosses off every second number, crosses off every third number, etc. This uses very few operations, but better-optimized algorithms run faster using much less hardware.
Counting operations is simple, and there is value in simplicity. When I'm analyzing and optimizing algorithms, I usually start by counting operations and making an effort to understand the minimum possible number of operations. But I know that some operations are actually much more expensive than others.
The remaining algorithms that I'll discuss here are quantum algorithms. These algorithms are built from the basic operations that serious quantum-computer engineers are trying to build, specifically "Hadamard gates" and "Toffoli gates" applied to large numbers of "qubits".
The quantum-computer engineers haven't given us these operations yet. There's a company named D-Wave selling "quantum computers"; but my understanding of the scientific consensus is that the current D-Wave computers can be much more cost-effectively simulated by traditional computers and are therefore useless. On the other hand, D-Wave is
Doesn't look like the worst investment!
Google says that it is building a 49-qubit "quantum computer" that, if it works, will provide "quantum supremacy". What this means is that we don't know any way for a traditional computer to accurately simulate the output of the device as cost-effectively as running the device. But, except for the marketing buzzwords, there's nothing new about devices meeting this notion of "quantum supremacy". Anyone can
More to the point, Google's chip still won't be able to run any of the quantum algorithms discussed below.
I don't mean to dismiss the risk of a coming quantum apocalypse. I see many signs of progress in quantum computing. I've recently bet $2048 that the RSA-2048 challenge will be publicly factored by a quantum computer by 2033.
This doesn't mean, however, that quantum computers are going to magically speed up all computations. Big quantum speedups, such as Shor's speedup for integer factorization, have been found for only a few special types of computations. Most quantum speedups are only moderate, such as Grover's method to speed up brute-force searches, which I'll discuss in a moment. Many more computations don't seem to be sped up at all; this is particularly common for computations involving large amounts of data.
People frequently say that quantum computers will force us to move from AES-128 to AES-256. The reason is Grover's method.
Suppose a protocol reveals the AES-128 encryption of a known plaintext block under a secret key. Many protocols do this, and of course AES-128 should remain secure in this known-plaintext scenario. How quickly can an attacker find the key? The obvious attack is to search through all 2^{128} keys; on average this will succeed after about 2^{127} keys. This is too expensive to be a concern today.
Grover's method sounds much more worrisome: it searches through all 2^{128} keys using just 2^{64} quantum evaluations of AES. But this operation count is only the beginning of the analysis. Given current quantum-computer designs, it is reasonable to predict that a quantum evaluation of AES is going to cost considerably more hardware and time than a non-quantum evaluation of AES. Furthermore, Grover's method performs its 2^{64} iterations serially. If one quantum evaluation of AES takes a microsecond, and the attacker wants results within a decade, then Grover's method is limited to 2^{48} iterations, so it searches only 2^{96} keys. Searching all 2^{128} keys requires 2^{32} parallel copies of this computation.
Breaking AES-128 with Grover's method is much less efficient than breaking RSA-2048 with Shor's algorithm. It could be so inefficient as to be infeasible. Even worse, for reasonable time limits, the speedup from Grover's method could be wiped out by the overhead of quantum operations compared to non-quantum operations. This means that Grover's method could be even more expensive than non-quantum attacks against AES-128. In other words, even if Shor's algorithm is successful, Grover's method could turn out to be useless. At the moment NIST sounds quite skeptical about Grover's method:
When all of these considerations are taken into account, it becomes quite likely that variants of Grover's algorithm will provide no advantage to an adversary wishing to perform a cryptanalytic attack that can be completed in a matter of years, or even decades.
All of the quantum algorithms that I'm going to discuss are based on generalizations and applications of Grover's method. If Grover's method turns out to be useless then these algorithms will be useless too. But I'm going to be more optimistic for the attacker: I'm going to assume that quantum overheads can be reduced far enough to make Grover's method useful.
Grover's paper was posted in 1996. The next year, Brassard, Høyer, and Tapp announced that they could find collisions in an n-bit hash function using only about 2^{n/3} quantum evaluations of the function. I'll write "BHT" for Brassard, Høyer, and Tapp.
This 2^{n/3} sounds like a solid improvement compared to the non-quantum 2^{n/2}. The exponent is dropping by a factor 1.5: not as large as Grover's factor 2, but still important for deciding how big n needs to be for security. This is why SIDH uses 768-bit fields instead of 512-bit fields, for example. But looking at hardware cost and time tells a completely different story.
The BHT algorithm works as follows:
Imagine that there is some magical way to quickly carry out comparisons to a giant precomputed list of targets H(x_{1}),H(x_{2}),...,H(x_{T}). The series of 2^{n/3} quantum evaluations of f still includes a series of 2^{n/3} quantum evaluations of H, taking time on the scale of 2^{n/3}. The hardware cost of this algorithm is also on the scale of 2^{n/3}.
For comparison, parallel rho finds a collision in the time for 2^{n/4} non-quantum evaluations of H, using just 2^{n/4} cores. For reasonable functions H, this is much less time and much less hardware than the BHT algorithm. (The picture could change if H cores are gigantic, for example if H is a "password hashing" function, but then the overheads in Grover's method will also be gigantic.)
In short, the BHT algorithm is a step backwards from the non-quantum parallel rho algorithm. I pointed out this problem in a 2009 paper. I also analyzed some variants of the BHT algorithm but couldn't find any variants better than parallel rho:
The point of this paper is that all known quantum algorithms to find collisions in hash functions are less cost-effective than traditional cryptanalytic hardware, even under optimistic assumptions regarding the speed of quantum computers. ... Within the space of known quantum collision algorithms, the most cost-effective algorithms are tantamount to non-quantum algorithms, and it is clear that non-quantum algorithms should be implemented with standard bits rather than with qubits.
I'm not aware of any dispute about any part of my algorithm analysis. NIST, in its call for post-quantum submissions, commented that the best algorithm known to find an AES-128 key takes "2^{170}/MAXDEPTH quantum gates or 2^{143} classical gates", while the best algorithm known to find a SHA3-256 collision takes "2^{146} classical gates" with no quantum speedup.
This brings me to the new CNS paper, "An efficient quantum collision search algorithm and implications on symmetric cryptography". Let's look at what this paper does.
The CNS paper has a parameter r. An example highlighted in the paper is r = n/5. Define H_{r}(x) as follows: search through r-bit strings s to find strings H(x,s) ending with r zero bits; output a random such string, assuming at least one exists. Using Grover's method one can compute H_{r}(x) using approximately 2^{r/2} operations. The paper's strategy to find a collision in H is to find a collision in H_{r}.
Let's try a sanity check at this point. Instead of applying parallel rho to an n-bit hash function, do we gain anything by applying parallel rho to an (n-r)-bit hash function that takes 2^{r/2} times as long to compute? As before, let's ignore quantum overhead; and let's assume that we can work around the complications of H_{r} being non-deterministic and often undefined.
Say we can afford 2^{n/4} hardware. Applying rho to the n-bit hash function takes time 2^{n/4}. Applying rho to an equally fast (n-r)-bit hash function would take time only 2^{n/4-r/2}, saving a factor 2^{r/2}. But we actually have a function that's 2^{r/2} times slower, so the savings disappears. There's no advantage here.
Back to the paper. The paper applies the BHT algorithm to H_{r}: i.e., it precomputes H_{r}(x_{1}),...,H_{r}(x_{T}) and then uses Grover's method to search for y colliding with one of x_{1},...,x_{T} under H_{r}. (T is "2^{t-r}" in the paper's notation.)
If BHT can't beat parallel rho, and parallel rho applied to H_{r} can't beat parallel rho applied to H, then how can BHT applied to H_{r} beat parallel rho applied to H? (As mentioned above, BHT doesn't look so bad if H cores are gigantic, but replacing H with H_{r} doesn't move in this direction.)
Let's go through the analysis of time and hardware cost. Presumably y collides with x_{i} under H_{r} with probability about 1/2^{n-r}, since r bits of H_{r} are already forced to be 0; and thus collides with at least one of x_{1},...,x_{T} with probability about T/2^{n-r}. Grover's method thus involves 2^{(n-r)/2}/T^{1/2} iterations (or "2^{(n-t)/2}" in the paper's notation).
Each iteration involves, among other things, an evaluation of H_{r}, which is 2^{r/2} iterations of H. Grover's method thus takes time at least 2^{n/2}/T^{1/2}. Meanwhile the hardware cost is on the scale of T. At this point r has disappeared: for example, the case T = 2^{n/3} takes time at least 2^{n/3} with hardware cost 2^{n/3}, just like the original BHT case that I described above. This is again beaten by parallel rho.
The CNS paper claims that if "time-space product (including classical space) ... is the quantity of interest then we can take s = n/5 in our quantum parallel algorithm and we will obtain a time-space product of Õ(2^{12n/25}) ... which again beats the best classical algorithms with this benchmarking." Jean-Philippe Aumasson tweeted that this 12n/25 was interesting.
However, the 12n/25 claim is simply wrong. The time-space product actually has exponent 13n/25, which is worse than the n/2 from parallel rho. Let's look at the details.
As background, the best BHT variant considered in my 2009 paper (page 8: "quantum analogue of the more sophisticated size-M machine discussed above") was to compute H(y_{1}),...,H(y_{T}) in parallel and check for any collisions with H(x_{1}),...,H(x_{T}). This variant also doesn't beat parallel rho, although it does match parallel rho if communication costs are ignored.
The CNS paper, not crediting my paper, similarly considers "2^{s}" cores computing H_{r}(y_{1}),H_{r}(y_{2}),... in parallel. Instead of one core using 2^{(n-r)/2}/T^{1/2} iterations, there are 2^{s} cores each using 2^{(n-r-s)/2}/T^{1/2} iterations. Each iteration takes time 2^{r/2} as before, so each core uses time 2^{(n-s)/2}/T^{1/2}.
Rather than sorting, the paper applies each of H_{r}(x_{1}),...,H_{r}(x_{T}) successively to each of the quantum cores. This takes time T, asymptotically slower than the T^{1/2} for sorting, although it has the virtue of not requiring quantum communication between the quantum cores. The paper takes T specifically as 2^{r/2} to balance these T steps against the cost of an H_{r} computation. Each core thus uses time 2^{(n-s)/2-r/4}.
A reasonable way to handle the communication here would be to put H_{r}(x_{1}),...,H_{r}(x_{T}) into a giant "shift register", a specialized T-core device that gradually rotates its input. If the first 2^{s} of these cores are attached to the 2^{s} quantum cores, then each of the quantum cores sees each of H_{r}(x_{1}),...,H_{r}(x_{T}) within exactly T steps. I'm assuming here that 2^{s} is at most T.
The initial computation of H_{r}(x_{1}),...,H_{r}(x_{T}) takes 2^{r/2} T = 2^{r} steps, split as 2^{r-s} steps on each of the 2^{s} quantum cores. To balance this against the subsequent time 2^{(n-s)/2-r/4}, the paper takes r = 2n/5 + 2s/5. The time is now 2^{2n/5-3s/5}. The assumption that 2^{s} is at most T translates into s being at most n/4.
The CNS paper takes s = n/5. The time is then 2^{7n/25}. There are 2^{n/5} quantum cores, plus T = 2^{6n/25} cores in the shift register. The "time-space product (including classical space)" is the product of 2^{7n/25} and 2^{6n/25}, namely 2^{13n/25}.
How does the CNS paper claim 2^{12n/25}? Details of the calculation aren't provided. Presumably the authors counted only the 2^{n/5} quantum cores, and forgot to count the 2^{6n/25} non-quantum cores storing 2^{6n/25} hash outputs, even though they said they were (properly) "including classical space". Oops.
It's common for a paper saying "Algorithm A is beaten by Algorithm B" to be followed by a paper saying "Algorithm B is beaten by Algorithm C". The paper introducing Algorithm C typically tries to argue that Algorithm C is surprising. Occasionally there's an easy argument for this: namely, someone wrote "I conjecture that B is the best algorithm". Typically this conjecture happens because the author writing "Algorithm A is beaten by Algorithm B" thought through avenues of further improvements, and decided that Algorithm B was one of the rare algorithms that couldn't be improved.
Part of the CNS advertising is that its collision algorithm has lower cost than parallel rho. This is false advertising, as I explained above. Another part of the CNS advertising is that parallel rho was conjectured—even more extreme, claimed—to be optimal. This is also false advertising, as I'll explain now.
The first public version of the CNS paper (3 Sep 2017 on eprint) included the following fabrications: first, that in my 2009 paper I "claimed that quantum computers would not make any improvement over on-purpose hardware for collision search", second, that I "remarked that all post-quantum algorithms for collision search had an effective cost of at least 2^{n/2}".
In fact, my paper clearly and repeatedly states that it is analyzing the costs of known algorithms. Here's the quote again:
The point of this paper is that all known quantum algorithms to find collisions in hash functions are less cost-effective than traditional cryptanalytic hardware, even under optimistic assumptions regarding the speed of quantum computers. ... Within the space of known quantum collision algorithms, the most cost-effective algorithms are tantamount to non-quantum algorithms, and it is clear that non-quantum algorithms should be implemented with standard bits rather than with qubits.
I didn't make any claims regarding "all post-quantum algorithms for collision search"; I explicitly referred to all known quantum algorithms for collision search. I made one narrow conjecture regarding a specific class of collision algorithms that I had thought about ("There are several obvious ways to combine quantum search with the rho method, but I have not found any such combinations that improve performance, and I conjecture that—in a suitable generic model—no such improvements are possible"); I avoided making broader conjectures.
After seeing the CNS paper, I privately asked CNS for an erratum. CNS removed the fabrications from a subsequent update of their paper. However, removing a false statement isn't the same as admitting that it's false. There's still no erratum.
This might not seem important as long as nobody has managed to beat 2^{n/2}. But that's not the point. I devote a tremendous amount of effort to carefully assessing cryptographic risks and other security risks. I have very high standards for security claims, security conjectures, and security recommendations. It was wrong for the CNS paper to misrepresent my paper as making a security claim that my paper does not, in fact, make.
The collision problem can be viewed as a simplification of another problem that shows up very frequently, namely the multi-target preimage problem.
In the multi-target preimage problem, one is given H(x_{1}),...,H(x_{T}), and the goal is to find some y such that H(y) is in the set {H(x_{1}),...,H(x_{T})}. A multi-target preimage attack immediately implies a collision attack (assuming H is reasonably far from being injective), so the multi-target preimage problem is at least as difficult as the collision problem, perhaps more difficult. At the end of this blog post I'll say more about the gap between these problems.
Part of the security analysis in the SPHINCS paper considers the post-quantum security of multi-target preimages. "Bernstein [8] concludes that all known post-quantum collision-finding algorithms cost at least 2^{n/2}, implying that the post-quantum cost of multi-target preimage attacks is also 2^{n/2}", the paper states. "For example, for n = 256 and T = 2^{56} the best post-quantum attacks use only 2^{100} queries but still cost 2^{128}."
The cost metric used here is again price-performance ratio, i.e., the product of time and hardware cost. Obviously time and hardware cost can each be reduced below 2^{n/2}: the parallel rho method does this for collision search, and parallel Grover does this for preimage search. But nobody has published an algorithm where the product is below 2^{n/2}.
The CNS paper advertises multi-target preimage search in very much the same way that it advertises collision search: it falsely attributes conjectures to previous papers that did not in fact state those conjectures; it then falsely claims to disprove those conjectures.
Specifically, the CNS paper considers the same 2^{56}-target 256-bit preimage-search example used in the SPHINCS paper. The CNS paper states "they claim that the best quantum algorithm has a cost of 2^{128}". This omits a critical piece of context, the word "known" from the SPHINCS paper. Someone reading the SPHINCS paper finds an undisputed analysis of known algorithms; someone reading the summary in the CNS paper incorrectly thinks that the SPHINCS paper was making a claim regarding all algorithms.
The CNS paper then says that "it is possible to attack their example with a time complexity of ... 2^{119.6}". What exactly is this supposed to be contradicting? Cost in the SPHINCS paper is not time: it is the product of time and hardware cost. The SPHINCS paper does not make the silly claim that time alone is at least 2^{n/2}: as noted above, parallel rho beats time 2^{n/2} for collision search, and parallel Grover beats time 2^{n/2} for preimage search.
The CNS attack does not beat cost 2^{128} for n=256: it uses 2^{119.6} quantum operations times megabytes of memory. The CNS attack beats time 2^{128} (if the quantum overhead is low enough), but this is not new.
Beyond the clear errors described above, the CNS paper has various claims of novelty whose meaning is unclear, making the claims difficult to evaluate.
The abstract claims that the paper has "the first proof of an actual quantum time speedup for collision search". What do CNS think they mean by "actual time"?
The CNS paper claims that its parallel algorithms "are the first ones to significantly improve on the best classical algorithms" for collision search, and provide "better performance than classical algorithms for a wide range [of] parameters." In precisely what metric do they think they have an improvement compared to non-quantum collision-finding algorithms?
Of course one can point to, e.g., the number of H queries as a metric where the CNS algorithm is faster than non-quantum collision-finding algorithms. But then the claim to be "first" is wrong: the BHT algorithm already does better in this metric. The CNS paper also makes this claim for multi-target preimage search, but how can they say this when non-quantum preimage algorithms are already beaten by Grover's method?
The CNS paper also claims to "solve an open problem and contradict a conjecture on the complexity of quantum collision and multi-target preimage search ... We ask the reader to keep in mind that these results seemed particularly surprising as it was conjectured that quantum algorithms for solving these problems wouldn't be more efficient than classical ones." What exactly is the "open problem" that the authors claim to be solving? Where is the "conjecture" that they claim to be contradicting? And, again, why are they ignoring Grover's method in the case of preimage search?
Let me close by returning to the gap between the complexities of collision search and multi-target preimage search.
In pre-quantum cryptography, multi-target preimage search is quite worrisome. It's the main reason that I tell people to upgrade from 128-bit ciphers to 256-bit ciphers. But multi-target preimage search is certainly not as fast as collision search.
In post-quantum cryptography, single-target preimage search suddenly drops down to 2^{n/2} operations. Multi-target preimage search is therefore also 2^{n/2} operations: simply focus on the first target. This is just as few operations as collision search, if we ignore quantum overhead.
The post-quantum situation becomes more subtle if we look at the whole tradeoff curve between hardware cost and time. For example, if the time limit is 2^{n/3}:
This leaves a range of possibilities between 2^{n/6} and 2^{n/3} for the hardware cost of multi-target preimage search.
Banegas and I published a paper on this topic at SAC 2017. There are various epsilon-power overheads in the paper that need to be analyzed in more detail, but the asymptotic exponents are clear: compared to single-target preimage search, we reduce the hardware cost by a factor T^{1/2-ε} for T targets. (This breaks down when T is beyond the hardware size, but let's focus on the important case of a well-equipped attacker who has more hardware than targets.) Equivalently, we save a factor T^{1/4-ε} in time (with a different ε) for the same hardware cost.
If long-distance communication were free then we would do even better, reducing the hardware cost by a factor close to T. The idea here, as in my 2009 paper, is to make T guesses y_{1},...,y_{T} in parallel, and then check whether any of H(y_{1}),...,H(y_{T}) match any of H(x_{1}),...,H(x_{T}), by sorting the whole list. This has chance T^{2}/2^{n} of success, and has hardware cost on the scale of T. Free communication means that parallel sorting takes very little time, so with time limit 2^{n/3} we can afford 2^{n/3} Grover iterations, checking 2^{2n/3} possibilities for the vector (y_{1},...,y_{T}) and thus increasing the success chance from T^{2}/2^{n} to T^{2}/2^{n/3}. Running the same thing 2^{n/3}/T^{2} times in parallel has a good chance of success, and hardware cost on the scale of 2^{n/3}/T.
With realistic communication costs, sorting slows down to T^{1/2}. Losing a factor T^{1/2} in the number of Grover iterations would reduce the success chance to T/2^{n/3}, not beating single-target preimage search for the same amount of hardware. To do better, we combine Grover's method with the rho method: we compute roughly T^{1/2} iterations of H on each of y_{1},...,y_{T},x_{1},...,x_{T}, stopping at distinguished points, and sort the results. The paper explains how we do all of this reversibly, as required for Grover's method. In the end we save a factor T^{1/2-ε} in hardware cost for the same time limit.
The CNS paper instead compares H_{r}(y) to H(x_{1}),...,H(x_{T}). Since H_{r}(y) starts with r zero bits, there's no point in checking H(x_{i}) if H(x_{i}) doesn't start with r zero bits. CNS thus begin by paring the list H(x_{1}),...,H(x_{T}) down to a reduced list of approximately T/2^{r} targets that do start with r zero bits. Each guess for y thus has chance about (T/2^{r})/2^{n-r} = T/2^{n} of success. Grover's method uses 2^{(n-s)/2}/T^{1/2} iterations on each of 2^{s} parallel cores.
The CNS paper takes r as (2/3) log_{2} T so that serially checking T/2^{r} = T^{1/3} targets is balanced against the cost of an H_{r} computation. Each core thus takes time T^{1/3} 2^{(n-s)/2}/T^{1/2} = 2^{(n-s)/2}/T^{1/6}. To bring the time down to, say, 2^{n/3}, one must take 2^{s} as 2^{n/3}/T^{1/3}, and then the hardware cost is 2^{n/3}/T^{1/3}. This saves a factor T^{1/3} in hardware cost for the same time limit. (As before, this breaks down when T becomes very large.) My result with Banegas is better, at least asymptotically: T^{1/2-ε} is more of a savings than T^{1/3}.
Banegas and I put our paper online the week of SAC 2017. The CNS paper was posted a few weeks later, but the authors say that it was written independently, and there's no reason to question this. (The SAC and Asiacrypt submission deadlines were months earlier.) The CNS paper could and should have simply reported its straightforward T^{1/3} improvement for multi-target preimage search, without the false advertising discussed above. But then the paper wouldn't have been accepted to Asiacrypt.
[2022.01.09 update: Updated links above.]