Tung Chou and I have a new software framework called CryptAttackTester for high-assurance quantification of the costs of cryptographic attack algorithms. So far the framework includes two case studies: brute-force AES-128 key search, and, as a deeper case study, "information-set decoding" attacks against the McEliece cryptosystem. The accompanying paper also looks at other attacks, covering various old and new examples of how attack analyses have gone wrong.
One of the appendices in the paper, Appendix D, chops almost 10 bits out of the "gate" count for "primal" attacks against Kyber-512. This reduction uses a technique that in this blog post I'll call "clumping". Clumping should also reduce the "gate" counts for "dual" attacks, but for this blog post it suffices to consider primal attacks.
The reason I'm putting "gate" in quotes here is that this is using the concept of "gates" in the Kyber documentation. As we'll see below, this concept doesn't match what hardware designers call gates, in particular in its handling of memory access.
This blog post has four goals:
Explain how "31-bit clumping" reduces the "gate" count for these attacks by several bits. In particular, I'll explain what the Kyber documentation says is the central bottleneck in the attacks, and what clumping does.
Show that 31-bit clumping marginally increases costs when "gate" counts are replaced with a cost metric that accounts in a realistic way for the costs of memory access. This cost metric comes from the NTRU Prime documentation.
Show that 31-bit clumping reduces costs by several bits according to a calculation that NIST misattributed to the NTRU Prime documentation. Here are NIST's words: "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission".
In case you're thinking "Maybe NIST got the calculation right in most cases except for some mistake that somehow underestimates the cost of clumping": Show that the gap between NIST's misattributed calculation and the correct calculation comes from (1) NIST inflating the correct calculation and (2) this inflation factor being reduced by 31-bit clumping.
The formulas in question are easy to summarize:
If you correctly tally not just the cost of bit operations in an attack but also the cost of memory access in the attack, you end up with memcost + bitopscost.
Typically attacks are structured as a series of separate iterations, and the algorithm analysis decomposes into analyses of memory access per iteration, bit operations per iteration, and the number of iterations. The total cost is then (memcost/iter + bitopscost/iter) · iter.
NIST's "40 bits of security more than would be suggested by the RAM model" calculation, where NIST says this 2^{40} factor is the "real cost of memory access", is memcost/iter · bitopscost, which is equivalent to (memcost/iter · bitopscost/iter) · iter.
NIST's calculation is incorrectly replacing addition of memcost/iter and bitopscost/iter with multiplication of memcost/iter and bitopscost/iter. The multiplication is nonsense: it doesn't even pass basic type-checking.
Numerically, if memcost/iter is larger than bitopscost/iter, then NIST's calculation ends up multiplying the correct result by a fake bitopscost/iter factor. Clumping reduces that factor, as we'll see below.
Relationship to surrounding events. Readers who simply want to check what I'm saying above can skip past this section and read about the algorithms. But I think many readers will also be interested in the important procedural question of what previous review took place for NIST's calculation.
What's deeply concerning here is that NIST, despite claiming "We operate transparently. We've shown all our work", has tried very hard to sabotage review of its calculation of the Kyber-512 security level.
The starting point was NIST carrying out its analysis in secret, including consultations with the Kyber team. The records of those consultations still aren't public. NIST illegally stonewalled in response to my subsequent FOIA request. I've filed a FOIA lawsuit, but lawsuits take time to resolve.
In November 2022, NIST announced its conclusion that Kyber-512 was "unlikely" to be less secure than AES-128 when memory-access costs are taken into account. NIST didn't explain how it reached this conclusion. I had to ask repeatedly before NIST posted its calculation.
NIST's posting phrased the calculation in an unnecessarily obfuscated form, avoiding the normal tools that scientists use to help reviewers catch any errors that occur: definitions, formulas, double-checks, etc. NIST then repeatedly dodged my clarification questions about the calculation.
Eventually I decided that NIST's stonewalling couldn't be allowed to halt security review. So I wrote my previous blog post. That post summarizes NIST's calculation error, explains how important the error is in the context of NIST's decisions regarding Kyber, and goes line by line through what NIST had posted.
I also sent NIST's mailing list a much more concise sanity check that's flunked by NIST's calculation: an example of an attack where NIST's "40 bits of security more than would be suggested by the RAM model" is numerically very far above the estimated memcost and bitopscost in NIST's alleged sources. The critical point isn't the magnitude of this particular gap; the critical point is that NIST is inflating its security-level claims by using a cost calculation that's fundamentally wrong.
To highlight what exactly I was challenging, I started my mailing-list message with the critical NIST quote and a summary of my challenge:
'Perlner, Ray A. (Fed)' via pqc-forum writes:
40 bits of security more than would be suggested by the RAM model [ ... ] approximately 40 bits of additional security quoted as the "real cost of memory access" by the NTRUprime submission
This is (1) not plausible and (2) not what the NTRU Prime submission says. It is, in context, NIST exaggerating the Kyber-512 security level.
Some of the followups to my message were obviously off-topic, such as ad-hominem attacks, praise for Kyber-512's efficiency, and praise for NIST manipulating its minimum criteria to allow Kyber-512 (seriously: "adjust the definition of security category 1 if needed, but make sure to standardize ML-KEM-512").
Four replies sounded on-topic but were actually dodging. One structural aspect of the dodging is easy to see: the replies being portrayed as defenses of the challenged text don't quote the challenged text.
Regarding content details, the most important dodge is as follows. The dispute at hand is between the following two calculations:
Correct calculation: memcost + bitopscost = (memcost/iter + bitopscost/iter) · iter.
NIST's incorrect calculation: memcost/iter · bitopscost = (memcost/iter · bitopscost/iter) · iter.
The dodge is to hype the undisputed part—the multiplication by iter or, in more detail, various factors inside iter—while ignoring the disputed part—the multiplication of memcost/iter by bitopscost/iter.
My previous blog post had already emphasized the distinction between these two parts ("The research that would be needed for a correct calculation. To fix NIST's calculation, one needs to carefully distinguish two different effects: ..."). As part of describing this distinction, my previous blog post had emphasized that the memcost/iter · bitopscost/iter multiplication is wrong ("exactly the central mistake highlighted in this blog post") while the other multiplication is fine ("Multiplying the new iteration count by the cost of memory access per iteration makes perfect sense"). But it takes time for readers to read through everything and see that the replies to my message are misrepresenting the topic of dispute.
A further review difficulty is that the numbers showing up in the sanity check, such as 2^{151} from the Kyber documentation, are coming from very complicated algorithm analyses in the literature with very large uncertainties. It's terribly time-consuming for readers to go through those analyses.
I've realized that there's a much easier way to see that NIST's calculation is wrong: evaluate, from first principles, the cost of (1) a simple algorithm for carrying out a simple operation and (2) the same algorithm plus clumping. Again, what's important about clumping here is that it reduces bitopscost/iter without reducing memcost/iter.
The xor-and-popcount operation. The simple operation analyzed in this blog post is the operation stated on page 11 of an Asiacrypt 2020 paper: "load h(v) from memory, compute the Hamming weight of h(u) ⊕ h(v), and check whether the Hamming weight is less than or equal to k".
This blog post focuses on the cost of carrying out many iterations, let's say P iterations, of this xor-and-popcount operation. I'll explain what the 2020 paper says about the cost, explain how clumping does better, look at what the NTRU Prime cost metric says about the costs of the non-clumped and clumped algorithms, and look at what NIST's calculation says about these algorithms.
This xor-and-popcount operation is directly on point: it's the core of the primal attack considered in the latest Kyber documentation. (Readers who simply want to see the algorithm analysis and don't care about this context can skip to the next section.)
I'm not assuming that readers are familiar with this attack, and I'm also not asking readers to trust me that the xor-and-popcount operation analyzed here is the core of that attack. This is easy to check. Here's an excerpt from Appendix D of the CryptAttackTester paper saying how to check, starting from the latest Kyber specification:
That specification ... estimates [123, page 27] that attacking Kyber-512 involves 2^{151.5} "gates": specifically, 2^{14.1} calls to "AllPairSearch", times "a cost of about 2^{137.4} gates for AllPairSearch in dimension 375". The latter cost is based on "explicit gate counts for the innermost loop operations (XOR-popcounts, inner products)" and is attributed to [9]. ... The paper [9] says that a "XOR and Population Count" operation, "popcount", is its "primary optimisation target". This operation "loads u and v from specified memory addresses, computes h(u) and h(v), computes the Hamming weight of h(u) ⊕ h(v), and checks whether it is less than or equal to k".
The underlying paper "[9]" is the 2020 paper. That paper has "quantum" in the title but also considers non-quantum computations; the 2^{151.5} is non-quantum, and I'm similarly focusing on non-quantum computations throughout this blog post.
The "primary optimisation target" quote comes from page 2 of the 2020 paper. The "loads" quote comes from page 11. The 2020 paper goes on to say h(u) is cached, so its xor-and-popcount algorithms simply "load h(v) from memory, compute the Hamming weight of h(u) ⊕ h(v), and check whether the Hamming weight is less than or equal to k". That's the xor-and-popcount description I gave above.
As a side note, Appendix D of the CryptAttackTester paper also explains how clumping reduces "gate" counts for the secondary "inner products" operation (which was briefly mentioned in one quote above). For this blog post, there's no reason to go through this extra work. Simply comparing two algorithms for the primary xor-and-popcount bottleneck will show how NIST is misevaluating algorithm costs.
"Gates" for the 2020 algorithm. Let's start with the "Hamming weight" computation.
The Hamming weight of a bit vector is simply the sum of the bits of the vector, i.e., the number of bits set to 1. For example, the Hamming weight of (1,0,1,1,0) is 3.
How do you build a circuit to add two bits (x,y), representing the output as a 2-bit result? The obvious approach is to separately compute the bottom bit, which is x XOR y, and the top bit, which is x AND y; I'll write this in little-endian order as (x XOR y,x AND y). XOR and AND are among the "gates" allowed in the 2020 paper, so this computation costs 2 "gates".
How about three bits x,y,z? The easy part is that the bottom bit of the result is x XOR y XOR z. The top bit is set if the majority of x,y,z are 1; one way to compute this is as (x AND y) OR ((x XOR y) AND z). This might look like 6 "gates", but the x XOR y is shared, giving just 5 "gates".
Another approach is to split three-bit addition into first adding x and y, giving (x XOR y,x AND y); then adding x XOR y to z, giving (x XOR y XOR z,(x XOR y) AND z); and then adding the two top bits, giving (x AND y) XOR ((x XOR y) AND z), obviously with no carry possible into a further bit since x+y+z always fits into 2 bits. This is essentially the same as the formula from the previous paragraph, except for XOR vs. OR, either of which is allowed as a cost-1 "gate".
Let's jump to adding 7 bits (t,u,v,w,x,y,z):
First use 5 "gates" to add 3 bits (t,u,v), producing a 2-bit intermediate result (a,b) representing a+2b = t+u+v.
Then use another 5 "gates" to add another 3 bits (w,x,y), producing another 2-bit intermediate result (c,d) representing c+2d = w+x+y.
The remaining problem is to add a+2b, c+2d, and the last input bit z. Use 5 "gates" to compute a+c+z, obtaining (e,f), which represents e+2f; and then use 5 "gates" to compute b+d+f, obtaining (g,h), which represents g+2h. The answer is then (e,g,h), representing e+2g+4h.
Overall this takes 20 "gates".
If there are 15 bits to add, then the same standard approach takes 20 "gates" to add 7 bits, 20 "gates" to add another 7 bits, and then 15 "gates" to add the intermediate 3-bit results to the last bit, for a total of 55 "gates".
More generally, if there are n bits when n is 2^{ℓ}−1, then this approach uses 5(n−ℓ) "gates", slightly smaller than 5n. You can also use an unbalanced split to handle intermediate sizes of n.
The 2020 paper concludes on page 13 that the "overall instruction count is 6n−4ℓ−5" where ℓ is the number of bits in n. Multiplying this instruction count by P for P xor-and-popcount iterations gives a total of (6n−4ℓ−5)·P.
The reason that the main term in the 2020 paper is 6n instead of 5n is that there's also an initial computation of "h(u) ⊕ h(v)", which the paper straightforwardly handles with n XOR "gates". There's also a small computation at the end to check whether the Hamming weight "is less than or equal to k". As for retrieving the h(v) input in the first place (remember that h(u) was cached), the 2020 paper says on page 13 that "loading h(v) has cost 1".
Suddenly the hardware designers in my audience are jumping up to object: "Wait, what? They think it costs just 1 gate to do an entire memory lookup for h(v)?"
As I said before, the set of "gates" that the Kyber documentation is talking about isn't what hardware designers expect. In particular, it includes a "gate" that carries out a memory lookup in an arbitrarily large array. The way page 4 of the 2020 paper phrases this is that the paper considers "programs for RAM machines (random access memory machines)" where the instruction set has "NOT, AND, OR, XOR, LOAD, STORE" operations and "the cost of a RAM program is the number of instructions that it performs". So "LOAD" has cost 1 by definition, just like XOR.
In the case of Kyber-512, the 2020 paper ends up selecting n = 511. (Page 23 of the paper says "we only consider values of the popcount parameter n that are one less than a power of two"; the code at the end of the paper keeps doubling until it reaches or exceeds the input dimension.) For this choice of n, the "instruction count" in the 2020 paper—the number of "gates" in the Kyber documentation—is 3025 per xor-and-popcount, or 3025·P for the total algorithm.
"Gates" for a clumped algorithm. Clumping does a better job of exploiting the 2020 paper's declaration that "LOAD" (e.g., "loading h(v)") has cost just 1.
As a small example, say we want to add 7 bits a,b,c,d,e,f,g. Instead of spending 20 "gates" on bit operations as above, we can spend just 1 "gate" on a table lookup indexed by (a,b,c,d,e,f,g). The table has 2^{7} entries, each with a 3-bit output.
Compared to what the 2020 paper achieves in its selected metric for its "primary optimisation target", suddenly there's a factor 20 disappearing!
(Did you think clumping was going to be something difficult? Why? Because the 2020 paper was at a flagship IACR conference? Because NIST claimed on page 18 of its 2022 selection report that the Kyber documentation "included a thorough and detailed security analysis"?)
Back to the algorithm analysis. Can we really get this factor 20 compared to the original 6n gates, if the 6n includes n XORs to compute h(u) ⊕ h(v) in the first place?
Yes, we can, by simply absorbing the XORs into the table lookups. Say we're starting with a,b,c,d,e,f,g and A,B,C,D,E,F,G, and we want to compute (a XOR A) + (b XOR B) + (c XOR C) + (d XOR D) + (e XOR E) + (f XOR F) + (g XOR G). A table lookup indexed by the 14 input bits costs just 1 "gate".
There's no reason to stop with 7+7 bits. Larger and larger tables reduce the "gate" counts more and more, as long as the tables don't become large enough for the initial table-computation time to be a bottleneck. For the cost evaluations in this blog post, I'll assume that P, the number of xor-and-popcount iterations, is large enough to make the table-computation time negligible.
For example, let's say we take a table of size 2^{62}, mapping two 31-bit inputs to the 5-bit sum of bits of the xor of the inputs. This is what I'm calling "31-bit clumping". The Kyber-512 attacks in the Kyber documentation use more memory than this, and the table-computation time isn't a bottleneck.
We can use this table of size 2^{62} to handle n-bit vectors h(u) and h(v). Take 31-bit pieces of h(u) and h(v), and do one table lookup to compress each piece of h(u) and the corresponding piece of h(v) to a 5-bit sum. This uses just ceil(n/31) "gates", and reduces the amount of data to handle by a factor close to 12.4. Then apply another table of size just 2^{50} that takes 10 5-bit sums and produces an 8-bit output; ceil(n/310) "gates" then compress the amount of data to handle by another factor close to 6.25. Et cetera.
Compared to the original number of "gates" (around 6·n), 31-bit clumping is saving well over a factor 100 for, e.g., the n = 511 used in the 2020 paper for Kyber-512.
Appendix D of the CryptAttackTester paper uses larger table sizes and gets the "gate" count for 511-bit xor-and-popcount down to 8, which is 378 times smaller than the 3025 "gates" from the 2020 paper. That's the almost-10-bit improvement mentioned above. (To be more precise, it's an 8.56-bit improvement for the xor-and-popcount operation. For a full analysis of the impact on the primal attack, one also needs to quantify the speedup for "inner products" et al.; this blog post focuses on xor-and-popcount.)
The 2020 algorithm in the NTRU Prime cost metric. Page 57 of the NTRU Prime documentation says "we estimate the cost of each access to a bit within N bits of memory as the cost of N^{0.5}/2^{5} bit operations".
The documentation explains how the N^{0.5} comes from standard two-dimensional models of circuits, and explains how the 2^{5} denominator is derived from energy figures published by Intel. Earlier versions of the NTRU Prime documentation had simply used N^{0.5} without trying to estimate this denominator.
(The documentation also mentions that three-dimensional models reduce exponent 1/2 to 1/3, but that it isn't known how to do better than 1/2 in metrics that account for the cost of energy transmission. One can spend endless time considering the impact of a long list of different metrics. What matters for evaluating NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" claim is specifically the NTRU Prime cost metric.)
Let's calculate costs in this metric for the 2020 xor-and-popcount algorithm:
There's an initial lookup of h(v) in a large table, let's say a table using T bits of memory overall. That's n bits accessed, each of which has the same cost in this metric as T^{0.5}/2^{5} bit operations, for a total of n·T^{0.5}/2^{5}.
There are then about 6n real bit operations that I was describing earlier, basically 5n for the Hamming-weight computation plus n for the vector xor.
The total cost of P xor-and-popcount iterations is then (n·T^{0.5}/2^{5} + 6n)·P.
If T is 2^{70} then n·T^{0.5}/2^{5} is 2^{30}·n, which is on a much larger scale than 6n. I'll assume that T is at least this large. The total cost is then basically (n·T^{0.5}/2^{5})·P.
Algorithm designers looking at these numbers will be thinking something like this: "Hmmm, given that the memory access is so expensive, what else can I do with low-cost bit operations to extract more value out of the data retrieved from memory, and ultimately to get the same job done with fewer iterations?" As my previous blog post says, "a change of cost metric also forces reoptimization of the entire stack of attack subroutines, along with all applicable parameters".
If the job is defined as P iterations of xor-and-popcount, as in this blog post, then, well, that's a job inherently dominated by memory access in any realistic cost metric once the table sizes are large. NIST's miscalculation (see below) is inflating the cost by a factor of "only" 3000 for the sizes of interest here.
But if the job is actually to break a lattice system, then xor-and-popcount is just one of many options in the literature, an option that has attracted attention because of its low "gate" count. Someone trying to optimize large-scale attacks in a realistic cost metric will consider more sophisticated inner loops with smaller iteration counts even when each iteration has "gate" counts in the millions or billions, also meaning that NIST's inflation factor is millions or billions.
Anyway, this blog post focuses on the 2020 algorithm for xor-and-popcount and a clumped version of that. This is enough to see that NIST used the wrong calculation.
The clumping algorithm in the NTRU Prime cost metric. Status so far: we have the "gate" counts for xor-and-popcount with and without clumping, and the NTRU Prime metric for xor-and-popcount without clumping. Next step: NTRU Prime metric for xor-and-popcount with clumping.
Consider, as one of the concrete examples from above, 31-bit clumping: clumping 31-bit xor-and-popcount operations into a 2^{62}-entry table. The table has 5-bit outputs, so 5·2^{62} bits overall. The cost of each access to a bit within 5·2^{62} bits is, in this metric, the cost of (5·2^{62})^{0.5}/2^{5} bit operations, so retrieving 5 bits costs 5·(5·2^{62})^{0.5}/2^{5} bit operations. If that table is reused for n-bit xor-and-popcount operations, then there are about n/31 of these table lookups, together costing 5·n·(5·2^{62})^{0.5}/(2^{5}·31) bit operations, i.e., about 2^{24.53}n bit operations.
There are then further table lookups for adding the 5-bit results, but those are retrieving a smaller number of bits from smaller tables. Without doing a precise calculation, I'll estimate the total as 2^{24.6}n bit operations.
This is vastly worse in this metric (and in reality) than the circuits from the 2020 paper, which take about 6n bit operations. On the other hand, to see the effect of this slowdown on the total attack cost, one also has to consider the cost of retrieving h(v), namely n·T^{0.5}/2^{5} in this metric. I assumed T was at least 2^{70}, so 2^{24.6}n bit operations add at most a few percent to n·T^{0.5}/2^{5}.
More generally, if you pick any particular size and structure for the clumping tables, then you'll be able to compute numbers C and R so that the clumping reduces "gate" counts from about 6n to about 6n/R, while costing about C·n bit operations in the NTRU Prime cost metric. As above, this clumping has no effect on the cost of retrieving h(v).
To recap the total costs of P iterations:
"Gates", 2020 algorithm: about 6n·P.
"Gates", clumping algorithm: about (6n/R)·P.
NTRU Prime cost metric, 2020 algorithm: about (n·T^{0.5}/2^{5})·P. I've suppressed the 6n here: changing from n·T^{0.5}/2^{5} to n·T^{0.5}/2^{5} + 6n makes a negligible difference under my assumption on T.
NTRU Prime cost metric, clumping algorithm: about (n·T^{0.5}/2^{5} + C·n)·P. I've suppressed the negligible 6n/R here, but either the T term or the C term could dominate depending on how large the clumping-table sizes are.
Clumping thus reduces "gate" counts but doesn't reduce costs in the NTRU Prime metric. Specifically, small clumping (such as 31-bit clumping) produces some reduction in "gate" counts and a marginal overall slowdown in the NTRU Prime cost metric; large clumping produces a larger reduction in "gate" counts and a drastic slowdown in the NTRU Prime cost metric.
The 2020 algorithm according to NIST's calculation. Let's now calculate what NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" says about P iterations of the 2020 xor-and-popcount algorithm.
First question: what's the "RAM model"? NIST's posting says that this doesn't account for memory access ("The RAM model ignores the cost of this memory access"). NIST's posting plugs in "gate" counts from a paper saying that it's improving on the "gate" counts from the Kyber submission.
Readers will interpret this as referring to the same metric as "gates" in the Kyber submission, "instruction count" for "RAM machines" in the 2020 paper, etc. I'm going to take the same interpretation: P iterations of xor-and-popcount cost about 6n·P in "the RAM model".
As for the 2^{40} "real cost of memory access", one of the ways that NIST's posting interferes with review is by not stating the formula that this 2^{40} is supposed to be an example of. But NIST claimed that it was simply quoting this as the "real cost of memory access" from the NTRU Prime documentation, so let's look at what the NTRU Prime documentation actually says about that cost:
Underneath everything, there's the cost metric on page 57: "we estimate the cost of each access to a bit within N bits of memory as the cost of N^{0.5}/2^{5} bit operations."
Perhaps NIST obtained its 2^{40} by combining this with some estimate for the volume of memory access inside a Kyber-512 attack iteration. Much earlier, in email dated 9 Jun 2020 15:39:09 +0000, NIST had quoted the Kyber team estimating that an attack needs "2^{89} bits" of memory. (A closer look shows that the literature leading to this number was trying to optimize for "gates"; see above regarding the requirement for reoptimization when the cost metric changes.)
There's a scalability statement starting on page 65, derived from the 0.5 exponent in the cost metric: "Sieving algorithms in the literature typically use a massive database, size 2^{(0.2075...β+o(β))} (and often even larger). This multiplies the real cost of the algorithms by 2^{(0.1037...β+o(β))}; see Section 6.6."
Perhaps NIST obtained its "approximately 40" by taking β = 406 from the Kyber documentation (page 21) and computing 0.1037·406, about 42.1. (As a side note, this would mean that NIST ignored the o(β), which would be a mathematical error by NIST, no matter what quantification NIST has in mind for "approximately 40". For example, ignoring the o(β) means ignoring the 2^{5} denominator from the cost metric, and similarly ignores all subexponential numerators and denominators appearing in T as a function of β. The definition of "o(β)" in mathematics does not state that the result is within any particular range for any specific value of β; it's a purely asymptotic statement. NIST recently gave this o(β) quote and claimed that the quote supports something about 40; that's the same mathematical error.)
NIST's December 2022 posting told readers to "see section 6.11 of https://ntruprime.cr.yp.to/nist/ntruprime-20201007.pdf". That's a two-page section reporting many different strategies to estimate total attack cost. Two of the resulting estimates are 0.396β and 0.292β.
Perhaps NIST obtained its "approximately 40" by taking β = 406 as above and plugging that into the difference between 0.396β and 0.292β. (As a side note, this would mean that NIST was ignoring the warning close to the top of the section: "Given the issues described in Sections 6.4 through 6.9, it seems reasonably clear that all of these strategies are wrong, but that some strategies are more wrong than others.")
Perhaps NIST obtained its 40 by simply subtracting 129 from 169 at the top of Table 2 in the documentation, the table of security levels. (This remains my top theory for how NIST originally obtained its "approximately 40"; it's easier than the other ways of obtaining this number.)
This table row is for sntrup653, which is the smallest NTRU Prime option. The 2^{129} is "Core-SVP", which is a rough estimate from 2015 for the number of iterations in attacks. The 2^{169} multiplies this by a rough estimate of the memory-access cost per iteration.
Any of these can explain NIST claiming that the NTRU Prime documentation obtained approximately 2^{40} as "the real cost of memory access". Each explanation starts with the NTRU Prime cost metric, which says that accessing b bits in a T-bit table costs b·T^{0.5}/2^{5}. Each explanation then plugs in an estimate for (T,b) for each iteration. The explanations vary in precision, for example with some of the formulas suppressing the 2^{5} denominator, but ultimately all of the explanations come from the same metric.
For P iterations of xor-and-popcount with the 2020 algorithm, this b·T^{0.5}/2^{5} memory-access cost per iteration is exactly the n·T^{0.5}/2^{5} from earlier in this blog post. NIST's "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is this n·T^{0.5}/2^{5}. NIST's "40 bits of security more than would be suggested by the RAM model" is telling us to multiply n·T^{0.5}/2^{5} by about 6n·P, obtaining about (n·T^{0.5}/2^{5})·6n·P.
For comparison, remember that this algorithm costs only about (n·T^{0.5}/2^{5})·P in the NTRU Prime cost metric. NIST's calculation is inflating this by a factor 6n, the number of "gates" per iteration. This inflation comes directly from the incorrect structure built into NIST's calculation, namely taking an estimate for the "real cost of memory access" and then multiplying that by a total "gate" count.
Clumping according to NIST's calculation. One more calculation remains: let's apply NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" to P iterations of the clumped xor-and-popcount algorithm.
As above, I'm going to interpret the "RAM model" as referring to the same "gates" metric as in the Kyber submission, so P iterations of clumped xor-and-popcount cost only (6n/R)·P in "the RAM model".
Appendix D of the CryptAttackTester paper points out that there's an incompatible description of "the RAM model" on page 81 of the 2022 NIST selection report (which doesn't label this description as incompatible!). Appendix D shows that clumping reduces "gates" with that description too. The exact size of the reduction factor R doesn't matter for the calculations below.
As also covered above, the "40 bits" that NIST says it obtains from the NTRU Prime documentation as the "real cost of memory access" is ultimately coming from the NTRU Prime metric, which is n·T^{0.5}/2^{5} + C·n for each iteration of this algorithm.
NIST's multiplication of the "real cost of memory access" by cost in "the RAM model" then produces (n·T^{0.5}/2^{5} + C·n)·(6n/R)·P. This is again inflated compared to the NTRU Prime cost metric, but this time by a factor only 6n/R instead of 6n.
In particular, for 31-bit clumping, recall that R is well over 100, while the C·n term is negligible. NIST's calculation then says that 31-bit clumping reduces costs by a factor over 100. This reduction is not plausible as a statement about reality; more to the point, it is not what the NTRU Prime cost metric says.
Summary. The NTRU Prime documentation said "we estimate the cost of each access to a bit within N bits of memory as the cost of N^{0.5}/2^{5} bit operations", and gave a rough estimate of 2^{40} for the memory-access cost per iteration for sntrup653.
For each iteration of the 2020 xor-and-popcount algorithm with n-bit vectors and a T-bit table, the NTRU Prime cost metric says that memory access costs n·T^{0.5}/2^{5}. This is added to the 6n bit operations for XOR and Hamming-weight computation. Clumping reduces those bit operations but increases memory-access costs and total costs.
NIST's "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is looking at the memory-access cost in each iteration. NIST's "40 bits of security more than would be suggested by the RAM model" is incorrectly multiplying the memory-access cost per iteration by bit operations.
NIST's calculation ends up inserting a fake factor 6n into the costs for the 2020 algorithm. It also inserts a fake factor into the costs for clumping, but a smaller factor than 6n because clumping uses fewer "gates". For more sophisticated attack algorithms that use millions or billions of "gates" to process the results of memory access in each iteration, NIST's calculation inflates costs by a factor of millions or billions.
That's the end of the algorithm comparison in this blog post.
Weaponizing ambiguity. As noted above, NIST has sabotaged review of its calculation in various ways, in particular by systematically dodging my clarification questions. This is an assault against falsifiability. Here I am doing all this work to debunk what NIST's words communicate to readers; but people might claim that, no, no, NIST actually meant something else instead.
I sometimes notice ambiguities in security claims and ask the authors what they meant. The authors clarify. What happened with NIST was different: NIST issued security claims, but then dodged clarification requests regarding those security claims.
Procedurally, if a security reviewer identifies what a security claim seems to be saying, asks for confirmation, and doesn't receive clarification one way or the other, then at some point the reviewer has to be able to say, okay, I'm going to review the simplest interpretation, and the source of the claim has lost its chance to retroactively "clarify" that it meant something else.
But maybe some readers aren't satisfied with a procedural answer. So let's focus on the content, and look at how hard it would be to build a coherent story around the claim that NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is matching what the NTRU Prime documentation actually said.
There are two clear constraints on the details of the story. Specifically, look at what's needed for NIST's "40 bits of security more than would be suggested by the RAM model", with 2^{40} arising as the "real cost of memory access", to match the NTRU Prime cost metric for the two xor-and-popcount algorithms considered above:
For the 2020 algorithm, to make the "real cost of memory access" times 6n match n·T^{0.5}/2^{5}, one needs the "real cost of memory access" to mean not simply the cost of the memory access in each iteration, but rather the ratio between the cost of the memory access in each iteration and the "gates" in each iteration.
For the clumping algorithm, to make 2^{40} times 6n/R match n·T^{0.5}/2^{5} + C·n, one needs the same thing, and one needs the "gates" concept to allow a giant memory-access operation as a single "gate", since the R part doesn't work without this operation.
And then, to justify NIST saying that this was "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission", one would need to find quotes from the NTRU Prime documentation showing that the documentation was
computing the "gates" per iteration,
specifically using a notion of "gates" that allows "operations" as large as giant memory-access operations,
torturing the words "real cost of memory access" to mean the ratio between memory-access cost and "gates", and
claiming that this ratio is around 2^{40} for sizes around Kyber-512.
Every step of this fiction is contrary to something the NTRU Prime documentation does say:
Page 70 of the documentation cites the 2020 paper as giving a "reasonably precise analysis of an important subroutine inside sieving", but does not report the results in more detail than "polynomial". (The documentation continues by saying that "incorporating this polynomial into security analyses could lose accuracy overall" and giving one reason why.)
Section 6.6 of the documentation criticizes "operation" notions that allow large memory access as one "operation", and concludes that "simply counting 'operations' does not correctly assign cost exponents to algorithms".
Page 57 of the documentation says "we estimate the cost of each access to a bit within N bits of memory as the cost of N^{0.5}/2^{5} bit operations". This is not dividing by the number of "gates" carried out on the bits retrieved from memory.
As mentioned above, Table 2 of the documentation (page 103) shows the Core-SVP estimate of 2^{129} iterations for sntrup653, and shows an estimated memory-access cost of 2^{169}, with ratio 2^{40}.
For comparison, notice that if the ratio between the memory-access cost per iteration and the "gates" per iteration were around 2^{40} for Kyber-512 then the memory-access cost per iteration by itself would have been around 2^{52} for Kyber-512 (since we've seen that the "primary" component of the "gates" per iteration was estimated as costing thousands of "gates"), and around 2^{57} for sntrup653, which is nowhere near this 2^{40} ratio from Table 2.
But there's something even more bizarre about the claim that NIST's calculation matches the NTRU Prime cost metric. Here it comes.
What, according to this claim, is the complete calculation expressed by NIST's words "40 bits of security more than would be suggested by the RAM model"?
We've seen that this claim forces NIST's 2^{40} "real cost of memory access" to actually mean the ratio between the cost of the memory access per iteration and the "gates" per iteration. This means that the complete calculation is as follows:
Take the cost of the memory access in each iteration.
Divide that by the number of "gates" in each iteration.
Then multiply again by "gates".
In formulas: if the "real cost of memory access" isn't supposed to be memcost/iter, but rather the ratio (memcost/iter)/(bitopscost/iter), then NIST's "40 bits of security more than would be suggested by the RAM model" isn't supposed to be computing (memcost/iter) · bitopscost, but rather ((memcost/iter)/(bitopscost/iter)) · bitopscost.
Why would anyone want to carry out the computation this way, instead of simply multiplying memcost/iter by iter? Why bother computing "gates" and dividing something else by "gates" and then multiplying in a way to cancel out the "gates"? This makes no sense.
It's not as if the ratio in this story, (memcost/iter)/(bitopscost/iter), has some independent meaning as an important algorithmic constant that could be deduced without looking at bitopscost/iter in the first place. Memory access and "gates" are separate variables. Sure, some algorithm changes have the same effect on memory access and "gates", but some algorithm changes affect just or one or the other, or provide tradeoffs between the two. For example, clumping reduces "gates", but doesn't reduce memory access, and doesn't reduce cost in the NTRU Prime cost metric.
In the latter metric (and any other realistic model of the cost of memory access), the bit operations for computations inside xor-and-popcount are obviously dwarfed by the cost of memory access as soon as T is reasonably large. These costs add; they don't multiply. So one doesn't end up seeing the "gates" in the final cost tallies.
NIST's "40 bits of security more than would be suggested by the RAM model" is
taking numbers in "the RAM model" that include, among other things, the "gates" for P xor-and-popcount operations, and then
multiplying those "gates" by about 2^{40} for the "real cost of memory access" ("approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission").
The only way to have "gates" correctly disappear from the product is to redefine the "real cost of memory access" to have "gates" as a denominator; but then where is the 40 supposed to be coming from? This redefinition turns "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" into a free-floating fantasy, not just divorced from what any reader could reasonably extract from NIST's words but also divorced from what the NTRU Prime documentation actually said.
To summarize, the ambiguities in NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" don't lead to any universe in which NIST got this right. These ambiguities simply waste time for people checking NIST's work. In this case, NIST's work is a botched security-level calculation leading directly to NIST's selection of Kyber-512 for standardization.
I'll close this blog post by letting the reader contemplate words from Matt Scholl, chief of NIST's Computer Security Division, a fuller version of a quote given above: "We operate transparently. We’ve shown all our work and ensured that there’s traceability so that folks can see from the last selection round all the way back to what we selected, why and when."