cr.yp.to: 2023.10.23: Reducing "gate" counts for Kyber-512

2023.10.23: Reducing "gate" counts for Kyber-512: Two algorithm analyses, from first principles, contradicting NIST's calculation. #xor #popcount #gates #memory #clumping

Tung Chou and I have a new software framework called CryptAttackTester for high-assurance quantification of the costs of cryptographic attack algorithms. So far the framework includes two case studies: brute-force AES-128 key search, and, as a deeper case study, "information-set decoding" attacks against the McEliece cryptosystem. The accompanying paper also looks at other attacks, covering various old and new examples of how attack analyses have gone wrong.

One of the appendices in the paper, Appendix D, chops almost 10 bits out of the "gate" count for "primal" attacks against Kyber-512. This reduction uses a technique that in this blog post I'll call "clumping". Clumping should also reduce the "gate" counts for "dual" attacks, but for this blog post it suffices to consider primal attacks.

The reason I'm putting "gate" in quotes here is that this is using the concept of "gates" in the Kyber documentation. As we'll see below, this concept doesn't match what hardware designers call gates, in particular in its handling of memory access.

NIST's calculation is incorrectly replacing addition of memcost/iter and bitopscost/iter with multiplication of memcost/iter and bitopscost/iter. The multiplication is nonsense: it doesn't even pass basic type-checking.

Numerically, if memcost/iter is larger than bitopscost/iter, then NIST's calculation ends up multiplying the correct result by a fake bitopscost/iter factor. Clumping reduces that factor, as we'll see below.

Relationship to surrounding events. Readers who simply want to check what I'm saying above can skip past this section and read about the algorithms. But I think many readers will also be interested in the important procedural question of what previous review took place for NIST's calculation.

The starting point was NIST carrying out its analysis in secret, including consultations with the Kyber team. The records of those consultations still aren't public. NIST illegally stonewalled in response to my subsequent FOIA request. I've filed a FOIA lawsuit, but lawsuits take time to resolve.

In November 2022, NIST announced its conclusion that Kyber-512 was "unlikely" to be less secure than AES-128 when memory-access costs are taken into account. NIST didn't explain how it reached this conclusion. I had to ask repeatedly before NIST posted its calculation.

NIST's posting phrased the calculation in an unnecessarily obfuscated form, avoiding the normal tools that scientists use to help reviewers catch any errors that occur: definitions, formulas, double-checks, etc. NIST then repeatedly dodged my clarification questions about the calculation.

Eventually I decided that NIST's stonewalling couldn't be allowed to halt security review. So I wrote my previous blog post. That post summarizes NIST's calculation error, explains how important the error is in the context of NIST's decisions regarding Kyber, and goes line by line through what NIST had posted.

I also sent NIST's mailing list a much more concise sanity check that's flunked by NIST's calculation: an example of an attack where NIST's "40 bits of security more than would be suggested by the RAM model" is numerically very far above the estimated memcost and bitopscost in NIST's alleged sources. The critical point isn't the magnitude of this particular gap; the critical point is that NIST is inflating its security-level claims by using a cost calculation that's fundamentally wrong.

To highlight what exactly I was challenging, I started my mailing-list message with the critical NIST quote and a summary of my challenge:

Some of the followups to my message were obviously off-topic, such as ad-hominem attacks, praise for Kyber-512's efficiency, and praise for NIST manipulating its minimum criteria to allow Kyber-512 (seriously: "adjust the definition of security category 1 if needed, but make sure to standardize ML-KEM-512").

Four replies sounded on-topic but were actually dodging. One structural aspect of the dodging is easy to see: the replies being portrayed as defenses of the challenged text don't quote the challenged text.

Regarding content details, the most important dodge is as follows. The dispute at hand is between the following two calculations:

The dodge is to hype the undisputed part—the multiplication by iter or, in more detail, various factors inside iter—while ignoring the disputed part—the multiplication of memcost/iter by bitopscost/iter.

My previous blog post had already emphasized the distinction between these two parts ("The research that would be needed for a correct calculation. To fix NIST's calculation, one needs to carefully distinguish two different effects: ..."). As part of describing this distinction, my previous blog post had emphasized that the memcost/iter · bitopscost/iter multiplication is wrong ("exactly the central mistake highlighted in this blog post") while the other multiplication is fine ("Multiplying the new iteration count by the cost of memory access per iteration makes perfect sense"). But it takes time for readers to read through everything and see that the replies to my message are misrepresenting the topic of dispute.

A further review difficulty is that the numbers showing up in the sanity check, such as 2¹⁵¹ from the Kyber documentation, are coming from very complicated algorithm analyses in the literature with very large uncertainties. It's terribly time-consuming for readers to go through those analyses.

I've realized that there's a much easier way to see that NIST's calculation is wrong: evaluate, from first principles, the cost of (1) a simple algorithm for carrying out a simple operation and (2) the same algorithm plus clumping. Again, what's important about clumping here is that it reduces bitopscost/iter without reducing memcost/iter.

The xor-and-popcount operation. The simple operation analyzed in this blog post is the operation stated on page 11 of an Asiacrypt 2020 paper: "load h(v) from memory, compute the Hamming weight of h(u) ⊕ h(v), and check whether the Hamming weight is less than or equal to k".

This blog post focuses on the cost of carrying out many iterations, let's say P iterations, of this xor-and-popcount operation. I'll explain what the 2020 paper says about the cost, explain how clumping does better, look at what the NTRU Prime cost metric says about the costs of the non-clumped and clumped algorithms, and look at what NIST's calculation says about these algorithms.

This xor-and-popcount operation is directly on point: it's the core of the primal attack considered in the latest Kyber documentation. (Readers who simply want to see the algorithm analysis and don't care about this context can skip to the next section.)

I'm not assuming that readers are familiar with this attack, and I'm also not asking readers to trust me that the xor-and-popcount operation analyzed here is the core of that attack. This is easy to check. Here's an excerpt from Appendix D of the CryptAttackTester paper saying how to check, starting from the latest Kyber specification:

The underlying paper "[9]" is the 2020 paper. That paper has "quantum" in the title but also considers non-quantum computations; the 2^151.5 is non-quantum, and I'm similarly focusing on non-quantum computations throughout this blog post.

The "primary optimisation target" quote comes from page 2 of the 2020 paper. The "loads" quote comes from page 11. The 2020 paper goes on to say h(u) is cached, so its xor-and-popcount algorithms simply "load h(v) from memory, compute the Hamming weight of h(u) ⊕ h(v), and check whether the Hamming weight is less than or equal to k". That's the xor-and-popcount description I gave above.

As a side note, Appendix D of the CryptAttackTester paper also explains how clumping reduces "gate" counts for the secondary "inner products" operation (which was briefly mentioned in one quote above). For this blog post, there's no reason to go through this extra work. Simply comparing two algorithms for the primary xor-and-popcount bottleneck will show how NIST is misevaluating algorithm costs.

"Gates" for the 2020 algorithm. Let's start with the "Hamming weight" computation.

The Hamming weight of a bit vector is simply the sum of the bits of the vector, i.e., the number of bits set to 1. For example, the Hamming weight of (1,0,1,1,0) is 3.

How do you build a circuit to add two bits (x,y), representing the output as a 2-bit result? The obvious approach is to separately compute the bottom bit, which is x XOR y, and the top bit, which is x AND y; I'll write this in little-endian order as (x XOR y,x AND y). XOR and AND are among the "gates" allowed in the 2020 paper, so this computation costs 2 "gates".

How about three bits x,y,z? The easy part is that the bottom bit of the result is x XOR y XOR z. The top bit is set if the majority of x,y,z are 1; one way to compute this is as (x AND y) OR ((x XOR y) AND z). This might look like 6 "gates", but the x XOR y is shared, giving just 5 "gates".

Another approach is to split three-bit addition into first adding x and y, giving (x XOR y,x AND y); then adding x XOR y to z, giving (x XOR y XOR z,(x XOR y) AND z); and then adding the two top bits, giving (x AND y) XOR ((x XOR y) AND z), obviously with no carry possible into a further bit since x+y+z always fits into 2 bits. This is essentially the same as the formula from the previous paragraph, except for XOR vs. OR, either of which is allowed as a cost-1 "gate".

If there are 15 bits to add, then the same standard approach takes 20 "gates" to add 7 bits, 20 "gates" to add another 7 bits, and then 15 "gates" to add the intermediate 3-bit results to the last bit, for a total of 55 "gates".

More generally, if there are n bits where n is 2^ℓ−1, then this approach uses 5(n−ℓ) "gates", slightly smaller than 5n. You can also use an unbalanced split to handle intermediate sizes of n.
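The adder construction above can be checked mechanically. Here is a minimal Python sketch, not taken from the 2020 paper or the CryptAttackTester paper: the names `GateCounter`, `full_adder`, `hamming_weight`, and `gate_count` are hypothetical, and the balanced split (two halves plus one last bit fed in as a carry) is one assumed way of organizing the additions. It counts each XOR, AND, and OR as one "gate".

```python
class GateCounter:
    """Counts cost-1 "gates" (XOR, AND, OR) as they are applied to bits."""

    def __init__(self):
        self.count = 0

    def xor(self, a, b):
        self.count += 1
        return a ^ b

    def and_(self, a, b):
        self.count += 1
        return a & b

    def or_(self, a, b):
        self.count += 1
        return a | b


def full_adder(g, x, y, z):
    """Add three bits with 5 gates; returns (bottom bit, top bit)."""
    t = g.xor(x, y)                          # the shared x XOR y
    bottom = g.xor(t, z)
    top = g.or_(g.and_(x, y), g.and_(t, z))  # majority of x, y, z
    return bottom, top


def hamming_weight(g, bits):
    """Little-endian sum of the bits; len(bits) must be 2**l - 1."""
    n = len(bits)
    if n == 1:
        return [bits[0]]
    half = (n - 1) // 2
    a = hamming_weight(g, bits[:half])
    b = hamming_weight(g, bits[half:2 * half])
    carry = bits[-1]                         # the last bit enters as carry-in
    total = []
    for x, y in zip(a, b):                   # ripple-carry addition of a and b
        low, carry = full_adder(g, x, y, carry)
        total.append(low)
    total.append(carry)
    return total


def gate_count(n):
    """Number of gates used to add n bits, for n of the form 2**l - 1."""
    g = GateCounter()
    hamming_weight(g, [1] * n)
    return g.count


print(gate_count(3), gate_count(7), gate_count(15), gate_count(511))
# prints: 5 20 55 2510
```

For n = 511 and ℓ = 9 this matches 5(n−ℓ) = 2510.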

The 2020 paper concludes on page 13 that the "overall instruction count is 6n−4ℓ−5" where ℓ is the number of bits in n. Multiplying this instruction count by P for P xor-and-popcount iterations gives a total of (6n−4ℓ−5)·P.

The reason that the main term in the 2020 paper is 6n instead of 5n is that there's also an initial computation of "h(u) ⊕ h(v)", which the paper straightforwardly handles with n XOR "gates". There's also a small computation at the end to check whether the Hamming weight "is less than or equal to k". As for retrieving the h(v) input in the first place (remember that h(u) was cached), the 2020 paper says on page 13 that "loading h(v) has cost 1".

Suddenly the hardware designers in my audience are jumping up to object: "Wait, what? They think it costs just 1 gate to do an entire memory lookup for h(v)?"

As I said before, the set of "gates" that the Kyber documentation is talking about isn't what hardware designers expect. In particular, it includes a "gate" that carries out a memory lookup in an arbitrarily large array. The way page 4 of the 2020 paper phrases this is that the paper considers "programs for RAM machines (random access memory machines)" where the instruction set has "NOT, AND, OR, XOR, LOAD, STORE" operations and "the cost of a RAM program is the number of instructions that it performs". So "LOAD" has cost 1 by definition, just like XOR.

In the case of Kyber-512, the 2020 paper ends up selecting n = 511. (Page 23 of the paper says "we only consider values of the popcount parameter n that are one less than a power of two"; the code at the end of the paper keeps doubling until it reaches or exceeds the input dimension.) For this choice of n, the "instruction count" in the 2020 paper—the number of "gates" in the Kyber documentation—is 3025 per xor-and-popcount, or 3025·P for the total algorithm.
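As a quick arithmetic check of the 3025 figure, here is a minimal Python sketch that evaluates the quoted formula 6n−4ℓ−5, taking ℓ as the number of bits in n. The function name `instruction_count_2020` is made up for this illustration.

```python
def instruction_count_2020(n):
    """Evaluate 6n - 4l - 5, where l is the number of bits in n."""
    l = n.bit_length()
    return 6 * n - 4 * l - 5


print(instruction_count_2020(511))  # prints: 3025
```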

"Gates" for a clumped algorithm. Clumping does a better job of exploiting the 2020 paper's declaration that "LOAD" (e.g., "loading h(v)") has cost just 1.

As a small example, say we want to add 7 bits a,b,c,d,e,f,g. Instead of spending 20 "gates" on bit operations as above, we can spend just 1 "gate" on a table lookup indexed by (a,b,c,d,e,f,g). The table has 2⁷ entries, each with a 3-bit output.

Compared to what the 2020 paper achieves in its selected metric for its "primary optimisation target", suddenly there's a factor 20 disappearing!

(Did you think clumping was going to be something difficult? Why? Because the 2020 paper was at a flagship IACR conference? Because NIST claimed on page 18 of its 2022 selection report that the Kyber documentation "included a thorough and detailed security analysis"?)

Back to the algorithm analysis. Can we really get this factor 20 compared to the original 6n gates, if the 6n includes n XORs to compute h(u) ⊕ h(v) in the first place?

Yes, we can, by simply absorbing the XORs into the table lookups. Say we're starting with a,b,c,d,e,f,g and A,B,C,D,E,F,G, and we want to compute (a XOR A) + (b XOR B) + (c XOR C) + (d XOR D) + (e XOR E) + (f XOR F) + (g XOR G). A table lookup indexed by the 14 input bits costs just 1 "gate".
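To make the 7+7-bit example concrete, here is a small Python sketch of clumped xor-and-popcount. It is only an illustration under stated assumptions: `CLUMP`, `TABLE`, and `xor_popcount_clumped` are hypothetical names, vectors are represented as Python integers, and the partial sums are combined with ordinary integer addition rather than with further table lookups, so only the first layer of lookups is counted.

```python
CLUMP = 7  # bits per clump; the table has 2**(2*CLUMP) = 2**14 entries

# TABLE[(a << CLUMP) | A] is the Hamming weight of a XOR A for 7-bit a and A.
TABLE = [bin(a ^ A).count("1")
         for a in range(1 << CLUMP)
         for A in range(1 << CLUMP)]


def xor_popcount_clumped(u, v, n):
    """Return (Hamming weight of u XOR v, number of table lookups) for n-bit u, v."""
    mask = (1 << CLUMP) - 1
    weight = 0
    lookups = 0
    for shift in range(0, n, CLUMP):
        a = (u >> shift) & mask
        A = (v >> shift) & mask
        # One LOAD stands in for 7 XORs plus a 7-bit adder tree.
        weight += TABLE[(a << CLUMP) | A]
        lookups += 1
    return weight, lookups


print(xor_popcount_clumped(0b1011001, 0b0010110, 7))  # prints: (5, 1)
```

With n = 511 the first layer uses 73 lookups, compared to 511 XORs plus the adder tree in the non-clumped circuit.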

There's no reason to stop with 7+7 bits. Larger and larger tables reduce the "gate" counts more and more, as long as the tables don't become large enough for the initial table-computation time to be a bottleneck. For the cost evaluations in this blog post, I'll assume that P, the number of xor-and-popcount iterations, is large enough to make the table-computation time negligible.

For example, let's say we take a table of size 2⁶², mapping two 31-bit inputs to the 5-bit sum of bits of the xor of the inputs. This is what I'm calling "31-bit clumping". The Kyber-512 attacks in the Kyber documentation use more memory than this, and the table-computation time isn't a bottleneck.

We can use this table of size 2⁶² to handle n-bit vectors h(u) and h(v). Take 31-bit pieces of h(u) and h(v), and do one table lookup to compress each piece of h(u) and the corresponding piece of h(v) to a 5-bit sum. This uses just ceil(n/31) "gates", and reduces the amount of data to handle by a factor close to 12.4. Then apply another table of size just 2⁵⁰ that takes 10 5-bit sums and produces an 8-bit output; ceil(n/310) "gates" then compress the amount of data to handle by another factor close to 6.25. Et cetera.

Compared to the original number of "gates" (around 6·n), 31-bit clumping is saving well over a factor 100 for, e.g., the n = 511 used in the 2020 paper for Kyber-512.

Appendix D of the CryptAttackTester paper uses larger table sizes and gets the "gate" count for 511-bit xor-and-popcount down to 8, which is 378 times smaller than the 3025 "gates" from the 2020 paper. That's the almost-10-bit improvement mentioned above. (To be more precise, it's an 8.56-bit improvement for the xor-and-popcount operation. For a full analysis of the impact on the primal attack, one also needs to quantify the speedup for "inner products" et al.; this blog post focuses on xor-and-popcount.)
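The ratios in the last few paragraphs can be reproduced with a few lines of Python. This is a rough sketch. The two lookup stages follow the 31-bit clumping description above, while the single `final` lookup (combining the remaining partial sums and comparing with k) is an assumption made here to get a concrete total. The 8-"gate" figure is simply the number quoted from Appendix D.

```python
from math import ceil, log2

n = 511
gates_2020 = 6 * n - 4 * n.bit_length() - 5   # 3025, from the 2020 paper's formula

# 31-bit clumping: the first two stages as described in the text.
stage1 = ceil(n / 31)     # lookups in the 2**62-entry table
stage2 = ceil(n / 310)    # lookups in the 2**50-entry table
final = 1                 # assumption: one last lookup finishes the sum and the comparison
gates_31 = stage1 + stage2 + final

print(gates_31, gates_2020 / gates_31)        # prints: 20 151.25

gates_appendix_d = 8                          # figure quoted from Appendix D
ratio = gates_2020 / gates_appendix_d
print(ratio, round(log2(ratio), 2))           # prints: 378.125 8.56
```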

The 2020 algorithm in the NTRU Prime cost metric. Page 57 of the NTRU Prime documentation says "we estimate the cost of each access to a bit within N bits of memory as the cost of N^0.5/2⁵ bit operations".

The documentation explains how the N^0.5 comes from standard two-dimensional models of circuits, and explains how the 2⁵ denominator is derived from energy figures published by Intel. Earlier versions of the NTRU Prime documentation had simply used N^0.5 without trying to estimate this denominator.

(The documentation also mentions that three-dimensional models reduce exponent 1/2 to 1/3, but that it isn't known how to do better than 1/2 in metrics that account for the cost of energy transmission. One can spend endless time considering the impact of a long list of different metrics. What matters for evaluating NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" claim is specifically the NTRU Prime cost metric.)

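Say the h(v) values are stored in a table of T bits. Retrieving the n bits of h(v) then costs n·T^0.5/2⁵ bit operations in this metric, on top of the roughly 6n bit operations for the xor and popcount. The total cost of P xor-and-popcount iterations is then (n·T^0.5/2⁵ + 6n)·P.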

If T is 2⁷⁰ then n·T^0.5/2⁵ is 2³⁰·n, which is on a much larger scale than 6n. I'll assume that T is at least this large. The total cost is then basically (n·T^0.5/2⁵)·P.
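Here is a minimal Python sketch of this per-iteration cost, assuming n = 511 and T = 2⁷⁰ as in the text. The helper name `access_cost` is hypothetical; it just encodes the quoted rule that each bit accessed within N bits of memory costs N^0.5/2⁵ bit operations.

```python
from math import log2, sqrt


def access_cost(bits, table_bits):
    """NTRU Prime metric: accessing `bits` bits within `table_bits` bits of memory."""
    return bits * sqrt(table_bits) / 2**5


n = 511
T = 2**70

mem_per_iter = access_cost(n, T)   # n * 2**30 bit operations for retrieving h(v)
bitops_per_iter = 6 * n            # XOR plus Hamming-weight computation

print(log2(mem_per_iter / n))              # prints: 30.0
print(mem_per_iter / bitops_per_iter)      # roughly 1.8e8: memory access dominates
```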

Algorithm designers looking at these numbers will be thinking something like this: "Hmmm, given that the memory access is so expensive, what else can I do with low-cost bit operations to extract more value out of the data retrieved from memory, and ultimately to get the same job done with fewer iterations?" As my previous blog post says, "a change of cost metric also forces reoptimization of the entire stack of attack subroutines, along with all applicable parameters".

If the job is defined as P iterations of xor-and-popcount, as in this blog post, then, well, that's a job inherently dominated by memory access in any realistic cost metric once the table sizes are large. NIST's miscalculation (see below) is inflating the cost by a factor of "only" 3000 for the sizes of interest here.

But if the job is actually to break a lattice system, then xor-and-popcount is just one of many options in the literature, an option that has attracted attention because of its low "gate" count. Someone trying to optimize large-scale attacks in a realistic cost metric will consider more sophisticated inner loops with smaller iteration counts even when each iteration has "gate" counts in the millions or billions, also meaning that NIST's inflation factor is millions or billions.

Anyway, this blog post focuses on the 2020 algorithm for xor-and-popcount and a clumped version of that. This is enough to see that NIST used the wrong calculation.

The clumping algorithm in the NTRU Prime cost metric. Status so far: we have the "gate" counts for xor-and-popcount with and without clumping, and the NTRU Prime metric for xor-and-popcount without clumping. Next step: NTRU Prime metric for xor-and-popcount with clumping.

Consider, as one of the concrete examples from above, 31-bit clumping: clumping 31-bit xor-and-popcount operations into a 2⁶²-entry table. The table has 5-bit outputs, so 5·2⁶² bits overall. The cost of each access to a bit within 5·2⁶² bits is, in this metric, the cost of (5·2⁶²)^0.5/2⁵ bit operations, so retrieving 5 bits costs 5·(5·2⁶²)^0.5/2⁵ bit operations. If that table is reused for n-bit xor-and-popcount operations, then there are about n/31 of these table lookups, together costing 5·n·(5·2⁶²)^0.5/(2⁵·31) bit operations, i.e., about 2^24.53·n bit operations.

There are then further table lookups for adding the 5-bit results, but those are retrieving a smaller number of bits from smaller tables. Without doing a precise calculation, I'll estimate the total as 2^24.6·n bit operations.

This is vastly worse in this metric (and in reality) than the circuits from the 2020 paper, which take about 6n bit operations. On the other hand, to see the effect of this slowdown on the total attack cost, one also has to consider the cost of retrieving h(v), namely n·T^0.5/2⁵ in this metric. I assumed T was at least 2⁷⁰, so 2^24.6·n bit operations add at most a few percent to n·T^0.5/2⁵.
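The exponent 24.53 and the "few percent" figure can be rechecked numerically. This is a short Python sketch that uses only numbers already stated: the 5·2⁶²-bit table, 5 bits per lookup, about n/31 lookups, the rough 2^24.6 total, and T = 2⁷⁰. The variable names are made up for the illustration.

```python
from math import log2, sqrt

table_bits = 5 * 2**62                    # 2**62 entries, 5 bits each
cost_per_bit = sqrt(table_bits) / 2**5    # NTRU Prime metric, per bit accessed
cost_per_lookup = 5 * cost_per_bit        # each lookup retrieves 5 bits
per_n = cost_per_lookup / 31              # about n/31 lookups, so this multiplies n

print(round(log2(per_n), 2))              # prints: 24.53

# Overhead of the rough 2**24.6 * n total relative to retrieving h(v) at T = 2**70,
# which costs 2**30 * n in the same metric.
overhead = 2**24.6 / 2**30
print(overhead)                           # between 0.02 and 0.03
```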

More generally, if you pick any particular size and structure for the clumping tables, then you'll be able to compute numbers C and R so that the clumping reduces "gate" counts from about 6n to about 6n/R, while costing about C·n bit operations in the NTRU Prime cost metric. As above, this clumping has no effect on the cost of retrieving h(v).

Clumping thus reduces "gate" counts but doesn't reduce costs in the NTRU Prime metric. Specifically, small clumping (such as 31-bit clumping) produces some reduction in "gate" counts and a marginal overall slowdown in the NTRU Prime cost metric; large clumping produces a larger reduction in "gate" counts and a drastic slowdown in the NTRU Prime cost metric.

The 2020 algorithm according to NIST's calculation. Let's now calculate what NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" says about P iterations of the 2020 xor-and-popcount algorithm.

First question: what's the "RAM model"? NIST's posting says that this doesn't account for memory access ("The RAM model ignores the cost of this memory access"). NIST's posting plugs in "gate" counts from a paper saying that it's improving on the "gate" counts from the Kyber submission.

Readers will interpret this as referring to the same metric as "gates" in the Kyber submission, "instruction count" for "RAM machines" in the 2020 paper, etc. I'm going to take the same interpretation: P iterations of xor-and-popcount cost about 6n·P in "the RAM model".

As for the 2⁴⁰ "real cost of memory access", one of the ways that NIST's posting interferes with review is by not stating the formula that this 2⁴⁰ is supposed to be an example of. But NIST claimed that it was simply quoting this as the "real cost of memory access" from the NTRU Prime documentation, so let's look at what the NTRU Prime documentation actually says about that cost:

Any of these can explain NIST claiming that the NTRU Prime documentation obtained approximately 2⁴⁰ as "the real cost of memory access". Each explanation starts with the NTRU Prime cost metric, which says that accessing b bits in a T-bit table costs b·T^0.5/2⁵. Each explanation then plugs in an estimate for (T,b) for each iteration. The explanations vary in precision, for example with some of the formulas suppressing the 2⁵ denominator, but ultimately all of the explanations come from the same metric.

For P iterations of xor-and-popcount with the 2020 algorithm, this b·T^0.5/2⁵ memory-access cost per iteration is exactly the n·T^0.5/2⁵ from earlier in this blog post. NIST's "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is this n·T^0.5/2⁵. NIST's "40 bits of security more than would be suggested by the RAM model" is telling us to multiply n·T^0.5/2⁵ by about 6n·P, obtaining about (n·T^0.5/2⁵)·6n·P.

For comparison, remember that this algorithm costs only about (n·T^0.5/2⁵)·P in the NTRU Prime cost metric. NIST's calculation is inflating this by a factor 6n, the number of "gates" per iteration. This inflation comes directly from the incorrect structure built into NIST's calculation, namely taking an estimate for the "real cost of memory access" and then multiplying that by a total "gate" count.

Clumping according to NIST's calculation. One more calculation remains: let's apply NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" to P iterations of the clumped xor-and-popcount algorithm.

As above, I'm going to interpret the "RAM model" as referring to the same "gates" metric as in the Kyber submission, so P iterations of clumped xor-and-popcount cost only (6n/R)·P in "the RAM model".

Appendix D of the CryptAttackTester paper points out that there's an incompatible description of "the RAM model" on page 81 of the 2022 NIST selection report (which doesn't label this description as incompatible!). Appendix D shows that clumping reduces "gates" with that description too. The exact size of the reduction factor R doesn't matter for the calculations below.

As also covered above, the "40 bits" that NIST says it obtains from the NTRU Prime documentation as the "real cost of memory access" is ultimately coming from the NTRU Prime metric, which is n·T^0.5/2⁵ + C·n for each iteration of this algorithm.

NIST's multiplication of the "real cost of memory access" by cost in "the RAM model" then produces (n·T^0.5/2⁵ + C·n)·(6n/R)·P. This is again inflated compared to the NTRU Prime cost metric, but this time by a factor only 6n/R instead of 6n.

In particular, for 31-bit clumping, recall that R is well over 100, while the C·n term is negligible. NIST's calculation then says that 31-bit clumping reduces costs by a factor over 100. This reduction is not plausible as a statement about reality; more to the point, it is not what the NTRU Prime cost metric says.
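The contrast between the two calculations can be put side by side numerically. The following Python sketch rests on assumptions: n = 511 and T = 2⁷⁰ as above, C = 2^24.6 from the rough estimate above, and R = 150 as a purely illustrative value for "well over 100". The factor P is the same in every formula and cancels in the ratios, so it is left out.

```python
from math import sqrt

n = 511
T = 2**70
C = 2**24.6      # rough per-n cost of the 31-bit clumping tables, from the text
R = 150          # assumption: illustrative "gate" reduction factor, "well over 100"

mem = n * sqrt(T) / 2**5          # cost of retrieving h(v), per iteration

# NTRU Prime cost metric: memory access and bit operations are added.
metric_2020 = mem + 6 * n
metric_clumped = mem + C * n

# NIST's calculation: memory-access cost multiplied by the "gate" count.
nist_2020 = mem * (6 * n)
nist_clumped = (mem + C * n) * (6 * n / R)

print(nist_2020 / metric_2020)        # close to 6n = 3066
print(nist_clumped / metric_clumped)  # close to 6n/R = 20.44
print(nist_2020 / nist_clumped)       # over 100: the apparent "speedup" from clumping
print(metric_2020 / metric_clumped)   # slightly below 1: a small slowdown in the metric
```

Changing R changes the apparent speedup in NIST's calculation almost proportionally, while the NTRU Prime metric only sees the C·n slowdown.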

Summary. The NTRU Prime documentation said "we estimate the cost of each access to a bit within N bits of memory as the cost of N^0.5/2⁵ bit operations", and gave a rough estimate of 2⁴⁰ for the memory-access cost per iteration for sntrup653.

For each iteration of the 2020 xor-and-popcount algorithm with n-bit vectors and a T-bit table, the NTRU Prime cost metric says that memory access costs n·T^0.5/2⁵. This is added to the 6n bit operations for XOR and Hamming-weight computation. Clumping reduces those bit operations but increases memory-access costs and total costs.

NIST's "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is looking at the memory-access cost in each iteration. NIST's "40 bits of security more than would be suggested by the RAM model" is incorrectly multiplying the memory-access cost per iteration by bit operations.

NIST's calculation ends up inserting a fake factor 6n into the costs for the 2020 algorithm. It also inserts a fake factor into the costs for clumping, but a smaller factor than 6n because clumping uses fewer "gates". For more sophisticated attack algorithms that use millions or billions of "gates" to process the results of memory access in each iteration, NIST's calculation inflates costs by a factor of millions or billions.

Weaponizing ambiguity. As noted above, NIST has sabotaged review of its calculation in various ways, in particular by systematically dodging my clarification questions. This is an assault against falsifiability. Here I am doing all this work to debunk what NIST's words communicate to readers; but people might claim that, no, no, NIST actually meant something else instead.

I sometimes notice ambiguities in security claims and ask the authors what they meant. The authors clarify. What happened with NIST was different: NIST issued security claims, but then dodged clarification requests regarding those security claims.

Procedurally, if a security reviewer identifies what a security claim seems to be saying, asks for confirmation, and doesn't receive clarification one way or the other, then at some point the reviewer has to be able to say, okay, I'm going to review the simplest interpretation, and the source of the claim has lost its chance to retroactively "clarify" that it meant something else.

But maybe some readers aren't satisfied with a procedural answer. So let's focus on the content, and look at how hard it would be to build a coherent story around the claim that NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" is matching what the NTRU Prime documentation actually said.

There are two clear constraints on the details of the story. Specifically, look at what's needed for NIST's "40 bits of security more than would be suggested by the RAM model", with 2⁴⁰ arising as the "real cost of memory access", to match the NTRU Prime cost metric for the two xor-and-popcount algorithms considered above:

And then, to justify NIST saying that this was "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission", one would need to find quotes from the NTRU Prime documentation showing that the documentation was

Every step of this fiction is contrary to something the NTRU Prime documentation does say:

But there's something even more bizarre about the claim that NIST's calculation matches the NTRU Prime cost metric. Here it comes.

What, according to this claim, is the complete calculation expressed by NIST's words "40 bits of security more than would be suggested by the RAM model"?

We've seen that this claim forces NIST's 2⁴⁰ "real cost of memory access" to actually mean the ratio between the cost of the memory access per iteration and the "gates" per iteration. This means that the complete calculation is as follows:

In formulas: if the "real cost of memory access" isn't supposed to be memcost/iter, but rather the ratio (memcost/iter)/(bitopscost/iter), then NIST's "40 bits of security more than would be suggested by the RAM model" isn't supposed to be computing (memcost/iter) · bitopscost, but rather ((memcost/iter)/(bitopscost/iter)) · bitopscost.

Why would anyone want to carry out the computation this way, instead of simply multiplying memcost/iter by iter? Why bother computing "gates" and dividing something else by "gates" and then multiplying in a way to cancel out the "gates"? This makes no sense.
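Spelling out the cancellation, with the same names as in the formulas above and with bitopscost written as (bitopscost/iter) · iter:

```latex
\[
\frac{\mathrm{memcost}/\mathrm{iter}}{\mathrm{bitopscost}/\mathrm{iter}} \cdot \mathrm{bitopscost}
= \frac{\mathrm{memcost}/\mathrm{iter}}{\mathrm{bitopscost}/\mathrm{iter}}
  \cdot \frac{\mathrm{bitopscost}}{\mathrm{iter}} \cdot \mathrm{iter}
= \frac{\mathrm{memcost}}{\mathrm{iter}} \cdot \mathrm{iter}.
\]
```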

It's not as if the ratio in this story, (memcost/iter)/(bitopscost/iter), has some independent meaning as an important algorithmic constant that could be deduced without looking at bitopscost/iter in the first place. Memory access and "gates" are separate variables. Sure, some algorithm changes have the same effect on memory access and "gates", but some algorithm changes affect just one or the other, or provide tradeoffs between the two. For example, clumping reduces "gates", but doesn't reduce memory access, and doesn't reduce cost in the NTRU Prime cost metric.

In the latter metric (and any other realistic model of the cost of memory access), the bit operations for computations inside xor-and-popcount are obviously dwarfed by the cost of memory access as soon as T is reasonably large. These costs add; they don't multiply. So one doesn't end up seeing the "gates" in the final cost tallies.
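A quick way to see the dwarfing, sketched in Python for the 2020 algorithm's 6n bit operations per iteration: the ratio of the memory-access term to the bit-operation term is T^0.5/192, with n cancelling. The function name and the table sizes below are hypothetical, picked only for illustration.

```python
import math

def memory_to_bitops_ratio(T):
    """Ratio of memory-access cost to bit operations in one iteration.

    (n * sqrt(T) / 2**5) / (6 * n) simplifies to sqrt(T) / 192: n cancels.
    """
    return math.sqrt(T) / (2**5 * 6)

for log2_T in (20, 40, 60, 80):  # hypothetical table sizes
    ratio = memory_to_bitops_ratio(2**log2_T)
    share = 1 / (1 + ratio)  # fraction of the per-iteration total due to bit operations
    print(f"T = 2^{log2_T}: ratio {ratio:.3g}, bit-operation share {share:.3g}")
```

Since the two terms are added, the bit-operation share of the total shrinks toward zero as T grows, which is why the "gates" drop out of the final tally.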

The only way to have "gates" correctly disappear from the product is to redefine the "real cost of memory access" to have "gates" as a denominator; but then where is the 40 supposed to be coming from? This redefinition turns "approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" into a free-floating fantasy, not just divorced from what any reader could reasonably extract from NIST's words but also divorced from what the NTRU Prime documentation actually said.

To summarize, the ambiguities in NIST's "40 bits of security more than would be suggested by the RAM model ... approximately 40 bits of additional security quoted as the 'real cost of memory access' by the NTRUprime submission" don't lead to any universe in which NIST got this right. These ambiguities simply waste time for people checking NIST's work. In this case, NIST's work is a botched security-level calculation leading directly to NIST's selection of Kyber-512 for standardization.
