cr.yp.to: 2014.05.17: Some small suggestions for the Intel instruction set

2014.05.17: Some small suggestions for the Intel instruction set: Low-cost changes to CPU architecture would make cryptography much safer and much faster. #constanttimecommitment #vmul53 #vcarry #pipelinedocumentation

Programmers trying to make crypto run fast often say things like "Why can't the CPU designer just add a 128-bit multiplication instruction?" Sometimes these questions turn into academic papers analyzing the cycle counts that would be obtained from various instruction-set extensions. What's missing from most of these questions and papers is the CPU designer's perspective: the new instructions cost chip area, and are competing with many other suggestions for productive ways to use the same chip area.

Intel has been willing to spend small amounts of chip area on instruction-set extensions that provide a sufficiently large benefit to cryptography: consider MULX, ADCX, ADOX, PCLMULQDQ, and the AES instructions. I have a few small suggestions for further tweaks that would cost very little chip area and that would have big benefits for crypto. Similar suggestions apply to other chip manufacturers.

Promise input protection. Hey, Intel, remember all the circuitry that you've devoted to making sure that unprivileged processes can't read secret cryptographic keys out of OS kernel memory and other users' processes? Do you realize that the same secrets are being exposed through cache-timing attacks, branch-timing attacks, etc.? Do you enjoy seeing quotes such as "The researchers suggest that Bitcoin users ... refrain from using a computer equipped with an Intel processor" in a March 2014 news report with a title of "Cryptology Attack Shows Bitcoin and OpenSSL Weakness"?

Apparently you do realize that there's a problem here: your AES-NI white paper discusses timing attacks in detail and says "The AES instructions are designed to mitigate all of the known timing and cache side channel leakage of sensitive data (from Ring 3 spy processes)." But this didn't do anything to stop the Bitcoin attack or the Lucky Thirteen timing attack against TLS. There's much more to cryptography than just AES, and you're not going to embed all that cryptography into your chips in the foreseeable future.

I'm a coauthor of a new cryptographic library, NaCl, that systematically avoids leaking secret data into load addresses, store addresses, and branch conditions. But this library is relying critically on observations of CPU behavior. We would much rather rely on promises made by the CPU manufacturer.

This brings me to my first suggestion: Please specify that various instructions keep various inputs secret, avoiding all data flow from those inputs to instruction timing, cache addresses, the branch unit, etc. Right now this isn't part of the specified architecture; it's a microarchitectural decision that could change in the next CPU.

For example, CMOV seems to keep its condition bit secret on various Intel CPUs today. Are you willing to commit to this for CMOV from a register? What about CMOV from memory? Or do you want to keep open the option of having CMOV from memory look at its condition bit and skip the load? What about multiplications: are you willing to commit to multiplication taking constant time, or do you want the option of PowerPC-style early aborts for multiplications by integers with many top 0 bits? Surely you're at least willing to promise secrecy for vector operations within registers, and for vector loads and stores; we can build safe cryptographic software from those operations even if you aren't willing to commit to anything else.

This tweak to the documented instruction set doesn't cost any chip area. The worst case is that today you promise secrecy for, e.g., the MUL inputs, and then realize in several years that you really want to make a faster variable-time multiplier; the commitment would then force you to add a new MULABORT instruction rather than violating the secrecy of the MUL instruction. Of course, if you don't make any commitments, then programmers have no choice but to rely on observations of CPU behavior; many of those programmers will assume constant-time MUL, and if you switch to variable-time MUL then you will be breaking cryptographic security.
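As a rough illustration of the branch-free style that such promises would protect, here is a sketch of a mask-based select and a table lookup whose load addresses do not depend on the secret index. The function names are hypothetical and are not taken from NaCl. Python itself makes no timing guarantees, so the sketch shows only the data flow that C or assembly code would follow.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def ct_select(bit, x, y):
    """Return x if bit == 1, else y, without branching on bit.

    Hypothetical helper: models the data flow only; Python gives no
    timing guarantees.
    """
    mask = (-bit) & MASK64  # bit=1 -> all ones, bit=0 -> all zeros
    return (x & mask) | (y & ~mask & MASK64)


def ct_lookup(table, secret_index):
    """Read table[secret_index] while touching every entry in order,
    so the sequence of loads is independent of secret_index."""
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ secret_index
        # is_equal is 1 exactly when diff == 0, computed without a branch
        is_equal = 1 ^ (((diff | -diff) & MASK64) >> 63)
        result = ct_select(is_equal, entry, result)
    return result
```

Code written this way is safe only if the underlying AND, OR, shift, and load instructions really do keep their inputs out of the timing, which is exactly the commitment being requested.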

Expose the 53-bit multiplier. Chitchanok Chuengsatiansup, Tanja Lange, Peter Schwabe, and I have set speed records for high-security public-key cryptography on Sandy Bridge etc. using Intel's impressive vectorized floating-point units. What's crazy about this is that we're limited to multiplying integers of about 25 bits, so roughly 3/4 of the multiplier area goes to waste. Why? Because the multiplier insists on rounding its low output bits rather than giving them back to the programmer. Why? Because the multiplier is buried inside the floating-point unit.
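The rounding is easy to observe in double precision, which carries a 53-bit significand. The following sketch is an illustration only, not code from the record-setting software. It shows a product of two 26-bit integers surviving exactly and a product of two 28-bit integers losing its low bit. Presumably the practical limit of about 25 bits leaves headroom for adding several products before the sum exceeds 53 bits; that reading is an assumption.

```python
def float_product_is_exact(a, b):
    """Multiply two integers through double-precision floats and report
    whether the result equals the exact integer product."""
    return int(float(a) * float(b)) == a * b


small = (1 << 26) - 1  # 26-bit operand: the product fits in 52 bits
large = (1 << 27) + 1  # 28-bit operand: the product needs 55 bits

print(float_product_is_exact(small, small))  # True
print(float_product_is_exact(large, large))  # False
```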

So please include a vectorized 53-bit integer multiplier. Here's a reasonable design: VMUL53 r0,r1,r2,r3 computes the product of r0 and r1, each between −(2⁵³−1) and 2⁵³−1, and returns the product as r2+2⁵³*r3. Each of r0,r1,r2,r3 is stored as a signed 64-bit integer. Of course, the 128-bit vector instruction would compute 2 products in parallel, and the 256-bit vector instruction would compute 4 products in parallel. A 53-bit multiplier producing a full 106-bit product won't be much larger than a 53-bit multiplier producing a correctly rounded 53-bit floating-point result.

I realize that this is likely to need two execution uops to handle the data flow (first write r2, then write r3). Surely each of your uops will still have enough inputs and outputs to handle multiply-accumulate, so with two uops you should be able to handle an accumulating version VMAD53, adding separately to each of r2 and r3.
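A reference model can pin down what the proposed instructions would compute. The sketch below models 64-bit lanes in Python. The description above fixes only the relation product = r2+2⁵³*r3, so the particular split chosen here (r2 between 0 and 2⁵³−1, r3 the floor of the quotient) is an assumption, as are the function names.

```python
LIMIT = (1 << 53) - 1  # operands must lie in [-(2^53 - 1), 2^53 - 1]


def vmul53_lane(r0, r1):
    """Model one lane of the proposed VMUL53: return (r2, r3) with
    r0 * r1 == r2 + 2**53 * r3.

    Assumed split: 0 <= r2 < 2**53 and r3 = floor(product / 2**53).
    """
    if abs(r0) > LIMIT or abs(r1) > LIMIT:
        raise ValueError("operand outside the 53-bit range")
    product = r0 * r1
    r2 = product & LIMIT  # low 53 bits; non-negative for Python ints
    r3 = product >> 53    # arithmetic shift, i.e. floor division
    return r2, r3


def vmad53_lane(r0, r1, r2, r3):
    """Model one lane of the accumulating variant VMAD53: add the two
    halves of the product separately to r2 and r3. 64-bit wraparound
    is ignored in this model."""
    lo, hi = vmul53_lane(r0, r1)
    return r2 + lo, r3 + hi


def vmul53(xs, ys):
    """Apply the lane model across a vector: 2 lanes for the 128-bit
    form, 4 lanes for the 256-bit form."""
    return [vmul53_lane(x, y) for x, y in zip(xs, ys)]
```

With this split, r3 always fits in a signed 64-bit lane, because the product has magnitude below 2¹⁰⁶.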

Allow fast carries. Please include a vectorized instruction VCARRY imm,r0,r1 that reads registers r0,r1, computes r0-2^imm*floor(r0/2^imm+1/2) and r1+floor(r0/2^imm+1/2), and writes r0,r1. Most important is VCARRYQ working on signed 64-bit lanes; second is VCARRYD working on signed 32-bit lanes.
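The VCARRY formula can be written as a small reference model. This is a sketch for one signed lane; it ignores lane wraparound and assumes imm is at least 1. The carry_chain helper and the radix used in the test values are hypothetical, added only to show how carries would propagate across limbs.

```python
def vcarry_lane(imm, r0, r1):
    """Model one signed lane of the proposed VCARRY imm,r0,r1.

    carry = floor(r0 / 2**imm + 1/2)
    returns (r0 - 2**imm * carry, r1 + carry)
    """
    half = 1 << (imm - 1)
    carry = (r0 + half) >> imm  # floor((r0 + 2^(imm-1)) / 2^imm)
    return r0 - (carry << imm), r1 + carry


def carry_chain(limbs, imm):
    """Hypothetical helper: carry sequentially from each limb into the
    next, leaving every limb except the last in [-2^(imm-1), 2^(imm-1))."""
    limbs = list(limbs)
    for i in range(len(limbs) - 1):
        limbs[i], limbs[i + 1] = vcarry_lane(imm, limbs[i], limbs[i + 1])
    return limbs


def value(limbs, imm):
    """Integer represented by the limbs in radix 2^imm."""
    return sum(limb << (imm * i) for i, limb in enumerate(limbs))
```

The new r0 always lands in the balanced range from -2^(imm-1) up to but not including 2^(imm-1), and r0+2^imm*r1 is unchanged by the operation.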

Most of the hardware cost of this instruction is the cost of the barrel shifter that you already have for VPSRLQ and VPSRLD. With slight extra effort you can also support VUCARRY, an unsigned version that uses floor(...) instead of floor(...+1/2), but if I had to pick just one I'd pick VCARRY rather than VUCARRY.

Right now simulating VUCARRY takes three instructions (shift, add, mask) using an extra register and an extra constant (the mask); simulating VCARRY is even more expensive. What we do today (again, to take advantage of the huge chip area devoted to floating-point arithmetic) is simulate a floating-point version of VCARRY, but with VMUL53 this won't be necessary.
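The three-instruction simulation can be checked against the unsigned definition with a short model. This is a sketch on one unsigned 64-bit lane; the function names are hypothetical, and the pairing with VPSRLQ, VPADDQ, and VPAND in the comments is one plausible choice of existing instructions, not something stated above.

```python
MASK64 = (1 << 64) - 1  # unsigned 64-bit lane


def vucarry_lane(imm, r0, r1):
    """Reference semantics assumed for the unsigned VUCARRY:
    carry = floor(r0 / 2**imm)."""
    carry = r0 >> imm
    return r0 - (carry << imm), (r1 + carry) & MASK64


def vucarry_three_ops(imm, r0, r1):
    """The same result from the shift, add, mask sequence."""
    mask = (1 << imm) - 1     # the extra constant
    tmp = r0 >> imm           # shift into an extra register (VPSRLQ-style)
    r1 = (r1 + tmp) & MASK64  # add, wrapping at 64 bits (VPADDQ-style)
    r0 = r0 & mask            # mask (VPAND-style)
    return r0, r1
```

The signed VCARRY needs more work than this because the rounding offset has to be added before the shift and the shifted carry has to be scaled back up and subtracted from r0.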

Document the damn pipelines. These days I spend more time optimizing code for ARM chips than for Intel chips. The real reason for this isn't any sort of assessment of current or future importance. The real reason is that ARM publishes the detailed pipeline documentation that I need, so squeezing out every last cycle for ARM is fun, while Intel hides the pipeline documentation, so squeezing out every last cycle for Intel is painful. I have no idea what Intel thinks it's accomplishing by hiding this information; what Intel is actually accomplishing is giving its competitors a performance boost.

Please document your pipelines properly. Okay, okay, I admit that this isn't a change to the instruction set, but it's similar to my other suggestions in that it's something you could do to drastically improve crypto performance for your chips without a serious sacrifice in chip area.
