Programmers trying to make crypto run fast often say things like "Why can't the CPU designer just add a 128-bit multiplication instruction?" Sometimes these questions turn into academic papers analyzing the cycle counts that would be obtained from various instruction-set extensions. What's missing from most of these questions and papers is the CPU designer's perspective: the new instructions cost chip area, and are competing with many other suggestions for productive ways to use the same chip area.
Intel has been willing to spend small amounts of chip area on instruction-set extensions that provide a sufficiently large benefit to cryptography: consider MULX, ADCX, ADOX, PCLMULQDQ, and the AES instructions. I have a few small suggestions for further tweaks that would cost very little chip area and that would have big benefits for crypto. Similar suggestions apply to other chip manufacturers.
Promise input protection. Hey, Intel, remember all the circuitry that you've devoted to making sure that unprivileged processes can't read secret cryptographic keys out of OS kernel memory and other users' processes? Do you realize that the same secrets are being exposed through cache-timing attacks, branch-timing attacks, etc.? Do you enjoy seeing quotes such as "The researchers suggest that Bitcoin users ... refrain from using a computer equipped with an Intel processor" in a March 2014 news report with a title of "Cryptology Attack Shows Bitcoin and OpenSSL Weakness"?
Apparently you do realize that there's a problem here: your AES-NI white paper discusses timing attacks in detail and says "The AES instructions are designed to mitigate all of the known timing and cache side channel leakage of sensitive data (from Ring 3 spy processes)." But this didn't do anything to stop the Bitcoin attack or the Lucky Thirteen timing attack against TLS. There's much more to cryptography than just AES, and you're not going to embed all that cryptography into your chips in the foreseeable future.
I'm a coauthor of a new cryptographic library, NaCl, that systematically avoids leaking secret data into load addresses, store addresses, and branch conditions. But this library is relying critically on observations of CPU behavior. We would much rather rely on promises made by the CPU manufacturer.
This brings me to my first suggestion: Please specify that various instructions keep various inputs secret, avoiding all data flow from those inputs to instruction timing, cache addresses, the branch unit, etc. Right now this isn't part of the specified architecture; it's a microarchitectural decision that could change in the next CPU.
For example, CMOV seems to keep its condition bit secret on various Intel CPUs today. Are you willing to commit to this for CMOV from a register? What about CMOV from memory? Or do you want to keep open the option of having CMOV from memory look at its condition bit and skip the load? What about multiplications: are you willing to commit to multiplication taking constant time, or do you want the option of PowerPC-style early aborts for multiplications by integers with many top 0 bits? Surely you're at least willing to promise secrecy for vector operations within registers, and for vector loads and stores; we can build safe cryptographic software from those operations even if you aren't willing to commit to anything else.
This tweak to the documented instruction set doesn't cost any chip area. The worst case is that today you promise secrecy for, e.g., the MUL inputs, and then realize in several years that you really want to make a faster variable-time multiplier; the commitment would then force you to add a new MULABORT instruction rather than violating the secrecy of the MUL instruction. Of course, if you don't make any commitments, then programmers have no choice but to rely on observations of CPU behavior; many of those programmers will assume constant-time MUL, and if you switch to variable-time MUL then you will be breaking cryptographic security.
Expose the 53-bit multiplier. Chitchanok Chuengsatiansup, Tanja Lange, Peter Schwabe, and I have set speed records for high-security public-key cryptography on Sandy Bridge etc. using Intel's impressive vectorized floating-point units. What's crazy about this is that we're limited to multiplying integers of about 25 bits, putting roughly 3/4 of the multiplier area to waste. Why? Because the multiplier insists on rounding its low output bits rather than giving them back to the programmer. Why? Because the multiplier is buried inside the floating-point unit.
So please include a vectorized 53-bit integer multiplier. Here's a reasonable design: VMUL53 r0,r1,r2,r3 computes the product of r0 and r1, each between −(253−1) and 253−1, and returns the product as r2+253*r3. Each of r0,r1,r2,r3 is stored as a signed 64-bit integer. Of course, the 128-bit vector instruction would compute 2 products in parallel, and the 256-bit vector instruction would compute 4 products in parallel. A 53-bit multiplier producing a full 106-bit product won't be much larger than a 53-bit multiplier producing a correctly rounded 53-bit floating-point result.
I realize that this is likely to need two execution uops to handle the data flow (first write r2, then write r3). Surely each of your uops will still have enough inputs and outputs to handle multiply-accumulate, so with two uops you should be able to handle an accumulating version VMAD53, adding separately to each of r2 and r3.
Allow fast carries. Please include a vectorized instruction VCARRY imm,r0,r1 that reads registers r0,r1, computes r0-2immfloor(r0/2imm+1/2),r1+floor(r0/2imm+1/2), and writes r0,r1. Most important is VCARRYQ working on signed 64-bit lanes; second is VCARRYD working on signed 32-bit lanes.
Most of the hardware cost of this instruction is the cost of the barrel shifter that you already have for VPSRLQ and VPSRLD. With slight extra effort you can also support VUCARRY, an unsigned version that uses floor(...) instead of floor(...+1/2), but if I had to pick just one I'd pick VCARRY rather than VUCARRY.
Right now simulating VUCARRY takes three instructions (shift, add, mask) using an extra register and an extra constant (the mask); simulating VCARRY is even more expensive. What we do today (again, to take advantage of the huge chip area devoted to floating-point arithmetic) is simulate a floating-point version of VCARRY, but with VMUL53 this won't be necessary.
Document the damn pipelines. These days I spend more time optimizing code for ARM chips than for Intel chips. The real reason for this isn't any sort of assessment of current or future importance. The real reason is that ARM publishes the detailed pipeline documentation that I need, so squeezing out every last cycle for ARM is fun, while Intel hides the pipeline documentation, so squeezing out every last cycle for Intel is painful. I have no idea what Intel thinks it's accomplishing by hiding this information; what Intel is actually accomplishing is giving its competitors a performance boost.
Please document your pipelines properly. Okay, okay, I admit that this isn't a change to the instruction set, but it's similar to my other suggestions in that it's something you could do to drastically improve crypto performance for your chips without a serious sacrifice in chip area.
[2022.01.09 update: Updated links above.]