cr.yp.to: 2019.04.30: An introduction to vectorization

2019.04.30: An introduction to vectorization: Understanding one of the most important changes in the high-speed-software ecosystem. #vectorization #sse #avx #avx512 #antivectors

Your CPU has a 32-bit addition instruction that computes the sum of two 32-bit integers x and y, producing a 32-bit result. (The 33rd bit of the sum, the "carry bit", is usually thrown away.)

A 2-way-vectorized 32-bit addition instruction computes the sum of two vectors (x₀,x₁) and (y₀,y₁), where x₀,x₁,y₀,y₁ are 32-bit integers. This means that the CPU computes the sum of x₀ and y₀, and also computes the sum of x₁ and y₁.

A 4-way-vectorized 32-bit addition instruction computes the sum of two 32-bit integers x₀ and y₀; the sum of two 32-bit integers x₁ and y₁; the sum of two 32-bit integers x₂ and y₂; and the sum of two 32-bit integers x₃ and y₃.
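Here is a minimal sketch of what these instructions compute. It is a Python model of the arithmetic only, not of how a CPU carries it out, and the names `add32` and `vector_add32` are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF  # keep the low 32 bits; the carry bit is thrown away


def add32(x, y):
    """Model of a non-vectorized 32-bit addition."""
    return (x + y) & MASK32


def vector_add32(xs, ys):
    """Model of an n-way-vectorized 32-bit addition: one add32 per lane."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same number of lanes")
    return tuple(add32(x, y) for x, y in zip(xs, ys))


# 4-way example: the last lane overflows and wraps around to 0,
# without disturbing any other lane.
x = (1, 2, 3, 0xFFFFFFFF)
y = (10, 20, 30, 1)
print(vector_add32(x, y))  # (11, 22, 33, 0)
```

The point of the model is that lanes are independent: a carry out of one lane is discarded rather than propagated into its neighbor.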

Why should a CPU designer bother providing a vectorized addition instruction? Why doesn't the programmer simply use one 32-bit addition instruction to add x₀ and y₀, another 32-bit addition instruction to add x₁ and y₁, et cetera? To understand the answer, let's take a closer look at how the CPU handles a non-vectorized 32-bit addition instruction:

These five stages involve tremendous overhead beyond the addition. Typically there are even more stages: for example, an early stage that inserts the instruction into an array of instructions ready to be executed, and a late stage that "retires" the completed instruction, removing it from this array. It's not surprising for an Intel CPU to have 15 or more stages overall. (Readers interested in learning more about the Intel pipeline should study Agner Fog's optimization manuals.)

Students in algorithms courses are usually trained to count arithmetic operations and to ignore the cost of memory access. Does the overhead of handling an instruction really matter compared to the cost of arithmetic? To see that the answer is yes, let's scale up to a much more expensive arithmetic instruction, namely a 64-bit floating-point multiplication. An Intel presentation in 2015 reported that a 64-bit floating-point multiplication costs 6.4 picojoules (at 22nm, scaling "well with process and voltage"), that reading 64 bits from a register file costs 1.2 picojoules (scaling "less well with voltage"), that reading 64 bits from a small (8KB) cache costs 4.2 picojoules, that reading 64 bits from a large (256KB) cache costs 16.7 picojoules, and that moving 64 bits through a wire 5 millimeters long costs 11.20 picojoules ("more difficult to scale down").
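To put these figures side by side, here is a small sketch that expresses each data-movement cost as a multiple of one 64-bit floating-point multiplication. The numbers are the ones quoted above; the constant names and the `relative_cost` helper are made up for this illustration.

```python
# Energy figures quoted above, in picojoules per 64-bit operation.
MULTIPLY = 6.4
READ_REGISTER = 1.2
READ_SMALL_CACHE = 4.2   # 8KB cache
READ_LARGE_CACHE = 16.7  # 256KB cache
WIRE_5MM = 11.2


def relative_cost(cost):
    """Cost expressed in units of one 64-bit floating-point multiplication."""
    return cost / MULTIPLY


for name, cost in [
    ("register read", READ_REGISTER),
    ("small-cache read", READ_SMALL_CACHE),
    ("large-cache read", READ_LARGE_CACHE),
    ("5mm wire", WIRE_5MM),
]:
    print(f"{name}: {relative_cost(cost):.2f} multiplications")
```

With these figures, a single read from the large cache costs about 2.6 multiplications, and moving the result 5 millimeters costs 1.75 multiplications.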

Now let's look at how the CPU handles a 4-way-vectorized 32-bit addition instruction:

The extra efficiency of the vectorized instruction is already clear at the first stage. Fetching a vectorized addition instruction might be slightly more expensive than fetching a non-vectorized addition instruction if the vectorized instruction has a longer encoding, but it certainly isn't 4 times as expensive. More broadly, 4-way vectorization means that the overhead of handling each instruction (and, to some extent, the overhead of handling each input and output) is amortized across 4 arithmetic operations. Moving from 4-way vectorization to 8-way vectorization chops the overhead in half again.
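The amortization argument can be written down as a toy cost model. This is a sketch under a strong simplifying assumption, namely that the cost of handling one instruction does not grow with the vector width. The cost units and the function name are invented for illustration and are not measurements of any CPU.

```python
def cost_per_operation(overhead, arithmetic, ways):
    """Per-operation cost when the overhead of handling one instruction
    is shared by `ways` arithmetic operations.

    overhead   -- hypothetical cost of handling one instruction
    arithmetic -- hypothetical cost of one 32-bit addition
    ways       -- number of lanes (1 means non-vectorized)
    """
    if ways < 1:
        raise ValueError("ways must be at least 1")
    return overhead / ways + arithmetic


# Made-up units: handling an instruction costs 20, one addition costs 1.
for ways in (1, 4, 8):
    print(ways, cost_per_operation(20.0, 1.0, ways))
```

In this model the overhead share per addition drops from 20 to 5 with 4-way vectorization, and from 5 to 2.5 with 8-way vectorization, while the arithmetic cost per addition stays the same.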

Computers are constantly applying the same computations to one input after another, continuing through large volumes of data. Handling these computations with vectorized instructions, rather than with non-vectorized instructions, increases the amount of useful work that the CPU can do in any particular amount of time under any particular limits on energy, power, temperature, etc. Commonly used optimizing compilers such as gcc automatically try to figure out how they can use vectorized instructions instead of non-vectorized instructions. Often the compiler's automatic vectorization doesn't succeed, so programmers manually vectorize critical inner loops in a broad range of performance-sensitive applications.

Wikipedia's AVX page mentions Blender (movie creation), dav1d (AV1 movie player used inside Firefox), TensorFlow (machine learning), and various other popular applications that use Intel's instructions to handle 256-bit vectors. Intel's 128-bit instructions, "SSE" etc., were introduced a decade earlier and are used so pervasively that trying to keep a list of applications would be silly.

There is overwhelming evidence of the huge performance increase produced by vectorization. This performance increase is also the reason that CPU designers include vector instructions in every big enough CPU. Some CPU designers have gone far beyond 4-way vectorization: massive vector processors called GPUs (typically working with 1024-bit vectors) have set speed records for a remarkable range of computations. Current GPUs aren't designed to run full operating systems, so they can't replace current CPUs, but the performance of GPUs illustrates the fundamental efficiency of vectorization.

It's possible for a high-speed vectorized computation to generate enough heat to overwhelm the CPU's cooling, or to consume more power than is available. The CPU then reduces its clock speed somewhat to compensate. (GPUs typically run below 2GHz.) CPU designers are perfectly aware of this, and continue to include vector instructions, because vectorization is still a huge win.

Software without full vectorization

Sometimes a CPU is busy running software that makes little use of the most powerful vector instructions provided by the CPU. There are two basic reasons for this.

Reason 1: Some computations are hard to vectorize. Consider, for example, the RC4 stream cipher. This cipher was designed in the 1980s for high software speed, and for many years it was the main cipher used to encrypt HTTPS traffic. In 2001 (as part of a statement downplaying the latest attacks against RC4), RSA Laboratories described RC4 as "extremely efficient". Today the Wikipedia page on RC4 describes RC4 as "remarkable for its simplicity and speed in software". But let's look at some benchmarks comparing RC4 to my ChaCha20 stream cipher:

Behind the scenes, these ChaCha20 speeds are taking advantage of 128-bit vector instructions on the Core 2 and 256-bit vector instructions on the Skylake, while it's awfully difficult for RC4 to make any use of vector instructions. One of the recurring themes in my research is exploring ways that non-vectorizable computations can be replaced with vectorizable computations.
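To see the structural difference, here is a sketch in Python that models the data flow only. In RC4, every output byte depends on the swap made for the previous byte, and the array indices depend on the evolving state. In ChaCha20, the four columns of the state go through the same add/xor/rotate sequence independently, so each step can be one operation on a 4-lane vector. The 4-tuples below stand in for 128-bit vectors; the helper names are made up, and only the column half of a ChaCha20 double round is shown.

```python
MASK32 = 0xFFFFFFFF


def rc4_keystream(key, n):
    """First n bytes of RC4 keystream for a bytes key."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key schedule
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):  # each step reads the state the previous step wrote
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def vadd(x, y):
    return tuple((p + q) & MASK32 for p, q in zip(x, y))


def vxor(x, y):
    return tuple(p ^ q for p, q in zip(x, y))


def vrotl(x, r):
    return tuple(rotl32(p, r) for p in x)


def column_round(a, b, c, d):
    """ChaCha quarter-round applied to all 4 columns at once.

    a, b, c, d are the four rows of the 4x4 state, each a 4-lane vector.
    Each line stands for one vector operation covering every column.
    """
    a = vadd(a, b)
    d = vrotl(vxor(d, a), 16)
    c = vadd(c, d)
    b = vrotl(vxor(b, c), 12)
    a = vadd(a, b)
    d = vrotl(vxor(d, a), 8)
    c = vadd(c, d)
    b = vrotl(vxor(b, c), 7)
    return a, b, c, d
```

Nothing in `column_round` looks at an individual lane, which is what lets real implementations map each line to a vector instruction. The RC4 loop offers no such opening: the position it reads next is known only after the previous swap.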

Reason 2: Intel keeps changing its vector instruction sets. Intel started releasing

Each new instruction-set extension is a new hassle. Someone needs to figure out how to modify software to use the new instructions, how to deploy the new software for the new CPUs, and how to avoid having software crash if it tries using the new instructions on older CPUs that do not support those instructions. Wide use of the new instructions is years away. Intel often seems to have trouble figuring out which clock speeds are going to be safe for software using the new instructions. Eventually these problems are resolved for each new instruction set, but the resolution can take years.
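One common way to avoid crashes on older CPUs is to check the CPU's features once at startup and pick an implementation accordingly. The following is a sketch of that selection logic only. The feature strings match the flag names that Linux reports in /proc/cpuinfo, but the implementation names are hypothetical placeholders, and real software would obtain the feature set from the CPU or the operating system rather than from a hand-written set.

```python
# Candidate implementations, widest vectors first.
# The implementation names are hypothetical placeholders.
CANDIDATES = [
    ("avx512f", "add_512bit"),
    ("avx2", "add_256bit"),
    ("sse2", "add_128bit"),
]


def select_implementation(cpu_features):
    """Pick the widest implementation whose instructions the CPU supports.

    Falls back to portable code, so that software never tries to run
    instructions that the CPU does not have.
    """
    for feature, implementation in CANDIDATES:
        if feature in cpu_features:
            return implementation
    return "add_portable"


print(select_implementation({"sse2", "avx2"}))  # add_256bit
print(select_implementation(set()))             # add_portable
```

Every new instruction set adds another row to a table like this, along with another implementation to write and test, which is the hassle described above.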

Some vector instruction sets, such as ARM's Scalable Vector Extension (SVE), instead allow "vector-length agnostic" software that works with many different vector sizes. To process an N-bit vector, the software tells the CPU to

These instructions don't state C explicitly. There's one initial software upgrade to use these scalable vector instructions, and then subsequent changes in the vector length C don't need further upgrades. A new CPU with a larger C will automatically run exactly the same software using its larger vector lengths.
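The idea can be modeled in a few lines. This is a sketch of the concept, not of SVE's actual instructions: the parameter `lanes` plays the role of the hardware vector length C, measured here in 32-bit integers instead of bits, and the function name is made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def add_arrays(xs, ys, lanes):
    """Add two equal-length lists of 32-bit integers on a model CPU whose
    vectors hold `lanes` integers.

    The loop never hard-codes the vector length: each iteration handles
    "up to one vector's worth" of whatever remains.
    """
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    out = []
    pos = 0
    while pos < len(xs):
        n = min(lanes, len(xs) - pos)  # the final chunk may be partial
        out.extend(
            (x + y) & MASK32
            for x, y in zip(xs[pos:pos + n], ys[pos:pos + n])
        )
        pos += n
    return out


xs = list(range(10))
ys = [100] * 10
# The same code gives the same answer on "CPUs" with different vector lengths.
print(add_arrays(xs, ys, 4) == add_arrays(xs, ys, 16))  # True
```

Only the number of loop iterations changes with `lanes`; the software itself does not.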

Perhaps ARM will make further changes to the instructions, for example to correct what turn out to be design flaws in the current details. But what matters is the scalable design. The CPU designer is free to try different vector lengths, without the hassle of introducing a new instruction set. The CPU designer can use all the previous vectorized software to test the new vector lengths. The user doesn't have to wait for software modifications to take advantage of the new CPU.

Vectorization denial: the anti-vectors campaign

Out of the 56 speakers at NIST's round-1 conference, there was one speaker making the amazing claim that vectorization is a bad thing. The slides say "We do not give a machine code implementation using SSE etc. We (and others) have found that using these extensions causes overall performance of cryptographic systems to slow down." (Emphasis in original. The speaker was Nigel Smart; the cover slide also lists Martin R. Albrecht, Yehuda Lindell, Emmanuela Orsini, Valery Osheter, Kenneth G. Paterson, and Guy Peer.)

What the anti-vectorization people have "found" is simply the well-documented fact that CPU designers sometimes have to reduce their clock speeds to handle high-speed vectorized computations. See, e.g., Intel's 2014 white paper "Optimizing Performance with Intel Advanced Vector Extensions", which discusses the big picture of interactions between clock speed and use of AVX.

Vectorized software is in heavy use in practically every smartphone CPU, laptop CPU, and server CPU. Vectorization is a huge win despite sometimes needing lower clock speeds. But these anti-vectorization people don't even want to use 128-bit vectors ("We do not give a machine code implementation using SSE etc."). They publish statements whose evident goal is to deter other people from writing and measuring vectorized code. Do they seriously believe that Intel and AMD and ARM have all screwed up by supporting vectorization?

As discussed above, Intel deployed a new instruction set starting in 2011 to handle 256-bit vectors. For some time after this, the most important computations on typical machines weren't using 256-bit vector instructions, simply because the necessary software hadn't been written yet. If a CPU needed to run at lower clock speed to handle an occasional 256-bit computation, then the most important computations would run more slowly, producing a measurable performance loss. But this was merely a temporary side effect of Intel's suboptimal management of the transition to larger vectors. Adding 256-bit support to the most important computations is a huge win—again, this is despite clock speeds sometimes being lower.

We're now seeing the same pattern repeat for Intel's new 512-bit vector instructions. Vlad Krasnov posted a measurement showing an occasional 512-bit computation producing an overall performance loss of 10% in a reasonable-sounding synthetic workload on an Intel Xeon Silver. The anti-vectors campaign views this measurement as a devastating indictment of 512-bit vectors; slides from this to the notion that one also shouldn't use 256-bit vectors; and slides beyond this to the notion that one shouldn't even use 128-bit vectors.

What will happen when the most important computations are upgraded to use 512-bit vectors? Unless Intel botched their hardware design, the answer will be a performance win, despite the reduction in clock speed. As this clock-speed reduction increasingly becomes the norm, the supposed harms of upgrading other computations to use 512-bit vectors will disappear, while the performance benefits will remain.
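The tradeoff between vector width and clock speed can be illustrated with a toy model. This is a sketch with made-up numbers, not measurements: it assumes that the clock-speed reduction applies to the whole workload, and that the upgraded code needs proportionally fewer cycles. The function and parameter names are invented for illustration.

```python
def relative_time(vector_fraction, vector_speedup, clock_ratio):
    """Run time after an upgrade, relative to a baseline of 1.0.

    vector_fraction -- share of baseline time spent in the upgraded code
    vector_speedup  -- factor by which that code's cycle count shrinks
    clock_ratio     -- new clock speed divided by old clock speed
    """
    cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    return cycles / clock_ratio


# Occasional wide-vector use: 2% of the work is upgraded, the clock drops 10%.
print(relative_time(0.02, 2.0, 0.9))  # above 1.0: an overall loss

# The most important computations upgraded: 80% of the work, same clock drop.
print(relative_time(0.80, 2.0, 0.9))  # below 1.0: an overall win
```

With these numbers the first scenario is about 10% slower than the baseline, and the second takes about two thirds of the baseline time. The clock reduction is the same in both; what changes is how much of the work benefits from the wider vectors.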

Because Intel never provided scalable vector instructions, the new ecosystem of software that can use Intel's 512-bit vectors is much smaller than the previous ecosystem of software that uses smaller vectors. Experience suggests that growth of the new ecosystem will produce a transition between two situations:

If you're a software developer, and you take the lead in adding support for 512-bit vector instructions (once you have a 512-bit CPU for testing), then you're giving extra choices to users who have 512-bit CPUs. Users then have the freedom to run your software in the mode that best supports their most important computations. Presumably this will be 512-bit mode in the not-too-distant future, and for some users it is already 512-bit mode today.

You might run into an anti-vectors campaigner who wants to take this freedom away, and who tries to make you feel guilty for clock-speed reductions. According to the campaigner, you are singlehandedly responsible for slowing down every other application on the machine! But the reality is quite different. A single change of clock speed allows larger vectors in many different computations, and if this includes the most important computations for the users then overall there's a speedup.
