The cr.yp.to blog



2024.08.03: Clang vs. Clang: You're making Clang angry. You wouldn't like Clang when it's angry. #compilers #optimization #bugs #timing #security #codescans

This is a blog post about an experiment with Clang. I need to explain some context first.

Compiler "optimizations". Try skimming through recent changes to LLVM and GCC. You'll find "optimizations", and tests for "optimizations", and fixes to tests for "optimizations", and fixes to bugs in "optimizations".

The bugs admitted in the compiler changelogs are just the tip of the iceberg. Whenever possible, compiler writers refuse to take responsibility for the bugs they introduced, even though the compiled code worked fine before the "optimizations". The excuse for not taking responsibility is that there are "language standards" saying that these bugs should be blamed on millions of programmers writing code that bumps into "undefined behavior", rather than being blamed on the much smaller group of compiler writers subsequently changing how this code behaves. These "language standards" are written by the compiler writers.

Evidently the compiler writers find it more important to continue developing "optimizations" than to have computer systems functioning as expected. Developing "optimizations" seems to be a very large part of what compiler writers are paid to do.

I'm putting "optimizations" in quotes because compiler "optimizations" are generally nowhere near the performance that competent programmers can achieve. As a cryptographic example, benchmarks across many CPUs show that the avx2 implementation of kyber768 is about 4 times faster than portable code compiled with an "optimizing" compiler. There are many more examples like this.

Compiler writers measure an "optimization" as successful if they can find any example where the "optimization" saves time. Does this matter for the overall user experience? The typical debate is over whether the speedups reported on such benchmarks justify the costs of the resulting bugs.

But both sides of this debate are founded upon an invalid measurement methodology. The actual speedup produced by compilers on real workloads is smaller than those benchmarks suggest, and shrinking, as explained in my talk "The death of optimizing compilers" in 2015.

If you look at the hot spots in a software system, the code running often enough for the user to care about performance, then you find tons of intrinsics and assembly language. There are 160000 lines of assembly (.asm and .S files) in FFmpeg, for example. Faster computers and faster networks are handling more and more data (e.g., bigger videos for FFmpeg), putting more and more load on the hot spots. Benchmarks selected to show the effect of compiler "optimizations" are not representative of how CPU time is actually being spent.

Meanwhile there are more and more bugs produced by these "optimizations", and one has to ask how many gazillions of dollars have been lost because of that. Consider, for example, security. Deloitte reported that 2023 IT security budgets were half a percent of corporate revenue, which sounds like hundreds of billions of dollars overall given that total corporate revenues worldwide were above $48 trillion in 2022. (Some caveats: perhaps Deloitte's half percent was an unweighted average over corporations; not all corporations respond to surveys.) It would be interesting to study what percentage of security failures can be partly or entirely attributed to compiler "optimizations".

Timing leakage. The security problems caused by "optimizing" compilers aren't just traditional bugs, but also unintentional leakage of secret information into timings, often allowing those secrets to be reconstructed by timing attacks. To quote a EuroS&P 2018 paper by Laurent Simon, David Chisnall, and Ross Anderson: "A compiler upgrade can suddenly and without warning open a timing channel in previously secure code. This arms race is pointless and has to stop."

Is there actually an arms race? Let's look at the evidence.

The example highlighted in the 2018 paper used a bool to select between two values. Obviously, the bool was triggering the compiler to create conditional jumps. The paper continued by acknowledging the common practice of eliminating bool:

    So an extra layer of obfuscation used by cryptographers is to eradicate bool completely in critical code; and to have specially-crafted functions to compare integers in constant time too. OpenSSL currently declares 37 different functions to support this. Unfortunately, compilers offer no guarantees to such code; the next version of the same compiler may silently understand it and optimize the constant-timeness away. Examples of such failures include the carefully-crafted constant-time implementation of curve25519 which was broken by Microsoft’s compiler in 2015 [30].

The one example claimed at the end of this quote is a misunderstanding triggered by the title of the cited 2015 paper, namely "When constant-time source yields variable-time binary: Exploiting curve25519-donna built with MSVC 2015". What was actually happening in the 2015 paper was that the int64 operations in curve25519-donna were, when compiled for 32-bit x86, converted into calls to Microsoft's 32-bit int64 library, specifically llmul.asm, where Microsoft had made the mistake of using data-dependent branches. Any reasonable concept of source code should include this variable-time library: that's where the timing leak was created, and it's where the timing leak should be fixed.
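
For concreteness, here's a minimal illustration of the kind of source line involved. This is my own example, not code from curve25519-donna or from the 2015 paper; the point is simply that, when targeting 32-bit x86, the compiler typically lowers the 64-bit multiplication to a call into its int64 support library, so the timing behavior is determined by that library:

    /* my illustration: a plain 64-bit multiplication, which MSVC targeting
       32-bit x86 typically turns into a call to its int64 helper library
       (llmul.asm) rather than inline code */
    long long mul64(long long a, long long b)
    {
      return a * b;
    }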

Later examples in an S&P 2020 paper also used bool. So, hmmm, could it be that avoiding secret comparisons and secret bool in source code stops compilers from producing secret conditional branches? (The reason for listing comparisons separately here is that, technically, C comparisons produce int rather than bool, even if the compiler thinks of them as producing bool internally.)

Unfortunately, no, there really is an arms race here. Never underestimate the ability of "optimizing" compilers to screw things up.

In June 2024, Antoon Purnal reported a successful timing attack against the Kyber reference code compiled with some of the "optimization" options for Clang 15 (released in 2022) or newer. The reference code had a computation of the form (-((x>>j)&1))&y, which is y if bit j of x is set, else 0. See the problem? CLANG SMASH! The compiler used a bit-test instruction to convert bit j of x into a bool, and then performed a conditional branch based on this bool.
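
Here's a hedged distillation of that pattern as a standalone function (the function name and exact types are mine, not the reference code's). The source is branch-free; the problem is that the compiler can see that the intermediate value is 0 or 1, and then feels free to reintroduce a bit test plus a conditional branch:

    #include <stdint.h>

    /* returns y if bit j of the secret x is set, else 0; assumes j < 32.
       written branch-free, but a compiler that notices the 0/1 intermediate
       can convert it into a bool and branch on it */
    uint32_t select_if_bit(uint32_t x, unsigned j, uint32_t y)
    {
      uint32_t bit = (x >> j) & 1; /* 0 or 1 */
      uint32_t mask = -bit;        /* 0xFFFFFFFF or 0 */
      return mask & y;
    }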

(As a side note, I would expect this conditional branch to slow down more code than it speeds up. But remember that compiler writers measure an "optimization" as successful if they can find any example where the "optimization" saves time.)

Inside LLVM, this "optimization" is handled by combineShiftAnd1ToBitTest in lib/CodeGen/SelectionDAG/DAGCombiner.cpp. The function combineShiftAnd1ToBitTest was added by Sanjay Patel in September 2019, and was tweaked by various people later. I wonder whether anyone has found earlier examples of compiler "optimizations" crossing the line into introducing bool.

A subsequent compiler patch crossing the same line was a GCC patch by ARM in November 2021 to convert (-x)>>31 into -(x>0). I issued a warning about this in April 2024.

TIMECOP 2, which is built into the SUPERCOP cryptographic test framework, automatically scans compiled code for conditional branches derived from secrets, if the code was declared to be constant-time. (It also scans code for array indices derived from secrets, and the KyberSlash paper describes a patch to scan for divisions derived from secrets.) Checking TIMECOP results is what led to my April 2024 warning.

The original TIMECOP 1 from Moritz Neikes was also a modification to SUPERCOP, automating an approach taken by Adam Langley's ctgrind. Some differences between TIMECOP 1 and TIMECOP 2: TIMECOP 2 automatically marks RNG output as secret; TIMECOP 2 supports "declassification"; TIMECOP 2 supports designation of "public inputs"; TIMECOP 2 runs on multiple cores. (This last difference is also a contribution from the KyberSlash paper.)

TIMECOP has limitations: it supports only instructions supported by Valgrind (e.g., it gives up on AMD XOP instructions), and the data flow that it checks is only the data flow visible in the test runs that it carries out. There's continued work on tools for checking constant-time behavior. But I'm happy to report that the equivalent of TIMECOP is now built into the test suite for libmceliece, and I hope this spreads to other libraries.

If you've identified a variable-time code snippet, how do you rewrite it to run in constant time, while making sure that the rewrite doesn't introduce any bugs? I gave a talk about this in July 2024. Part of the talk was explaining some constant-time functions provided by libmceliece and SUPERCOP; these functions are provided by files crypto_{int,uint}{8,16,32,64}.h that you're free to copy into your own projects. As one example, the function crypto_uint32_bitmod_mask(x,j) has the same effect as -((x>>(j&31))&1), but stops the compiler from seeing that there's a 1-bit result. A fancier example is crypto_uint32_max(x,y).
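
As a rough sketch of the bitmod_mask idea (this is not the actual crypto_uint32.h code; it's merely one way to try hiding the 1-bit intermediate from the optimizer, with no guarantee against future "optimizations"):

    #include <stdint.h>

    /* sketch only, not the SUPERCOP/libmceliece implementation: compute
       -((x>>(j&31))&1) while laundering the 1-bit intermediate through a
       volatile so the compiler cannot easily prove it fits into a bool */
    uint32_t bitmod_mask_sketch(uint32_t x, uint32_t j)
    {
      volatile uint32_t bit = (x >> (j & 31)) & 1;
      return -bit; /* all-ones mask if the bit was set, else 0 */
    }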

For comparison, the 2018 paper reported a tweak to Clang/LLVM to add language support for a constant-time function __builtin_ct_choose(bool cond, x, y). The 2018 paper also incorrectly suggested that this was the only such function needed. Maybe this function will get into compilers someday, but clearly it'll be a long time before you can rely on this function being present for your projects, and the way it's implemented strikes me as more fragile than the way crypto_{int,uint}{8,16,32,64}.h are implemented.

Proactively avoiding problems. To the extent that timing leaks introduced by compilers are detected by pre-deployment test suites for compiled libraries, we can revert to an older compiler version for deployment while we're rewriting the code, so the users stay safe at each moment. But can we prevent compilers from introducing timing leaks in the first place?

One attractive answer is to distribute the libraries as assembly language. If your reaction is "Yikes, this makes software correctness hard to audit": The RWC 2024 talk "Adoption of high-assurance and highly performant cryptographic algorithms at AWS" presented fast X25519 software proven to correctly compute X25519 on all inputs. The software is written in assembly language (two versions targeting different 64-bit Intel/AMD CPUs, and two versions targeting different 64-bit ARM CPUs); the correctness statement is a theorem about the machine code, the same code that users are running; the proof is verified by the HOL Light theorem prover.

However, the argument against assembly is valid for cryptographic software that hasn't reached this gold standard yet. So I've also been looking at ways to rapidly introduce anti-timing-leak vaccinations into code written in C, C++, etc.

The shared feature of x&1 and x>>31 is that there are just two possibilities for the result: x&1 is 0 or 1; x>>31 is 0 or 1 if x is uint32; x>>31 is 0 or -1 if x is int32. (Side note: always compile with -fwrapv so that GCC and Clang assume two's-complement arithmetic.) In each case, someone writing a compiler "optimization" can easily say "Hey, I can stick that 1-bit result into a bool". There are more possibilities to think about (what about x&2? what about x<<31? what about 2-bit results?), but let's focus on these examples as case studies.
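
To spell out these case studies as code (my own tiny example, using uint32_t and int32_t from <stdint.h>):

    #include <stdint.h>

    /* each result below can take only two values, which is exactly what
       tempts a compiler "optimization" to materialize a bool */
    void two_possibilities(uint32_t u, int32_t s,
                           uint32_t *a, uint32_t *b, int32_t *c)
    {
      *a = u & 1;   /* 0 or 1 */
      *b = u >> 31; /* 0 or 1: unsigned (logical) shift */
      *c = s >> 31; /* 0 or -1: signed (arithmetic) shift, i.e., a sign mask */
    }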

Simply scanning source code for &1, 1&, >>31, and so on finds many examples, but I've also tried a quick experiment with another type of scanning, which is what the title of this blog post is referring to. I'm not sure this second type is better overall, but it does seem to have some interesting capabilities.

I wrote a simple patch for the LLVM "optimizer" (starting from commit 68df06a0b2998765cb0a41353fcf0919bbf57ddb) to scan for &1 and >>31, and to issue remarks saying "please take this away before clang does something bad". Here's an example of compiling a test function with clang -Rpass-analysis=clang-vs-clang -O -c x.c after the patch:

    x.c:3:5: remark: clang-vs-clang: clang sees signed>>(bits-1); please take this away before clang does something bad [-Rpass-analysis=clang-vs-clang]
        3 |   x >>= 31;
          |     ^
    x.c:3:5: remark: clang-vs-clang: clang sees signed>>(bits-1); please take this away before clang does something bad [-Rpass-analysis=clang-vs-clang]

This is the test function:

    int sra31(int x)
    {
      x >>= 31;
      return x;
    }

Repetitions of the remark are unsurprising: compilers will keep trying to apply "optimizations" until they stop making progress.

The clang-vs-clang output distinguishes signed from unsigned for shifts. This distinction matters for a (manual or automatic) rewrite in terms of crypto_{int,uint}{8,16,32,64}.h. (One way to automate source transformations is via clang-tidy.) Of course, code omitted because of #ifdef, or otherwise eliminated before this "optimization" step, won't trigger any remarks from clang-vs-clang.

I then ran SUPERCOP 20240716 (./data-do-biglittle on a dual EPYC 7742 with overclocking disabled), after adjusting SUPERCOP's compiler list to use clang-vs-clang (adding -Rpass-analysis=clang-vs-clang to the clang lines in okcompilers/{c,cpp}).

Results were ready three hours later. There were 675752 lines from Clang, in total 210786494 bytes, compressing to 3595199 bytes in 20240803-fromclang.txt.gz. There's quite a bit of noise in the output because various source-code branches on public data trigger Clang to internally generate &1 for the branch conditions. Skipping past those finds more interesting examples.

Here's an example that had also been found by simpler source-code scans and that is clearly good to proactively change:

    a0 += (a0>>15)&106;
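
A possible rewrite in the spirit of the crypto_{int,uint} headers described above, assuming that a0 has a 16-bit signed type and that the headers provide a crypto_int16_negative_mask function returning -1 for negative inputs and 0 otherwise (check the actual header before copying this):

    /* hedged rewrite: adds 106 exactly when a0 is negative, without exposing
       a 1-bit result to the compiler */
    a0 += crypto_int16_negative_mask(a0) & 106;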

Here's an example where a source-code scan would have required some effort at C parsing. The macro ONE8 is defined as ((uint8_t)1):

    *pk2^=(((*pk_cp)>>ir)&ONE8)<<jr;
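
Again a hedged rewrite, this time assuming (by the naming pattern above) a crypto_uint8_bitmod_mask function returning an all-ones byte if the selected bit is set, and assuming ir and jr are below 8 as in typical bit-packing code; masking the shifted constant is then the same as shifting the extracted bit:

    /* hedged rewrite: mask & (ONE8<<jr) equals bit<<jr when jr < 8 */
    *pk2 ^= crypto_uint8_bitmod_mask(*pk_cp,ir) & (ONE8<<jr);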

Here's another example that's even harder to find by simpler scans:

    mask = signmask_x16(sub_x16(x,const_x16((q+1)/2)));

The macro signmask_x16(x) is defined as _mm256_srai_epi16((x),15), an AVX2 intrinsic to shift each signed 16-bit piece in a 256-bit vector right by 15 bits.

This last one isn't high on my priority list to rewrite. The easiest way I could imagine the vector operation turning into a conditional branch is to compile for AVX-512, which has vectorized bool, and to have the compiler decide for some strange reason to convert those into serial bool for a conditional branch. For the moment, given that TIMECOP uses Valgrind and that Valgrind doesn't support AVX-512, I don't recommend compiling for AVX-512 anyway.

The examples I found most interesting were 64-bit right-shifts of int128 triggering the >> warning. Sure, makes sense that the implementation of int128 is internally using a 63-bit right shift of the top 64-bit word to figure out the sign; but what happens if Clang adds GCC-like support for converting 63-bit right shifts into bool and then into conditional branches? Suddenly all sorts of int128 code will be variable-time, much like what the 2015 paper was claiming but this time really with no bool in the source. I think the easiest way to protect against this at the source level is to avoid the compiler's existing implementation of int128 in favor of some crypto_int128 functions. A side advantage of writing those functions is that crypto_int128, unlike the int128 in GCC and Clang, will work on small 32-bit platforms.
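
As a very rough sketch of what such a type might look like (my speculative layout, not an existing crypto_int128.h; it assumes the crypto_int64/crypto_uint64 types from the existing headers plus a crypto_int64_negative_mask-style helper):

    #include "crypto_int64.h"
    #include "crypto_uint64.h"

    /* speculative sketch: an int128 built from two 64-bit halves, with the
       sign handled via a mask-producing helper so the compiler never sees a
       1-bit "is negative" value */
    typedef struct { crypto_uint64 lo; crypto_int64 hi; } crypto_int128_sketch;

    /* arithmetic right shift by 64: the new high half is the sign mask of hi */
    static crypto_int128_sketch crypto_int128_sketch_shr64(crypto_int128_sketch x)
    {
      crypto_int128_sketch r;
      r.lo = (crypto_uint64) x.hi;
      r.hi = crypto_int64_negative_mask(x.hi); /* 0 or -1, branch-free */
      return r;
    }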

Beyond these scans, there are some other ideas I should mention. Adding support to GCC and Clang for secret data types sounds great, but I don't see how to make this robust given how GCC and Clang are structured. I have more hope for compilers that are built for security in the first place. Security-focused compilers that require new input languages, such as FaCT and the actively developed Jasmin, raise concerns about code-rewriting time, but, c'mon, is this really so scary? We have to be taking some sort of action anyway, given how compilers are handling current code. CLANG SMASH!


Version: This is version 2024.08.03 of the 20240803-clang.html web page.