Debugging `clang` with `rr`

A couple of months ago, I set out to debug a tricky issue that caused crashes in clang when compiling mp-units.

The bug manifested as a non-deterministic stack overflow and, occasionally, spurious diagnostics.

The problem originated from an unexpected interaction between two components: ASTContext::getAutoTypeInternal and llvm::FoldingSetBase::FindNodeOrInsertPos.

The smallest reproducer that triggers the bug looks like this:

template <typename>
concept C1 = true;

template <typename, auto>
concept C2 = true;

template <C1 auto V, C2<V> auto>
struct S;

When declaring the template S, we use two non-type template parameters: V and an unnamed parameter (let’s call it X). The constrained auto types of these parameters are stored in the AutoTypes member of ASTContext, which was originally an llvm::ContextualFoldingSet. When an entry is stored, a FoldingSetNodeID is generated from various pieces of information, including the value of a pointer (to a type, IIRC). This pointer can vary between runs due to ASLR, leading to different hash values and potentially placing the entries for X and V in the same bucket.
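To make the role of the pointer concrete, here is a toy sketch of how an address-dependent ID can change which bucket an entry lands in. The `toyBucket` function, its mixing formula, and the addresses used below are all made up for illustration; LLVM's real hashing is different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-in for "hash the ID, then pick a bucket". Not LLVM's algorithm.
// `address` models the pointer that ends up in the ID; `kind` models the
// remaining data, which is the same on every run.
inline std::size_t toyBucket(std::uintptr_t address, unsigned kind,
                             std::size_t numBuckets) {
    // The bucket count must be a power of two so that masking selects a bucket.
    assert(numBuckets != 0 && (numBuckets & (numBuckets - 1)) == 0);
    return static_cast<std::size_t>((address >> 4) ^ kind) & (numBuckets - 1);
}
```

With 64 buckets, `toyBucket(0x1000, 1, 64)` and `toyBucket(0x1040, 2, 64)` give buckets 1 and 6, but moving the second address to `0x1030` puts both entries in bucket 1. Nothing about the program changed, only where an object happened to be placed in memory.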

This situation wouldn’t be problematic if llvm::FoldingSetBase stored the FoldingSetNodeID of each entry. However, it doesn’t. Instead, it recalculates the FoldingSetNodeID each time it needs to compare entries. When the calculation involves an auto type, it triggers a recursive call to ASTContext::getAutoTypeInternal, which in turn calls (several frames later) llvm::FoldingSetBase::FindNodeOrInsertPos again. This recursive loop continues until it causes a stack overflow, crashing clang.
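The re-entrancy is easier to see in a stripped-down model. The sketch below is not clang's or LLVM's code: `ToyNode`, `ToySet`, and `lookupOverflows` are hypothetical names, the hash is replaced by an explicit key-to-bucket table, and a depth limit stands in for the real stack overflow. It only assumes what is described above: IDs are recomputed on every comparison, and recomputing one entry's ID requires another lookup in the same set.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only: none of these types exist in LLVM or clang.
struct ToyNode {
    int key;        // stands in for the data folded into the node's ID
    int dependsOn;  // key that must be looked up to recompute the ID, or -1
};

class ToySet {
public:
    // bucketOfKey[k] plays the role of the hash of key k.
    ToySet(std::vector<std::size_t> bucketOfKey, std::size_t numBuckets,
           int depthLimit)
        : bucketOfKey_(std::move(bucketOfKey)),
          buckets_(numBuckets),
          depthLimit_(depthLimit) {}

    void insert(ToyNode node) { buckets_[bucketFor(node.key)].push_back(node); }

    // No IDs are stored, so each comparison recomputes the candidate's ID.
    bool find(int key) {
        if (depth_ >= depthLimit_) {  // stand-in for the real stack overflow
            overflowed_ = true;
            return false;
        }
        ++depth_;
        bool found = false;
        for (const ToyNode& node : buckets_[bucketFor(key)]) {
            if (profile(node) == key) { found = true; break; }
            if (overflowed_) break;
        }
        --depth_;
        return found;
    }

    bool overflowed() const { return overflowed_; }

private:
    std::size_t bucketFor(int key) const {
        assert(key >= 0 && static_cast<std::size_t>(key) < bucketOfKey_.size());
        std::size_t bucket = bucketOfKey_[static_cast<std::size_t>(key)];
        assert(bucket < buckets_.size());
        return bucket;
    }

    // Recomputing an ID may need a lookup of its own, which re-enters find().
    int profile(const ToyNode& node) {
        if (node.dependsOn >= 0) find(node.dependsOn);
        return node.key;
    }

    std::vector<std::size_t> bucketOfKey_;
    std::vector<std::vector<ToyNode>> buckets_;
    int depthLimit_;
    int depth_ = 0;
    bool overflowed_ = false;
};

// Key 0 models V; key 1 models X, whose ID depends on V.
// X is inserted first so that a lookup of V meets X before V in a shared bucket.
inline bool lookupOverflows(std::size_t bucketOfV, std::size_t bucketOfX) {
    ToySet set({bucketOfV, bucketOfX}, /*numBuckets=*/2, /*depthLimit=*/64);
    set.insert(ToyNode{1, 0});   // X
    set.insert(ToyNode{0, -1});  // V
    set.find(0);                 // look up V
    return set.overflowed();
}
```

In this model `lookupOverflows(0, 0)` trips the depth guard, because looking up V has to compare against X, and recomputing X's ID looks up V again. `lookupOverflows(0, 1)` does not, since the lookup of V never touches X's bucket. That mirrors why the crash depended on which bucket each entry happened to land in.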

The tricky part of debugging this issue was its random nature: the crash happened in only about 10% of runs. Even with gdb attached to a failing run, a single step too far would trigger the crash, and it could take multiple reruns to catch another failure.

This is where rr came in handy. By running rr in a loop until a crash happened, I could consistently capture the failure. The loop looked something like this:

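# Record clang repeatedly until one run fails, keeping only the failing trace.
while true; do
    if ! rr record ./llvm/cmake-linux-debug/bin/clang -std=c++20 crash.cpp -c; then
        break  # non-zero exit status: keep this trace for replay
    fi
    rr rm clang-0  # successful run: discard its trace before trying again
done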

Once I captured a crash, the execution was recorded! I could use rr replay clang-0 to replay the execution as many times as needed, with the same outcome each time.

Additionally, with commands like reverse-continue, even if I made a mistake and caused the crash, I could jump back in time to before the function call and continue debugging as if nothing had happened.

rr proved to be an invaluable tool, and I regret not discovering it sooner, especially considering it has been around for over a decade.