Heisenbug

A heisenbug is a bug that appears to change its behavior, or disappear entirely, when you attempt to study it. The name is a pun on the Heisenberg uncertainty principle, or more precisely on the observer effect popularly associated with it: the act of observation disturbs the system. Attach a debugger, add a print statement, turn off compiler optimization, or simply run the program again, and the fault may stop reproducing. The bug is real, but it hides from the very tools used to find it, which makes it one of the most frustrating categories of software defect.

The term entered the engineering literature in Jim Gray’s 1985 Tandem technical report “Why Do Computers Stop and What Can Be Done About It?” (Tandem TR 85.7). Gray, a database and transaction-processing researcher, was studying why production systems fail and how to keep them running. To organize the discussion he distinguished two kinds of software faults by their reproducibility, and his report is the primary source that popularized the Bohrbug and heisenbug terminology in fault-tolerance work.

A Bohrbug, named for the deterministic Bohr model of the atom, is a solid, repeatable fault. Given the same input and state, it fails the same way every time, so it can be reproduced on demand and tracked down with ordinary debugging. A heisenbug is the opposite: a transient fault that depends on subtle, hard-to-control conditions such as timing between concurrent tasks, the exact contents of memory, the order of events, or uninitialized state. Because those conditions shift from run to run, the bug is intermittent. Crucially, Gray observed that these transient faults often do not recur on retry: if you simply restart the failed operation, the precise timing or state that triggered it is unlikely to repeat, and the operation succeeds.

That observation has a profound practical consequence, and it was Gray’s real point. If most production faults are transient heisenbugs rather than solid Bohrbugs, then a system that can detect a failure and retry, or fail over to a fresh process, will mask a large fraction of faults without ever diagnosing them. Gray reported that in the failures he studied, the great majority behaved as heisenbugs that did not reproduce on retry. This is the engineering justification for techniques like process pairs, transactions, and automatic restart: you achieve high availability not by eliminating every bug but by surviving the transient ones.
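The retry idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical and is not taken from Gray’s report: `TransientFault`, `retry`, and `make_flaky_operation` are invented names, and the "fault" is simulated by an operation that fails a fixed number of times before succeeding.

```python
class TransientFault(Exception):
    """Stands in for a failure caused by unlucky timing or state (hypothetical)."""


def retry(operation, attempts=3):
    """Call operation until it succeeds or the attempt budget is spent.

    Returns (result, attempt_number). The fault is masked, never diagnosed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation(), attempt
        except TransientFault as error:
            last_error = error  # remember the fault and try again
    raise last_error  # the fault recurred on every attempt


def make_flaky_operation(failures_before_success):
    """Build a simulated operation that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("unlucky interleaving")
        return "committed"

    return operation


result, attempts_used = retry(make_flaky_operation(1))
print(result, attempts_used)  # prints: committed 2
```

A fault that recurs on every attempt, which is how a Bohrbug behaves, exhausts the budget and is re-raised. Retry therefore helps only with the transient class. A real system would also need the retried operation to be safe to repeat, which is one reason transactions appear alongside restart in the list above.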

Heisenbugs are especially common in concurrent and distributed software, where the relative timing of independent threads or machines is nondeterministic. A race condition is a classic source of heisenbugs, since it manifests only when operations interleave in a particular order, and adding instrumentation can perturb the timing enough to hide it (see race-condition). Optimization-sensitive bugs behave similarly: code that fails in an optimized release build may run cleanly in a debug build, because building without optimization changes the timing or memory layout that the fault depended on.
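The dependence on interleaving can be made concrete with a small, deterministic Python sketch. It is a simulation, not real threading: each task is a generator that pauses between its read and its write, and the hypothetical `run` function advances the tasks in whatever order a schedule string dictates.

```python
def increment(shared):
    """One task: read the counter, then write back the value plus one."""
    value = shared["counter"]      # step 1: read
    yield                          # point where the scheduler may switch tasks
    shared["counter"] = value + 1  # step 2: write


def run(schedule):
    """Run two increment tasks, advancing them in the order given by schedule."""
    shared = {"counter": 0}
    tasks = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        next(tasks[name], None)  # advance that task by one step
    return shared["counter"]


print(run("AABB"))  # A finishes before B starts: prints 2
print(run("ABAB"))  # B reads before A writes, so one update is lost: prints 1
```

With real threads the schedule is chosen by the operating system rather than by a string, so the losing interleaving may appear only rarely, and a debugger or an extra log line can shift the timing enough that it stops appearing at all.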

The lasting value of the Bohrbug and heisenbug distinction is that it reframes debugging strategy. For a Bohrbug, you reproduce and fix. For a heisenbug, reproduction is the hard part, so engineers turn to logging, deterministic replay, stress and soak testing, and tolerance mechanisms that contain the fault rather than rely on catching it live. Gray’s report remains the canonical reference, and the vocabulary it introduced is still in everyday use among practitioners discussing why a system stops and what can be done about it (see jim-gray).
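One way to combine stress testing with deterministic replay is to drive the schedule from a recorded random seed, as in the Python sketch below. It assumes the same simulated lost-update task as a stand-in for real concurrent code. The names `run_with_seed` and `find_failing_seed` are hypothetical, and a production replay tool would have to record far more than a seed.

```python
import random


def increment(shared):
    """Simulated task with a lost-update race: read, pause, write."""
    value = shared["counter"]
    yield
    shared["counter"] = value + 1


def run_with_seed(seed, workers=2):
    """Run the workers under a pseudo-random schedule derived from seed."""
    rng = random.Random(seed)  # private generator, so the schedule is repeatable
    shared = {"counter": 0}
    pending = [increment(shared) for _ in range(workers)]
    while pending:
        task = rng.choice(pending)
        try:
            next(task)            # advance the chosen task by one step
        except StopIteration:
            pending.remove(task)  # task has finished its write
    return shared["counter"]


def find_failing_seed(max_seeds=1000, workers=2):
    """Stress test: try many schedules and return the first seed that loses an update."""
    for seed in range(max_seeds):
        if run_with_seed(seed, workers) != workers:
            return seed
    return None


seed = find_failing_seed()
if seed is not None:
    # Replaying the recorded seed reproduces the same schedule and the same failure.
    print("failing seed:", seed, "counter:", run_with_seed(seed))
```

Once a failing seed is known, the intermittent fault behaves like a Bohrbug: it can be rerun on demand, under a debugger or with extra logging, without the schedule changing.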

Sources

Related