Fault Injection

Fault injection is an experimental method for studying how a system behaves when things go wrong by deliberately causing them to go wrong. Rather than waiting for faults to occur naturally, an engineer introduces faults on purpose, such as flipping a bit in memory, forcing a component to return an error, or cutting a network connection, and then observes whether the system’s built-in defenses detect the fault and recover. It is the standard way to test the parts of a system that, by design, are supposed to handle failures but are rarely exercised in normal operation.
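
As a minimal sketch of that loop (inject, then observe), the Python below flips one bit in a stored record and checks whether a checksum catches it. The `store` and `load` functions and the sample payload are hypothetical stand-ins for a system under test, not part of any real tool. CRC-32 from the standard `zlib` module plays the role of the built-in defense.

```python
import random
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def store(payload: bytes):
    """Hypothetical system under test: keep a CRC-32 next to the payload."""
    return payload, zlib.crc32(payload)


def load(payload: bytes, checksum: int) -> bytes:
    """The defense being exercised: reject data whose checksum no longer matches."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: corruption detected")
    return payload


payload, checksum = store(b"altitude=10400;heading=272")

# Inject the fault: invert one randomly chosen bit of the stored payload.
rng = random.Random(0)  # fixed seed so the experiment is repeatable
faulty = flip_bit(payload, rng.randrange(len(payload) * 8))

# Observe: does the defense notice the corruption?
try:
    load(faulty, checksum)
    detected = False
except ValueError:
    detected = True

print("fault detected:", detected)
```

CRC-32 detects every single-bit error, so this particular run should always report detection. A more telling experiment would inject faults the defense is not guaranteed to catch, such as multi-bit corruption or damage to the checksum itself.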

The motivation is that error-handling code is the hardest code to trust. A 1997 NASA-published survey of fault injection techniques, “Fault Injection Techniques and Tools” by Hsueh, Tsai, and Iyer, explains why ordinary observation is insufficient: identifying failures in operational environments is difficult “due to the destructive nature of crashes and prolonged error latency,” so researchers “use an experiment-based approach for studying the dependability of a system.” To do that, the paper notes, practitioners need “specific instruments and tools to inject faults, create failures or errors, and monitor their effects.” Injecting the fault is what makes the recovery path observable.

Fault injection spans a wide range of techniques. Hardware fault injection physically perturbs a system, for example by forcing pin-level faults or exposing memory to radiation, to test fault-tolerant hardware. Software-implemented fault injection mutates a running program’s state, corrupting memory, registers, or messages to emulate the effect of a hardware fault or a bug. At a higher level of the stack, injection can target interfaces and dependencies: a remote call is made to time out, return a malformed response, or fail intermittently, which exposes weaknesses such as missing timeouts or unhandled error cases.
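
Interface-level injection can be sketched as a wrapper placed between a caller and its dependency. Every name here (`with_faults`, `fetch_price`, `price_with_retry`, `InjectedTimeout`) is invented for illustration, and the dependency is a local function standing in for a remote call.

```python
import random


class InjectedTimeout(TimeoutError):
    """Marks a timeout that came from the injector, not from a real outage."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise instead of reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedTimeout("injected: dependency timed out")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Hypothetical remote dependency; here it simply always succeeds."""
    return {"widget": 42}[item]


def price_with_retry(fetch, item, attempts=3, default=None):
    """Code under test: retry on timeout, then fall back to a default."""
    for _ in range(attempts):
        try:
            return fetch(item)
        except TimeoutError:
            continue
    return default


# Intermittent fault: roughly half of the calls fail.
flaky = with_faults(fetch_price, failure_rate=0.5, rng=random.Random(1))
results = [price_with_retry(flaky, "widget") for _ in range(1000)]
fallbacks = results.count(None)
print(f"served by dependency: {1000 - fallbacks}, fell back to default: {fallbacks}")

# Permanent fault: every call fails, so every request must take the fallback path.
dead = with_faults(fetch_price, failure_rate=1.0, rng=random.Random(1))
print("with dependency down:", price_with_retry(dead, "widget", default=-1))
```

Passing the random generator in explicitly keeps the experiment repeatable, and the distinct exception type lets logs separate injected faults from real ones.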

Validation of safety-critical and fault-tolerant systems is the classic use. NASA and others used fault injection to validate distributed flight and spacecraft computers, gathering statistical data on whether faults were correctly detected, isolated, and recovered from. A fault-tolerance claim that has never been tested against injected faults is only a hope. The same logic applies to any system whose recovery code must work the first time it is needed.
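
A campaign that gathers such statistics might look like the following sketch. The system under test is an assumed triple-redundant word with a majority voter, chosen only because it is small. The outcome labels, the one-or-two-fault mix, and the trial count are likewise illustrative and not taken from any NASA study.

```python
import random
from collections import Counter

REPLICAS = 3      # assumed triple modular redundancy
WORD_BITS = 16    # assumed word size
TRIALS = 10_000   # assumed campaign length


def vote(values):
    """Majority voter: return (output, index of the outvoted replica or None).

    Raises RuntimeError when no value has a strict majority.
    """
    value, count = Counter(values).most_common(1)[0]
    if count == len(values):
        return value, None                  # all agree, nothing to isolate
    if count * 2 > len(values):
        faulty = next(i for i, v in enumerate(values) if v != value)
        return value, faulty                # outvote and name the odd one out
    raise RuntimeError("no majority")


def run_trial(rng):
    """Inject one or two bit flips into the replicas and classify the outcome."""
    golden = rng.getrandbits(WORD_BITS)
    values = [golden] * REPLICAS
    for _ in range(rng.choice((1, 2))):
        values[rng.randrange(REPLICAS)] ^= 1 << rng.randrange(WORD_BITS)

    if all(v == golden for v in values):
        return "no_effect"                  # two flips cancelled each other
    try:
        output, _ = vote(values)            # second item names the isolated replica
    except RuntimeError:
        return "detected_only"              # disagreement seen, no safe output
    return "recovered" if output == golden else "silent_failure"


rng = random.Random(42)
outcomes = Counter(run_trial(rng) for _ in range(TRIALS))

effective = TRIALS - outcomes["no_effect"]
coverage = outcomes["recovered"] / effective
for label in ("recovered", "detected_only", "silent_failure", "no_effect"):
    print(f"{label:>15}: {outcomes[label]}")
print(f"recovery coverage over effective faults: {coverage:.3f}")
```

The interesting bucket is `silent_failure`: two replicas that suffer the same flip outvote the healthy one, a weakness that would stay invisible without injecting the double fault.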

Fault injection is also the technical root of chaos engineering, which applies the same idea (deliberately causing failures to learn how the system responds) to whole distributed systems running in production. Where classical fault injection often targets a component in a controlled lab, chaos engineering injects faults at the level of services and infrastructure under real traffic. In both cases the principle is identical: the only reliable way to know a system survives a fault is to make the fault happen and watch.
