Fault Tolerance

Fault tolerance is the property of a system that lets it keep doing its job even when some of its parts break. In a distributed system, components fail constantly: disks die, machines reboot, network links drop packets, and entire data centers lose power. A fault-tolerant design assumes these events are normal rather than exceptional, and arranges for the system as a whole to survive them.

The basic tool is redundancy. If one copy of a component can fail, you keep more than one, so the loss of any single copy does not take down the service. This shows up as replicated data across multiple machines, multiple servers behind a load balancer, and standby nodes ready to take over. Amazon’s Builders’ Library article on static stability describes a strong version of this idea: a service is provisioned across multiple Availability Zones with enough spare capacity that, in the words of authors Becky Weiss and Mike Furr, “even if an entire Availability Zone were impaired, the servers in the remaining Availability Zones could carry the load.”

A key insight in that article is that the response to a failure should require as little new action as possible. A “statically stable” service is designed to “continue operating correctly without having to launch any new EC2 instances, even if an Availability Zone were to become impaired.” The reasoning is that the moment of failure is exactly when dependencies are most likely to be unavailable, so a recovery plan that depends on those dependencies can fail precisely when it is needed.

Beyond redundancy and failover, fault tolerance includes graceful degradation: when full service cannot be maintained, the system sheds non-essential work and keeps its core function alive rather than collapsing entirely. Taken together, these techniques make fault tolerance the central goal of distributed-systems design, since the whole reason to spread work across many machines is to outlive the failure of any one of them.