High Availability

High availability is the goal of keeping a system up and serving requests for as much of the time as possible, minimizing the windows during which it is unavailable. It is usually quantified as a percentage of uptime over some period, and discussed in terms of “nines”: three nines is 99.9 percent available, four nines is 99.99 percent, and so on, with each additional nine cutting the allowed downtime by a factor of ten.

Google’s Site Reliability Engineering book makes these numbers concrete in its availability table. At 99.9 percent availability a service may be down about 8.76 hours per year; at 99.99 percent that budget shrinks to roughly 52.6 minutes per year. The table is useful precisely because it translates an abstract reliability target into a tangible amount of acceptable downtime, which teams can then design and budget against.

Achieving these targets means eliminating single points of failure. Any component whose failure takes down the whole service, a lone database, a single load balancer, one network path, caps the system’s availability at that component’s reliability. High availability therefore relies on the same toolkit as fault tolerance: redundant components, replicated data, health checks, and automatic failover that detects a dead component and routes around it quickly enough that users barely notice.

High availability also lives in tension with consistency. Under the CAP theorem, when a network partition splits a distributed system, a design must choose between staying available and answering every request, or staying strictly consistent and refusing requests it cannot serve correctly. Pushing for more nines often means favoring availability and accepting weaker consistency guarantees, which is why high availability is a design tradeoff rather than a setting that can simply be turned up.

Sources

Related