Retry With Backoff

Retry with backoff is the discipline of retrying a failed network call sensibly rather than immediately and repeatedly. Many failures in distributed systems are transient: a brief overload, a dropped packet, a momentary blip. A retry will often succeed where the first attempt failed, which is why retrying is one of the oldest tools for building reliable systems on top of unreliable networks. The trouble is that a naive retry can make a bad situation far worse: if every client retries immediately and repeatedly, the extra requests pile onto a service that is already overloaded.

The first refinement is exponential backoff: instead of retrying at a fixed interval, the client waits longer after each successive failure, roughly doubling the delay each time. This reduces the pressure on a service that is already struggling, because clients ease off rather than pounding it at full rate. Backoff is paired with a cap on the maximum delay and a limit on the number of attempts, so a client does not wait forever or retry indefinitely against a dependency that is genuinely down.
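As a minimal sketch of the schedule described above, the following Python computes capped, doubling delays and wraps a call in a bounded retry loop. The names backoff_delays and call_with_backoff, the default values, and the choice of ConnectionError as the retryable failure are all illustrative assumptions, not taken from any particular library.

```python
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, max_attempts=6):
    """Return the wait before each retry: base * factor**n, clamped to cap."""
    # max_attempts counts the first try, so there are max_attempts - 1 waits.
    return [min(cap, base * factor ** n) for n in range(max_attempts - 1)]


def call_with_backoff(operation, base=0.1, factor=2.0, cap=5.0,
                      max_attempts=6, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with growing delays."""
    delays = backoff_delays(base, factor, cap, max_attempts)
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            sleep(delays[attempt])
```

With base=1.0, cap=10.0 and max_attempts=6 the waits are 1, 2, 4, 8 and 10 seconds: the fifth wait would be 16 seconds but is clamped by the cap. Passing sleep as a parameter is only a convenience so the loop can be exercised without real waiting.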

Exponential backoff alone has a hidden flaw, which the AWS Architecture Blog article “Exponential Backoff And Jitter,” written by Marc Brooker and published on March 4, 2015, makes precise. If many clients fail at the same moment, perhaps because the server briefly went down, and they all back off by the same exponential schedule, they retry in synchronized waves. Each wave slams the recovering server at once and may knock it down again. The clients are politely backing off, yet still acting in lockstep.
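The lockstep effect can be illustrated with a toy model. The sketch below assumes every client fails at time zero and follows the same deterministic schedule; retry_times and arrivals are hypothetical helper names, and the model ignores network latency and request duration.

```python
from collections import Counter


def retry_times(base=1.0, factor=2.0, cap=30.0, retries=4):
    """Cumulative times (seconds after the failure) at which retries fire."""
    times, now = [], 0.0
    for n in range(retries):
        now += min(cap, base * factor ** n)
        times.append(now)
    return times


def arrivals(num_clients, **schedule):
    """Count how many clients hit the server at each instant."""
    load = Counter()
    for _ in range(num_clients):
        for t in retry_times(**schedule):
            load[t] += 1
    return load
```

Under these assumptions, arrivals(100) puts all 100 clients on the server at exactly 1, 3, 7 and 15 seconds after the failure, with nothing in between. The delays grow, but the herd stays together.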

The fix is jitter: adding randomness to the wait so that retries spread out in time instead of clustering. Brooker’s analysis compares plain exponential backoff against “Full Jitter,” which picks a random delay anywhere up to the current backoff bound, and “Decorrelated Jitter,” which draws each delay at random from a range that grows with the previous delay rather than with the attempt count. Both jittered approaches substantially reduce client work and server load. The article concludes that “the return on implementation complexity of using jittered backoff is huge, and it should be considered a standard approach for remote clients.”
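The two jittered strategies can be sketched in a few lines of Python, following the formulas given in Brooker's article: Full Jitter sleeps for a random time between zero and min(cap, base * 2 ** attempt), and Decorrelated Jitter sleeps for min(cap, random between base and three times the previous sleep). The function names, default values, and the rng parameter are assumptions made for this illustration.

```python
import random


def full_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Random delay in [0, min(cap, base * 2**attempt)]."""
    bound = min(cap, base * 2 ** attempt)
    return rng.uniform(0, bound)


def decorrelated_jitter(previous, base=0.1, cap=10.0, rng=random):
    """Random delay in [base, 3 * previous], clamped to cap.

    Pass base as `previous` for the first retry, then feed each
    result back in as `previous` for the next one.
    """
    return min(cap, rng.uniform(base, previous * 3))
```

A caller would replace the fixed delay in a retry loop with one of these values. Accepting rng makes it possible to pass a seeded random.Random instance when reproducible delays are wanted, for example in tests.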

Retry with backoff and jitter does not stand alone. It pairs with the circuit breaker, which decides when to stop retrying entirely so a client is not endlessly hammering a service that is clearly down. It also depends on idempotency, because a request that may be sent more than once must be safe to repeat: retrying a non-idempotent operation can double-charge a customer or duplicate a record. Together these patterns let a client recover from transient failures without amplifying them into an outage.
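Both companion patterns can be sketched briefly. The code below is a simplified illustration under stated assumptions, not a production design: CircuitBreaker opens after a run of consecutive ConnectionError failures and allows a trial call once a cool-down has elapsed, and Ledger is a toy server-side store that applies each idempotency key at most once. All class names, thresholds, and the key scheme are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked down; not calling")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


class Ledger:
    """Toy store that applies each idempotency key at most once."""

    def __init__(self):
        self.applied = {}
        self.balance = 0

    def charge(self, idempotency_key, amount):
        if idempotency_key not in self.applied:
            self.balance += amount
            self.applied[idempotency_key] = amount
        return self.balance
```

In this sketch a retry loop would route each attempt through CircuitBreaker.call and stop retrying when CircuitOpenError is raised. The client would generate one idempotency key per logical operation and reuse it on every retry, so a repeated charge leaves the balance unchanged.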
