Gossip Protocol

A gossip protocol is a way of spreading information through a cluster of machines by imitating how a rumor, or an epidemic, spreads through a population. Instead of routing every update through a central coordinator, each node periodically picks a few other nodes at random and exchanges what it knows with them. Over successive rounds, information reaches the whole cluster, with the number of informed nodes growing exponentially until everyone has converged.

The foundational treatment of this idea is the 1987 paper “Epidemic Algorithms for Replicated Database Maintenance” by Alan Demers and colleagues at Xerox PARC, presented at the ACM Principles of Distributed Computing conference. The paper analyzes randomized algorithms, including anti-entropy and rumor mongering, for distributing updates among the replicas of a database and driving them toward a consistent state. The authors borrow directly from the mathematics of epidemics to reason about how quickly and reliably updates propagate.

The appeal of gossip is robustness. Because each node only ever talks to a handful of peers and no node is special, the protocol keeps working even when individual machines fail or messages are lost; the redundancy of many independent exchanges routes information around the gaps. This makes gossip well suited to large clusters where central coordination would be a bottleneck or a single point of failure.

Modern distributed systems use gossip to manage cluster membership and failure detection. Apache Cassandra, whose design descends from Amazon’s Dynamo, uses a gossip protocol to propagate basic bootstrapping information such as endpoint membership and protocol versions: each node runs a periodic gossip task, updates its own heartbeat state, and exchanges that state with random peers, while a separate failure detector interprets the gossiped heartbeats to decide whether nodes are up or down.

Sources

Related