Replication (Distributed Data)

Replication means keeping a copy of the same data on more than one machine. It is the central technique of distributed data, because a single copy on a single machine is both a single point of failure and a bottleneck. Multiple copies let a system survive the loss of a node, serve reads from a machine near the user to lower latency, and spread read traffic across many machines to raise throughput.

The PostgreSQL documentation frames the core difficulty plainly: keeping copies in step is “the fundamental difficulty for servers working together,” and because no single approach fits every use case, there are many solutions. It distinguishes synchronous replication, where “a data-modifying transaction is not considered committed until all servers have committed the transaction,” from asynchronous replication, which allows “some delay between the time of a commit and its propagation to the other servers.”

There are three broad architectures. Single-leader (leader-follower) replication sends all writes to one leader that streams them to followers. MongoDB’s replica sets are an example: “one and only one member is deemed the primary node,” and “the secondaries replicate the primary’s oplog and apply the operations to their data sets.” Multi-leader replication accepts writes at more than one node. Leaderless replication, as in Cassandra, has no special write node at all; Cassandra “replicates every partition of data to many nodes across the cluster to maintain high availability and durability,” using a tunable replication factor.

The unavoidable cost of replication is that the copies are not always identical at the same instant. Synchronous schemes pay for consistency with latency and reduced write availability; asynchronous schemes accept that followers can briefly lag behind, which is the basis for weaker guarantees such as eventual consistency.