Consistent Hashing

Consistent hashing was introduced by David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy in their 1997 paper “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web.” The paper describes a family of caching protocols for distributed networks that scale gracefully as the network grows, built on a special kind of hashing the authors call consistent hashing.

The problem it solves is rebalancing. With ordinary hashing, such as taking a key’s hash modulo the number of servers, changing the number of servers reshuffles almost every key to a new server. The Karger paper defines a consistent hash function as one that changes minimally as the range of the function changes. Roughly, both keys and servers are mapped onto the same abstract ring, and a key is assigned to the next server it meets going around the ring. When a server is added or removed, only the keys in its immediate neighborhood move; the rest stay put.

This property is exactly what large distributed datastores need to grow and shrink without disruption. Amazon’s Dynamo paper adopts a variant of consistent hashing as its core partitioning scheme: the output range of a hash function is treated as a ring, each node is assigned a position on it, and each data item’s key is hashed to find the node responsible for it. The Dynamo authors note that the principal advantage is that adding or removing a node only affects its immediate neighbors, leaving other nodes undisturbed.

Consistent hashing, often refined with virtual nodes to even out the load, underpins the data placement and rebalancing logic in systems such as Dynamo, Apache Cassandra, and many distributed caches. It is one of the quiet but essential ideas that make horizontally scalable storage practical.

Sources

Related