Bloom Filter

A Bloom filter is a compact, probabilistic data structure for answering one question: is this item a member of a set? It stores no keys, only a fixed array of bits. To add an item, several hash functions map it to several bit positions, and those bits are set to one. To test membership, the same hashes are computed and the corresponding bits are checked: if any is zero, the item is definitely not in the set; if all are one, the item is probably in the set. This asymmetry is the defining feature: a Bloom filter can produce false positives, but never false negatives.

The structure was introduced by Burton H. Bloom in his 1970 Communications of the ACM paper “Space/Time Trade-offs in Hash Coding with Allowable Errors.” Bloom’s insight was that if an application can tolerate a small, controllable rate of false matches, it can store set-membership information in far less space than a structure that holds the actual keys. The paper frames this explicitly as trading a tunable error frequency for reductions in space and reject time.

The cost of this compactness is that false positives accumulate as more items are added, because more bits get set to one. The false-positive rate depends on the size of the bit array, the number of hash functions, and the number of items stored, and these can be chosen in advance to hit a target error rate. Items also cannot be removed from a basic Bloom filter, since clearing a bit might erase information about other items that share it.

Bloom filters are valued precisely where a definite “no” is cheap and a definite “yes” is expensive to verify. Databases use them to skip disk reads for keys that are absent; caches and content systems use them to avoid pointless lookups; networking uses them to summarize large sets in small messages. In each case the filter is a fast first gate: a “no” ends the query immediately, while a “maybe” triggers the slower exact check.

Sources

Related