k-Anonymity: A Model for Protecting Privacy

This 2002 paper by Latanya Sweeney, published in the International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, introduced k-anonymity, one of the first formal models for releasing data while protecting the people in it. Sweeney had earlier shown, famously, that an estimated 87% of the U.S. population could be uniquely identified from just their 5-digit ZIP code, birth date, and sex. That result demonstrated that supposedly anonymized records could be re-linked to named individuals using publicly available data.

k-anonymity defines a concrete standard to guard against that kind of re-linking. A released dataset is k-anonymous if the information for each person “cannot be distinguished from at least k-1 individuals whose information also appears in the release.” The test is applied to the quasi-identifiers: attributes like age, ZIP code, and sex that are not unique on their own but can identify someone when combined. Every record must share its combination of quasi-identifier values with at least k-1 other records. Data holders achieve this through generalization (for example, replacing an exact age with a range) and suppression (removing especially revealing values), trading some detail for the guarantee that no record stands alone.
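
The definition above can be checked mechanically. The following is a minimal Python sketch, not Sweeney's own algorithm: the toy records, the field names, the decade-wide age ranges, the three-digit ZIP prefix, and the helper functions are all assumptions made for illustration. It generalizes age and ZIP code, drops any record whose group is still too small (suppression applied here to whole records), and tests whether every quasi-identifier combination appears at least k times.

```python
from collections import Counter

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"age": 34, "zip": "02138", "sex": "F", "diagnosis": "flu"},
    {"age": 36, "zip": "02139", "sex": "F", "diagnosis": "asthma"},
    {"age": 37, "zip": "02141", "sex": "F", "diagnosis": "flu"},
    {"age": 52, "zip": "02138", "sex": "M", "diagnosis": "diabetes"},
    {"age": 58, "zip": "02142", "sex": "M", "diagnosis": "flu"},
    {"age": 71, "zip": "94110", "sex": "M", "diagnosis": "asthma"},
]

QUASI_IDENTIFIERS = ("age", "zip", "sex")


def group_sizes(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, k, quasi_identifiers=QUASI_IDENTIFIERS):
    """True if every quasi-identifier combination appears at least k times."""
    return all(n >= k for n in group_sizes(rows, quasi_identifiers).values())


def generalize(row):
    """Coarsen age to a decade range and keep only the first three ZIP digits."""
    decade = (row["age"] // 10) * 10
    return {**row, "age": f"{decade}-{decade + 9}", "zip": row["zip"][:3] + "**"}


def suppress_small_groups(rows, k, quasi_identifiers=QUASI_IDENTIFIERS):
    """Drop whole rows whose quasi-identifier group is still smaller than k."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [r for r in rows if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]


generalized = [generalize(r) for r in records]
released = suppress_small_groups(generalized, k=2)

print(is_k_anonymous(records, k=2))      # raw data: every record is unique
print(is_k_anonymous(generalized, k=2))  # generalized: one record still stands alone
print(is_k_anonymous(released, k=2))     # generalized and suppressed
```

In this toy data, generalization alone is not enough: the one record from a different ZIP region still stands alone, so it is suppressed before release. That is the trade-off in miniature, since the released table has coarser ages, truncated ZIP codes, and one fewer record than the original.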

The model was hugely influential as the first widely adopted, checkable definition of de-identification, and it shaped how health and government data were released for years. It also has well-documented limits. Later work showed that k-anonymity alone does not prevent attribute disclosure when everyone in a matching group shares the same sensitive value (a homogeneity attack), and that it offers little protection against an attacker who holds side information about a target. These weaknesses are part of what motivated the later, stronger guarantee of differential privacy.
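
The attribute-disclosure weakness is easy to see on a small example. The sketch below uses a hypothetical four-row release that satisfies 2-anonymity on age, ZIP code, and sex; the data and the helper name `homogeneous_groups` are assumptions for illustration only. It reports any group whose members all share one sensitive value, which is exactly the case where knowing someone's group reveals their diagnosis.

```python
from collections import defaultdict

# Hypothetical release, already 2-anonymous on (age, zip, sex).
released = [
    {"age": "30-39", "zip": "021**", "sex": "F", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "sex": "F", "diagnosis": "asthma"},
    {"age": "50-59", "zip": "021**", "sex": "M", "diagnosis": "diabetes"},
    {"age": "50-59", "zip": "021**", "sex": "M", "diagnosis": "diabetes"},
]


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Return groups in which every row carries the same sensitive value."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values[key].add(row[sensitive])
    # A group with a single distinct sensitive value leaks that value.
    return {key: next(iter(vals)) for key, vals in values.items() if len(vals) == 1}


leaks = homogeneous_groups(released, ("age", "zip", "sex"), "diagnosis")
print(leaks)  # {('50-59', '021**', 'M'): 'diabetes'}
```

No record in this release stands alone, yet anyone who knows that a man in his fifties from this ZIP region is in the data learns his diagnosis without ever identifying his row.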

For a business reader, k-anonymity is both a useful baseline for data sharing and a cautionary tale: it was a real advance, yet its weaknesses are the reason the field moved toward provable, attacker-agnostic guarantees.
