Labeled Faces in the Wild (LFW)

Labeled Faces in the Wild (LFW) is a face database released as University of Massachusetts Amherst Technical Report 07-49 in October 2007 by Gary Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. It became the benchmark on which a decade of face recognition progress was measured, including the deep-learning breakthroughs of DeepFace and FaceNet.

What made LFW different was the word “wild.” Earlier face databases were shot under controlled conditions: fixed lighting, neutral expressions, frontal poses. LFW instead gathered “previously existing” photographs from news articles on the web, capturing what the authors called natural variability in “pose, lighting, expression, background, race, ethnicity, age, gender, clothing, hairstyles, camera quality, color saturation, focus.” The goal was to study face recognition on the kind of uncontrolled images people actually encounter.

The database contains 13,233 face images of 5,749 different individuals. Of those people, 1,680 have two or more images and the remaining 4,069 appear just once. Faces are stored as 250-by-250 pixel JPEGs. Every face was found by running the Viola-Jones face detector over a large image set, after which false detections and unidentifiable people were removed by hand. The standard task is pair matching, or “face verification”: given two photos, decide whether they show the same person.
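The pair-matching task can be illustrated with a minimal sketch. It assumes each face has already been reduced to a numeric feature vector by some model. The vectors, the cosine-similarity measure, the 0.5 threshold, and all function names here are hypothetical choices for illustration, not part of the LFW report or its official protocol.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_person(emb_a, emb_b, threshold=0.5):
    """Declare a match when similarity reaches the (assumed) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


def verification_accuracy(pairs, threshold=0.5):
    """Fraction of labeled pairs on which the match decision is correct."""
    correct = sum(
        same_person(a, b, threshold) == label for a, b, label in pairs
    )
    return correct / len(pairs)


# Toy two-dimensional "embeddings"; real systems use far larger vectors.
# Each entry is (vector for photo A, vector for photo B, same-person label).
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([0.5, 0.5], [0.6, 0.4], True),
    ([1.0, 0.0], [0.7, 0.7], False),  # similar vectors, different people
]

accuracy = verification_accuracy(toy_pairs)
print(f"accuracy on toy pairs: {accuracy:.2f}")
```

The last toy pair is deliberately misclassified at this threshold, which shows how a benchmark accuracy figure is simply the share of pairs judged correctly.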

LFW’s design carried a built-in limitation that later mattered. Because the photos came from news coverage, the dataset skewed toward whoever appears in the news, leaving it demographically unbalanced, a flaw that fed into broader concerns about bias in face recognition. By the mid-2010s top systems exceeded 99% accuracy on LFW, effectively saturating the benchmark and pushing the field toward harder, larger, and more representative datasets.
