“FaceNet: A Unified Embedding for Face Recognition and Clustering,” by Florian Schroff, Dmitry Kalenichenko, and James Philbin of Google, was published in 2015. Where DeepFace had treated face verification as a classification problem, FaceNet reframed it as learning a geometry: it trained a network to map each face directly into a compact space where distance means similarity.
The central mechanism is the triplet loss. Training operates on triplets of faces: an anchor image, another image of the same person (the positive), and an image of a different person (the negative). The loss pulls the anchor toward the positive and pushes it away from the negative until the negative is farther away than the positive by at least a fixed margin, so that, after training, two photos of the same person land near each other and photos of different people land far apart. The authors paired this with an online “triplet mining” method that selects hard, informative triplets during training rather than sampling them at random. The result is a 128-dimensional embedding for any face.
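To make the triplet loss concrete, here is a minimal sketch in plain Python. It assumes embeddings are unit-length vectors compared by squared Euclidean distance. The function names, the toy 3-dimensional vectors, and the margin value of 0.2 are illustrative choices, not the authors' code, and the triplet mining step is not shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumed here, so distances are comparable)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least `margin`.

    `margin` is an illustrative value, not a recommendation.
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 3-dimensional "embeddings" standing in for real 128-dimensional ones.
anchor = l2_normalize([1.0, 0.0, 0.0])
positive = l2_normalize([0.9, 0.1, 0.0])   # close to the anchor
negative = l2_normalize([0.0, 1.0, 0.0])   # far from the anchor

# A well-separated triplet contributes no loss.
easy_loss = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the margin, so the loss is positive.
hard_loss = triplet_loss(anchor, negative, positive)
```

Only triplets that violate the margin produce a nonzero loss, which is why choosing hard triplets matters: easy ones contribute nothing to learning.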
That design is powerful because the embedding is general-purpose. The same 128-dimensional vectors support verification (are these the same person?), recognition (who is this?), and clustering (group these unlabeled photos by identity) using ordinary distance calculations: a threshold on the distance between two embeddings, a nearest-neighbor lookup, or a standard clustering algorithm, with no need to retrain for each task. On the Labeled Faces in the Wild benchmark FaceNet reached 99.63% accuracy, and on the YouTube Faces database 95.12%, both records at the time.
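The sketch below illustrates how one set of embeddings could serve all three tasks with nothing more than distance computations. It is a simplified illustration under stated assumptions: the 2-dimensional vectors stand in for real embeddings, the threshold of 1.0 and all function names are hypothetical, and the greedy grouping is a simple stand-in for a proper clustering algorithm.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def same_person(emb_a, emb_b, threshold=1.0):
    """Verification: two faces match if their embeddings are close enough.

    `threshold` is a hypothetical value; in practice it would be tuned on held-out data.
    """
    return squared_distance(emb_a, emb_b) < threshold

def identify(query, gallery):
    """Recognition: return the gallery name whose embedding is nearest to the query."""
    return min(gallery, key=lambda name: squared_distance(query, gallery[name]))

def cluster(embeddings, threshold=1.0):
    """Clustering (greedy stand-in): join the first group whose first member is close."""
    groups = []  # each group is a list of indices into `embeddings`
    for i, emb in enumerate(embeddings):
        for members in groups:
            if squared_distance(emb, embeddings[members[0]]) < threshold:
                members.append(i)
                break
        else:
            groups.append([i])
    return groups

# Toy unit-length 2-dimensional "embeddings" for two known identities.
gallery = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
query = [0.96, 0.28]  # a new photo that lies close to alice's embedding

is_alice = same_person(query, gallery["alice"])
is_bob = same_person(query, gallery["bob"])
best_match = identify(query, gallery)
groups = cluster([[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]])
```

None of the three functions touches the network itself. Once the embeddings are computed, every task reduces to comparing vectors.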
FaceNet’s learned-embedding approach became the dominant template for modern face recognition and, more broadly, helped popularize the idea of training networks to produce useful embeddings, a pattern now central to search, recommendation, and retrieval systems far beyond faces.