At the 2001 IEEE Conference on Computer Vision and Pattern Recognition, Paul Viola and Michael Jones presented “Rapid Object Detection using a Boosted Cascade of Simple Features,” the first method to detect faces in images quickly enough to run in real time on ordinary hardware. It combined three ideas that, together, made the problem tractable.
The first was the “integral image,” a precomputed table of cumulative pixel sums. With it, the sum of the pixels inside any rectangle takes only four array lookups, so the detector can evaluate simple rectangular light-and-dark patterns (Haar-like features) in constant time at any position and scale. The second was the use of AdaBoost, a boosting algorithm, to select a small number of the most informative features from a vast pool and combine them into a strong classifier. The third was an “attentional cascade”: a chain of progressively more demanding classifiers in which the early, cheap stages immediately reject the background regions that make up the overwhelming majority of an image, so that expensive computation is spent only on promising areas.
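The integral image and the cascade's early rejection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature layout, and the idea of passing stages as plain callables are assumptions made for illustration, and the AdaBoost training that would choose the features and thresholds is left out entirely.

```python
import numpy as np

def integral_image(img):
    """Zero-padded summed-area table: ii[y, x] holds the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using exactly four table lookups."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left])

def two_rect_feature(ii, top, left, height, width):
    """Hypothetical two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

def cascade_accepts(ii, top, left, stages):
    """Run stages in order and reject at the first failure.

    Each stage is a callable (ii, top, left) -> bool; in a trained detector it
    would be a boosted classifier, here it is whatever the caller supplies.
    """
    for stage in stages:
        if not stage(ii, top, left):
            return False  # later, more expensive stages are never evaluated
    return True
```

A stage in this sketch could be as simple as `lambda ii, t, l: two_rect_feature(ii, t, l, 24, 24) > threshold`, where `threshold` is a made-up value. Ordering the stages from cheapest to most expensive is what lets most windows be discarded after very little work.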
The result was fast enough for video: the paper reports about 15 frames per second on 384 by 288 pixel images using a 700 MHz Pentium III. The approach spread quickly into consumer products. The face-detection box that appears around a subject in a digital camera or phone viewfinder, the basis of autofocus and auto-exposure on faces, descends from this work. Viola-Jones remained the default face detector for roughly a decade until deep learning approaches surpassed it, and it stands as a model of how clever engineering can turn a computationally expensive research problem into something that runs everywhere.