“Rich feature hierarchies for accurate object detection and semantic segmentation,” posted to arXiv in November 2013 by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik of UC Berkeley, carried the deep-learning revolution from image classification into object detection. Where AlexNet had shown a convolutional network could say what is in an image, R-CNN tackled the harder problem of finding where multiple objects are and drawing boxes around them.
The system, named R-CNN for “Regions with CNN features,” works in stages. It first generates roughly two thousand candidate “region proposals” (boxes that might contain an object) using a separate bottom-up method, selective search in the paper's experiments. Each region is then warped to a fixed size and passed through a convolutional network to extract a feature vector, and a set of class-specific linear support vector machines (SVMs) scores that vector to decide what, if anything, the region contains. Crucially, the authors showed that pretraining the network on the large ImageNet classification dataset and then fine-tuning it for detection worked far better than training from scratch on scarce labeled detection data.
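The staged pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: `propose_regions` draws random boxes as a stand-in for the bottom-up proposal method, `extract_features` uses an untrained random projection in place of the fine-tuned CNN, and the SVM weights are random rather than trained. All function names, the 32-pixel patch size, and the 64-dimensional feature size are assumptions chosen to keep the example small.

```python
import numpy as np


def propose_regions(image, n=2000, rng=None):
    """Placeholder proposal stage: n random boxes as (x0, y0, x1, y1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    x0 = rng.integers(0, w - 1, size=n)
    y0 = rng.integers(0, h - 1, size=n)
    x1 = rng.integers(x0 + 1, w + 1)  # exclusive right edge, width >= 1
    y1 = rng.integers(y0 + 1, h + 1)  # exclusive bottom edge, height >= 1
    return np.stack([x0, y0, x1, y1], axis=1)


def warp(image, box, size=32):
    """Crop the box and resize it to size x size, ignoring aspect ratio."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    rows = np.arange(size) * crop.shape[0] // size  # nearest-neighbour rows
    cols = np.arange(size) * crop.shape[1] // size  # nearest-neighbour cols
    return crop[rows][:, cols]


def extract_features(patches, projection):
    """Stand-in for the CNN: flatten, fixed linear map, ReLU."""
    flat = patches.reshape(len(patches), -1).astype(np.float32) / 255.0
    return np.maximum(flat @ projection, 0.0)


def classify(features, svm_w, svm_b):
    """One linear SVM per class: a score for every (region, class) pair."""
    return features @ svm_w.T + svm_b


def detect(image, projection, svm_w, svm_b, n_proposals=2000, rng=None):
    """Run the stages in order: propose, warp, featurize, score."""
    boxes = propose_regions(image, n_proposals, rng)
    patches = np.stack([warp(image, b) for b in boxes])  # one crop per box
    scores = classify(extract_features(patches, projection), svm_w, svm_b)
    return boxes, scores


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)  # fake image
projection = rng.standard_normal((32 * 32 * 3, 64)).astype(np.float32)
svm_w = rng.standard_normal((20, 64)).astype(np.float32)  # 20 PASCAL VOC classes
svm_b = np.zeros(20, dtype=np.float32)

boxes, scores = detect(image, projection, svm_w, svm_b, rng=rng)
best_class = scores.argmax(axis=1)     # most likely class per region
is_object = scores.max(axis=1) > 0.0   # all-negative scores mean "background"
print(boxes.shape, scores.shape)
```

With random weights the scores are meaningless, but the data flow matches the description: every proposal is warped and featurized independently, so the per-image cost grows linearly with the number of proposals.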
R-CNN raised mean average precision (mAP) on the PASCAL VOC 2012 benchmark to 53.3 percent, more than 30 percent better in relative terms than the previous best. It was slow, because running a CNN on two thousand regions per image is expensive, and the authors and others soon produced faster successors (Fast R-CNN, Faster R-CNN) and single-pass detectors like YOLO. But R-CNN established the template for modern detection and demonstrated that learned features plus transfer learning beat hand-engineered pipelines decisively.
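The relative-improvement figure quoted for VOC 2012 can be unpacked with a little arithmetic. The sketch below uses only the two numbers stated above (53.3 percent mAP and a relative gain of more than 30 percent) to bound the previous best. It does not use the actual prior score, and the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_score, relative_gain):
    """Return the baseline b for which new_score == b * (1 + relative_gain)."""
    return new_score / (1.0 + relative_gain)


# A gain of MORE than 30% means the true baseline lies BELOW this ceiling.
ceiling = implied_baseline(53.3, 0.30)
min_absolute_gain = 53.3 - ceiling

print(f"previous best mAP was below {ceiling:.1f} percent")
print(f"absolute gain was more than {min_absolute_gain:.1f} points")
```

In other words, a relative gain above 30 percent corresponds to an absolute jump of more than about 12 mAP points, which is why the result stood out on that benchmark.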