“You Only Look Once: Unified, Real-Time Object Detection,” posted to arXiv in June 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, rethought how a neural network finds objects in an image. Earlier detectors such as R-CNN ran a classifier over thousands of candidate regions, which was accurate but slow. YOLO instead treats detection as a single regression problem solved in one forward pass of the network.
The model divides the image into a grid and, for each grid cell, predicts a small number of bounding boxes, a confidence score for each box, and a set of class probabilities, all at once. Because the whole image is processed in a single pass (hence “you only look once”), the network reasons about global context and runs very fast. The base model processed images at 45 frames per second on a Titan X GPU, and a smaller “Fast YOLO” reached 155 frames per second, comfortably fast enough for live video.
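To make the grid-cell prediction concrete, here is a minimal Python sketch of how one cell's output could be decoded. The settings S = 7, B = 2, and C = 20 are the ones the paper used for PASCAL VOC, which give a 7 × 7 × 30 output tensor. The paper also defines a box's class score as its confidence multiplied by the cell's class probability. The names `output_shape`, `decode_cell`, and `Detection`, and the ordering of values inside the cell vector, are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# Paper's PASCAL VOC settings: 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20


def output_shape(s=S, b=B, c=C):
    """Each cell holds B boxes of (x, y, w, h, confidence) plus C class probabilities."""
    return (s, s, b * 5 + c)


@dataclass
class Detection:
    cx: float      # box centre x, as a fraction of image width
    cy: float      # box centre y, as a fraction of image height
    w: float       # box width, as a fraction of image width
    h: float       # box height, as a fraction of image height
    class_id: int
    score: float   # box confidence * class probability


def decode_cell(cell, row, col, s=S, b=B, c=C):
    """Decode one cell vector into B detections.

    Assumed (hypothetical) layout: [box_0 (5 values), ..., box_{B-1} (5 values),
    class probabilities (C values)].
    """
    class_probs = cell[b * 5:b * 5 + c]
    # Class probabilities are shared by every box in the cell.
    best = max(range(c), key=lambda i: class_probs[i])
    detections = []
    for k in range(b):
        x, y, w, h, conf = cell[k * 5:(k + 1) * 5]
        detections.append(Detection(
            cx=(col + x) / s,  # x, y are offsets within the cell
            cy=(row + y) / s,
            w=w,               # w, h are relative to the whole image
            h=h,
            class_id=best,
            score=conf * class_probs[best],
        ))
    return detections


# Example: the centre cell of the grid, with one confident box of class 3.
probs = [0.0] * C
probs[3] = 0.8
cell = [0.5, 0.5, 0.2, 0.4, 0.9] + [0.0] * 5 + probs
detections = decode_cell(cell, row=3, col=3)
```

In a full detector, the same decoding would run over all 49 cells, after which low-scoring boxes are discarded and overlapping ones merged with non-maximum suppression.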
That speed made real-time detection practical for applications that could not wait for slower pipelines: robotics, surveillance, sports analytics, and later driver-assistance and autonomous systems. The original YOLO traded a little accuracy for its speed, especially on small or clustered objects, but successive versions narrowed the gap, and “YOLO” became a whole family of detectors. The paper marked the point where object detection became something you could simply run on a video stream rather than a heavyweight batch process.