Computer Vision

Computer vision is the field that gets computers to make sense of images and video: recognizing objects, faces, scenes, and actions, and ultimately describing what is happening in a picture. It is one of the oldest goals in AI, and one that was badly underestimated at the start.

The field’s early optimism is captured by the 1966 Summer Vision Project at MIT, which proposed to make real progress on machine vision over a single summer. In reality, seeing turned out to be extraordinarily hard. Over the following decades, researchers built theories of vision (David Marr’s influential 1982 book “Vision” framed it as a computational problem with distinct levels of analysis) and relied on hand-engineered features: hand-designed rules for detecting edges, corners, and textures that human experts tuned for each task.
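To make "hand-engineered feature" concrete, here is a minimal sketch of a hand-designed edge detector in the style of the classic Sobel operator. The kernel weights are fixed by a person rather than learned from data. The function name, the toy image, and the variable names are illustrative assumptions, not taken from any particular historical system.

```python
# A hand-designed 3x3 kernel (Sobel-style) that responds to vertical edges,
# i.e. places where brightness changes from left to right.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def filter_response(image, kernel):
    """Slide a 3x3 kernel over a 2D grayscale image (list of rows).

    Computes a plain cross-correlation over the region where the kernel
    fits entirely inside the image, so the output is 2 rows and 2 columns
    smaller than the input.
    """
    height, width = len(image), len(image[0])
    output = []
    for r in range(height - 2):
        row = []
        for c in range(width - 2):
            total = 0
            for kr in range(3):
                for kc in range(3):
                    total += kernel[kr][kc] * image[r + kr][c + kc]
            row.append(total)
        output.append(row)
    return output


# Toy image (hypothetical data): dark on the left, bright on the right.
IMAGE = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

EDGES = filter_response(IMAGE, SOBEL_X)
for row in EDGES:
    print(row)
```

The response is zero over the flat regions and large where the dark-to-bright boundary falls inside the kernel window. Every number in the kernel was chosen by a person, which is the kind of per-task tuning that the hand-engineered era depended on.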

The turning point was data. In 2009 Fei-Fei Li and colleagues introduced ImageNet, a labeled image database far larger than anything before it; the CVPR 2009 paper documents its scale and design. An annual competition built on ImageNet, first held in 2010, turned the database into a public benchmark, and in 2012 a deep convolutional neural network called AlexNet won by a wide margin, with a top-5 error rate of roughly 15 percent against about 26 percent for the runner-up. That result convinced the field that features learned by deep networks, trained on large datasets with GPUs, beat hand-crafted ones, and it set off the deep learning boom.

Today computer vision has merged into multimodal AI, where a single model handles images and text together, describing photos, answering questions about diagrams, or generating images from descriptions. For business readers, computer vision powers everything from medical imaging and quality inspection to self-checkout and document scanning, and its history shows the same pattern as the rest of AI: the data-driven approach eventually overtook decades of hand-built expertise.

Last verified June 6, 2026