“Very Deep Convolutional Networks for Large-Scale Image Recognition” was submitted to arXiv in September 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford, which is where the network’s nickname, VGG, comes from. It studied a single question: how much does sheer depth matter for image recognition?
The answer: a great deal. The authors fixed almost everything except depth, building networks entirely from small 3 by 3 convolution filters stacked on top of each other, and pushed the number of weight layers to 16 and 19. The insight was that two stacked 3 by 3 layers cover the same receptive field as one 5 by 5 layer but, at the same channel width, use fewer parameters and add an extra nonlinearity, so a deep stack of tiny filters is both cheaper and more expressive. The result was a clean, uniform architecture that took first place in the localization track and second place in the classification track of the 2014 ImageNet challenge.
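The arithmetic behind the stacked-filter claim can be checked in a few lines. This is only an illustrative sketch: it assumes stride-1 convolutions, the same number of input and output channels in every layer, and no bias terms. The function names and the channel width of 64 are chosen for the example, not taken from the paper.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions.

    Each extra layer widens the field by (kernel - 1) pixels.
    """
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (biases ignored) for a stack of square-kernel conv layers.

    Assumes every layer maps `channels` inputs to `channels` outputs.
    """
    return num_layers * kernel * kernel * channels * channels


channels = 64  # hypothetical width, used only for illustration

two_3x3 = conv_weights(kernel=3, channels=channels, num_layers=2)
one_5x5 = conv_weights(kernel=5, channels=channels, num_layers=1)

print("receptive field of two 3x3 layers:", stacked_receptive_field(2))  # 5
print("weights, two 3x3 layers:", two_3x3)   # 73728
print("weights, one 5x5 layer: ", one_5x5)   # 102400
print("ratio:", two_3x3 / one_5x5)           # 0.72
```

The ratio is 18/25 whatever the channel width, so under these assumptions the stacked version always uses 28% fewer weights while applying two nonlinearities instead of one.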
VGG’s lasting influence comes as much from its simplicity as from its scores. Because the design is so regular and the pretrained weights were released openly, VGG became a default feature extractor and a building block for countless later systems, from object detectors to style-transfer demos. Its main drawback, a very large parameter count concentrated in the final fully connected layers, helped motivate the more efficient architectures that followed.
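To see how lopsided the parameter budget is, the count can be tallied by hand. The sketch below assumes the 16-layer variant in its commonly used form: thirteen 3 by 3 convolutions with 2 by 2 max pooling after each block, a 224 by 224 RGB input, and three fully connected layers of 4096, 4096, and 1000 units. The layer list and function name are written out for illustration rather than loaded from any library.

```python
# Assumed layout of the 16-layer network: integers are output channels of a
# 3x3 convolution, "M" marks a 2x2 max pool that halves the spatial size.
CONV_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_WIDTHS = (4096, 4096)  # hidden fully connected layers


def count_parameters(input_size=224, in_channels=3, num_classes=1000):
    """Return (conv_params, fc_params), including biases."""
    conv_params = 0
    size, channels = input_size, in_channels
    for item in CONV_LAYOUT:
        if item == "M":
            size //= 2  # pooling has no parameters, only shrinks the map
        else:
            conv_params += 3 * 3 * channels * item + item  # weights + biases
            channels = item

    fc_params = 0
    features = channels * size * size  # flattened feature map: 512 * 7 * 7
    for width in (*FC_WIDTHS, num_classes):
        fc_params += features * width + width
        features = width
    return conv_params, fc_params


conv, fc = count_parameters()
print(f"convolutional layers:   {conv:,}")
print(f"fully connected layers: {fc:,}")
print(f"fully connected share:  {fc / (conv + fc):.1%}")
```

Under these assumptions the fully connected layers hold roughly 124 million of about 138 million parameters, and the first of them alone (25,088 inputs to 4,096 outputs) accounts for over 100 million.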
For a general reader, VGG is the moment the field internalized a simple lesson: depth, done with disciplined repetition of a small building block, is a lever worth pulling, a principle that carried straight into the residual networks and transformers that came after.