“ImageNet Classification with Deep Convolutional Neural Networks” was presented at the NeurIPS conference (then called NIPS) in December 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton of the University of Toronto. The network it described, universally known as AlexNet, is widely treated as the spark of the modern deep learning era.
The network was entered in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a contest to classify photographs into a thousand categories. AlexNet was a deep convolutional neural network with eight learned layers and about 60 million parameters, far larger than earlier practical CNNs. It won by a startling margin: a top-5 error rate of 15.3%, against 26.2% for the second-best entry, at a time when most competing methods relied on hand-engineered features. The gap was large enough that the result was hard to dismiss as a fluke.
What was new was less a single idea than a combination that finally worked at scale. The authors trained on two GPUs, which made a network of this size feasible to train at all. They used the ReLU activation function to speed up learning, dropout to reduce overfitting, and data augmentation to stretch the training set. Together these choices showed the field that deep neural networks, given enough data and compute, could beat decades of carefully crafted computer vision pipelines.
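For readers who want to see how small these ingredients are, here is a minimal sketch in plain Python. It is an illustration and not the authors' code: the function names, the list-based data layout, and the default dropout rate are assumptions made for this example. The dropout shown is the now-common "inverted" variant, which rescales the surviving values during training, whereas the original paper instead multiplied outputs by 0.5 at test time. Horizontal flipping stands in for the paper's fuller augmentation scheme, which also used random crops and color perturbation.

```python
import random


def relu(values):
    """ReLU: pass positive inputs through unchanged, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, rate=0.5, rng=None, training=True):
    """Zero each value with probability `rate` while training.

    Assumption: this is "inverted" dropout, so survivors are scaled by
    1 / (1 - rate) and nothing has to change at test time.
    """
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows (left becomes right)."""
    return [list(reversed(row)) for row in image]


if __name__ == "__main__":
    activations = relu([-1.0, 0.0, 2.5])              # negatives become 0.0
    thinned = dropout(activations, rate=0.5, rng=random.Random(0))
    mirrored = horizontal_flip([[1, 2, 3], [4, 5, 6]])
    print(activations, thinned, mirrored)
```

None of the three is complicated on its own, which is the point of the paragraph above: the result came from combining them in a large network with enough data and compute.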
In fairness, AlexNet was an engineering and empirical breakthrough more than a theoretical one: most of its components existed beforehand. But the demonstration was decisive. Within a few years nearly all serious computer vision research had moved to deep learning, the major labs were racing to hire neural network researchers, and the GPU vendor Nvidia found itself at the center of an industry it had not originally targeted.