“Deep Residual Learning for Image Recognition” was submitted to arXiv in December 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research. The architecture it introduced, ResNet, made it practical to train networks far deeper than before and won the ImageNet classification competition (ILSVRC) that year.
The problem it solved was counterintuitive. By 2015 researchers expected deeper networks to be more powerful, yet in practice, beyond a certain depth, adding layers made networks harder to train and raised even their training error: a degradation problem, not simply overfitting. The authors’ fix was the residual connection, also called a skip connection. Instead of asking each block of layers to learn a full transformation, they asked it to learn only the change, or residual, relative to its input, and added a shortcut that carries the input straight through and adds it to the block's output. This makes it easy for a block to act as an identity mapping when nothing more is needed, which keeps very deep networks trainable.
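The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`relu`, `linear`, `residual_branch`, `residual_block`, `plain_block`) are hypothetical, the branch is a toy pair of linear layers where the paper uses convolutions, and the ReLU that the paper applies after the addition is left out to keep the identity case easy to see.

```python
# Minimal sketch of a residual block on plain Python lists (no framework).
# All function names are illustrative assumptions, not taken from the paper.

def relu(v):
    """Elementwise max(0, a)."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_branch(w1, w2, x):
    """F(x): two linear layers with a ReLU in between."""
    return linear(w2, relu(linear(w1, x)))

def plain_block(w1, w2, x):
    """A block without a shortcut: the output is F(x) alone."""
    return residual_branch(w1, w2, x)

def residual_block(w1, w2, x):
    """y = x + F(x): the shortcut adds the input back onto the branch output."""
    fx = residual_branch(w1, w2, x)
    return [a + b for a, b in zip(x, fx)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]  # weights that make F(x) = 0

print(residual_block(zeros, zeros, x))  # the input passes through unchanged
print(plain_block(zeros, zeros, x))     # the input is lost
```

With all weights at zero the residual block returns its input exactly, while the plain block returns zeros. Doing nothing is therefore something a residual block can reach by driving its weights toward zero, whereas a plain block would have to learn an exact identity through its nonlinear layers.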
The result was dramatic. The authors trained networks with up to 152 layers on ImageNet, and even one with over a thousand layers on CIFAR-10, and these deep ResNets set new records on ImageNet and related benchmarks. The skip connection turned out to be a general principle, not a vision-only trick: it appears in the Transformer and in most large models trained since, because the shortcut gives gradients a direct path through deep stacks of layers.
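The gradient argument can be made concrete with a short derivation. It is a sketch under simplifying assumptions: pure identity shortcuts, no nonlinearity after the addition, and symbols ($x_l$, $F_l$, $\mathcal{L}$, $N$) chosen here for illustration.

```latex
% One residual block: output is input plus learned residual.
x_{l+1} = x_l + F_l(x_l)

% Unrolling N blocks: the last activation is the first plus a sum of residuals.
x_N = x_0 + \sum_{l=0}^{N-1} F_l(x_l)

% Chain rule: the gradient at the input has an identity term
% that does not pass through any weight layer.
\frac{\partial \mathcal{L}}{\partial x_0}
  = \frac{\partial \mathcal{L}}{\partial x_N}
    \left( I + \frac{\partial}{\partial x_0} \sum_{l=0}^{N-1} F_l(x_l) \right)
```

The identity term $I$ means that part of the gradient reaches the earliest layers unchanged, however many blocks lie in between. In a plain stack, by contrast, the gradient is a product of $N$ layer Jacobians and can shrink toward zero as depth grows.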
There is little to mark against the paper: its central idea proved both simple and durable. If anything, ResNet is a reminder that a small architectural insight, well placed, can matter as much as scale: a single shortcut connection removed a barrier that had been holding the whole field back.