“On the Importance of Initialization and Momentum in Deep Learning,” by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, was presented at the International Conference on Machine Learning (ICML) in 2013. It made the case that two unglamorous ingredients, the values the weights start from and the momentum applied during optimization, matter far more than the community had assumed.
Momentum modifies plain gradient descent by accumulating a velocity: instead of stepping purely in the current gradient’s direction, the optimizer keeps an exponentially decaying sum of past gradients, which lets it build speed along directions where the gradient is consistent and damps oscillations across narrow valleys. The authors showed that deep and recurrent networks, long thought to need sophisticated second-order optimization, can in fact be trained well by stochastic gradient descent with momentum, including Nesterov’s accelerated gradient, which evaluates the gradient at the point the current velocity is about to reach rather than at the current parameters. Two conditions must be met: a carefully chosen random initialization, and a schedule that slowly increases the momentum coefficient over training.
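As a rough illustration of the update described above, the sketch below applies classical and Nesterov momentum to a toy two-dimensional quadratic shaped like a narrow valley. The objective, the learning rate, the linear momentum ramp, and the function names are all assumptions chosen for illustration; they are not the paper's experimental settings or its exact schedule.

```python
def grad(x, y):
    """Gradient of the toy objective f(x, y) = 0.5 * (x**2 + 25 * y**2).

    The y direction is 25 times steeper than x, giving a narrow valley.
    """
    return x, 25.0 * y


def momentum_schedule(step, total_steps, mu_start=0.5, mu_max=0.9):
    """Hypothetical schedule: ramp momentum linearly from mu_start to mu_max
    over the first half of training, then hold it at mu_max."""
    frac = min(step / (0.5 * total_steps), 1.0)
    return mu_start + (mu_max - mu_start) * frac


def run_momentum(nesterov, steps=500, lr=0.02):
    """Minimize the toy objective and return the final (x, y)."""
    x, y = 10.0, 1.0      # starting point (arbitrary, for illustration)
    vx, vy = 0.0, 0.0     # velocity starts at zero
    for t in range(steps):
        mu = momentum_schedule(t, steps)
        if nesterov:
            # Nesterov: take the gradient at the look-ahead point
            # (current position plus the decayed velocity).
            gx, gy = grad(x + mu * vx, y + mu * vy)
        else:
            # Classical momentum: take the gradient at the current position.
            gx, gy = grad(x, y)
        # Velocity is a decayed sum of past gradient steps.
        vx = mu * vx - lr * gx
        vy = mu * vy - lr * gy
        # Parameters move along the velocity, not the raw gradient.
        x += vx
        y += vy
    return x, y


if __name__ == "__main__":
    print("classical:", run_momentum(nesterov=False))
    print("nesterov: ", run_momentum(nesterov=True))
```

The two variants differ only in where the gradient is evaluated. Both runs should end close to the minimum at the origin on this toy problem, which says nothing by itself about behavior on real networks.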
Crucially, both ingredients were necessary together. A poorly initialized network could not be rescued by momentum, and a well-initialized network performed markedly worse when momentum was absent or poorly tuned. The implication was that many earlier failures to train deep models from random initializations had been blamed on the wrong culprit; the real issue was often initialization, not a fundamental limit of first-order methods.
For a general reader, the paper underscores a practical truth that runs through machine learning: the difference between a model that trains and one that does not often comes down to seemingly minor setup choices rather than the headline algorithm.