The Marginal Value of Adaptive Gradient Methods in Machine Learning

“The Marginal Value of Adaptive Gradient Methods in Machine Learning” was submitted to arXiv on May 23, 2017 by Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. It pushed back against the field’s growing reflex of reaching for adaptive optimizers like Adam by default, and it framed a debate that is still alive.

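Adaptive methods such as AdaGrad, RMSProp, and Adam give each parameter its own learning rate based on the history of its gradients, and they often drive the training loss down quickly. The authors asked whether that head start translates into better performance on unseen data. They showed that on simple overparameterized problems, adaptive methods often converge to solutions drastically different from those found by plain gradient descent or stochastic gradient descent (SGD). In one constructed binary classification problem with linearly separable data, SGD reaches zero test error while AdaGrad, Adam, and RMSProp reach test errors arbitrarily close to one half. In their experiments on several deep learning models, the solutions found by adaptive methods generalized worse than those found by SGD, often significantly, even when the adaptive methods performed better on the training set.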
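The idea that two optimizers can both fit the training data and still end up at different solutions can be seen on a toy problem. The sketch below is not the paper's construction or any of its experiments. It is an illustrative underdetermined least-squares problem with more features than samples, where the data, the sizes, the step sizes, and the function name `compare_optimizers` are all assumptions chosen for the example. Gradient descent started from zero is known to converge to the minimum-norm solution of such a problem; AdaGrad's per-coordinate scaling generally leads somewhere else.

```python
import numpy as np


def compare_optimizers(n=20, d=100, steps=20000, seed=0):
    """Fit an underdetermined least-squares problem with GD and AdaGrad.

    The random data, problem sizes, and step sizes are illustrative
    assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))  # n samples, d > n features
    y = rng.standard_normal(n)

    def grad(w):
        # Gradient of the loss ||Xw - y||^2 / (2n).
        return X.T @ (X @ w - y) / n

    # Plain gradient descent from zero with step size 1/L, where L is the
    # largest eigenvalue of the Hessian X^T X / n.
    L = np.linalg.norm(X, 2) ** 2 / n
    w_gd = np.zeros(d)
    for _ in range(steps):
        w_gd -= grad(w_gd) / L

    # AdaGrad from zero: each coordinate's step is divided by the root of
    # that coordinate's accumulated squared gradients.
    lr, eps = 0.02, 1e-10
    w_ada = np.zeros(d)
    g_sq = np.zeros(d)
    for _ in range(steps):
        g = grad(w_ada)
        g_sq += g * g
        w_ada -= lr * g / (np.sqrt(g_sq) + eps)

    # Minimum-norm solution that fits the training data exactly.
    w_min = np.linalg.pinv(X) @ y

    return {
        "gd_residual": float(np.max(np.abs(X @ w_gd - y))),
        "adagrad_residual": float(np.max(np.abs(X @ w_ada - y))),
        "gd_dist_to_min_norm": float(np.linalg.norm(w_gd - w_min)),
        "adagrad_dist_to_min_norm": float(np.linalg.norm(w_ada - w_min)),
        "gd_norm": float(np.linalg.norm(w_gd)),
        "adagrad_norm": float(np.linalg.norm(w_ada)),
    }


if __name__ == "__main__":
    for name, value in compare_optimizers().items():
        print(f"{name}: {value:.3e}")
```

In this toy setting both runs drive the training residual toward zero, so the training loss alone cannot tell them apart, yet only the gradient descent solution coincides with the minimum-norm one. Whether that difference helps or hurts on held-out data depends on the problem; the sketch only illustrates that the two solutions differ.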

The paper’s conclusion was deliberately provocative: practitioners should reconsider the use of adaptive methods to train neural networks rather than assume they are strictly better, and carefully tuned SGD with momentum often deserves to be the baseline. These results helped motivate later work, including AdamW’s decoupled weight decay, aimed at recovering good generalization from adaptive optimizers.

For a general reader, the paper is a cautionary tale about convenient defaults: a tool that flatters the visible metric (how quickly the training loss falls) can quietly cost you on the metric that actually matters (performance on data the model has never seen).

Last verified June 7, 2026