“Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” by John Duchi, Elad Hazan, and Yoram Singer, was published in the Journal of Machine Learning Research in 2011. It introduced AdaGrad, one of the first widely used optimizers to give every parameter its own learning rate, and it set the template that later methods like RMSprop and Adam refined.
The key idea is to adapt to the geometry of the data seen so far. AdaGrad accumulates, for each parameter, the sum of the squares of all its past gradients, then divides that parameter’s learning rate by the square root of this running total. Parameters that have received large or frequent gradients get their steps damped down, while parameters tied to rare features keep larger steps. This makes the method especially good at picking up “very predictive but rarely seen features,” a common situation in text and other sparse, high-dimensional data.
The paper also provided formal regret guarantees, showing the method performs provably as well as the best fixed proximal function chosen in hindsight, and demonstrated that these adaptive methods outperformed the non-adaptive subgradient algorithms that were state of the art at the time. AdaGrad’s one practical weakness, that its ever-growing accumulator can shrink learning rates toward zero, directly motivated the moving-average fixes in RMSprop and Adam.
For a general reader, AdaGrad illustrates a recurring theme in machine learning: giving an algorithm a bit of memory about its own past behavior, here the history of gradients, can make it markedly better at learning from messy real-world data.