Adam: A Method for Stochastic Optimization

“Adam: A Method for Stochastic Optimization” was submitted to arXiv on December 22, 2014 by Diederik P. Kingma and Jimmy Ba. It introduced an optimization algorithm that, within a few years, became one of the most widely used default choices for training deep neural networks.

Training a neural network means repeatedly adjusting its parameters to reduce error, using the gradient of the error with respect to each parameter, computed by backpropagation. The negative gradient points in the direction of steepest improvement. Plain stochastic gradient descent steps in that direction, scaling the gradient by a single learning rate shared by all parameters. Adam instead keeps running estimates of two quantities for each parameter: an exponentially decaying average of recent gradients (the first moment) and an exponentially decaying average of recent squared gradients (the second moment). It divides the first estimate by the square root of the second to give every parameter its own adaptive step size, taking relatively larger steps for parameters with small, consistent gradients and smaller steps for noisy or large ones. The name is short for “adaptive moment estimation.”
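
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names adam_step and run_demo, the list-based parameter layout, and the toy objective are assumptions made for this example. The default values (step size 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are the ones suggested in the paper, and the bias-correction terms compensate for both moment estimates starting at zero.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to lists of floats.

    m and v hold the running first- and second-moment estimates,
    and t is the 1-based step count used for bias correction.
    Returns the updated (params, m, v) as new lists.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponentially decaying averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Both averages start at zero, so rescale them early in training.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: average gradient over its typical magnitude.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def run_demo(steps, lr=0.01):
    """Hypothetical toy problem: minimize f(x, y) = x^2 + 100 * y^2."""
    params = [1.0, 1.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [2.0 * params[0], 200.0 * params[1]]
        params, m, v = adam_step(params, grads, m, v, t, lr=lr)
    return params
```

On the first step of this toy problem both coordinates move by roughly the step size, even though their gradients differ by a factor of 100. That is the per-parameter scaling at work: the size of each step depends on how consistent the gradient is, not on its raw magnitude.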

The paper’s appeal was practical. Adam needs little memory, is computationally cheap, copes well with sparse gradients and non-stationary objectives, and works reasonably out of the box with its default settings, sparing practitioners much of the learning-rate tuning that earlier methods demanded. The authors also described AdaMax, a variant based on the infinity norm.
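
As a rough sketch of how the infinity-norm variant differs, AdaMax replaces the running average of squared gradients with an exponentially weighted running maximum of gradient magnitudes. The function name adamax_step and the list-based layout are assumptions for this example, and the guard against a zero denominator is an addition here, not part of the paper's formulation. The defaults (step size 0.002, decay rates 0.9 and 0.999) follow the paper's suggestions.

```python
def adamax_step(params, grads, m, u, t,
                lr=0.002, beta1=0.9, beta2=0.999):
    """Apply one AdaMax update to lists of floats.

    m holds the running first-moment estimate, u the exponentially
    weighted infinity norm, and t is the 1-based step count.
    Returns the updated (params, m, u) as new lists.
    """
    new_params, new_m, new_u = [], [], []
    for p, g, m_i, u_i in zip(params, grads, m, u):
        # First moment: same decaying average of gradients as in Adam.
        m_i = beta1 * m_i + (1 - beta1) * g
        # Infinity norm: decayed running maximum of |gradient|.
        u_i = max(beta2 * u_i, abs(g))
        # Assumption for this sketch: skip the update while u is still
        # zero (every gradient so far was zero) to avoid dividing by zero.
        if u_i > 0.0:
            # Only the first moment needs bias correction.
            p = p - (lr / (1 - beta1 ** t)) * m_i / u_i
        new_params.append(p)
        new_m.append(m_i)
        new_u.append(u_i)
    return new_params, new_m, new_u
```

Because the denominator is a running maximum rather than a square root of an average, each parameter's step stays bounded by the step size in magnitude.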

For a general reader, Adam is a good example of how a single well-chosen default can shape an entire field: by making training more robust and reducing the need for manual tuning, it removed a major source of friction and helped deep learning spread from specialists to a far wider set of practitioners.
