RMSprop

RMSprop is an optimization algorithm that adapts the learning rate for each parameter of a neural network. It is a standard tool for training, and its design fed directly into Adam, the optimizer that later became ubiquitous.

RMSprop has an unusual provenance: it was never published as a formal paper. Geoffrey Hinton introduced it in 2012 in lecture six of his Coursera course on neural networks, and the slides from that lecture remain the source usually cited for the method. The idea was a fix for a weakness in AdaGrad. AdaGrad divides each parameter’s learning rate by the square root of the sum of all its past squared gradients; because that sum only ever grows, the effective learning rate keeps shrinking and training eventually stalls. RMSprop replaces the cumulative sum with an exponentially decaying moving average of squared gradients, so old gradients are gradually forgotten. The “RMS” stands for root mean square: each step is divided by the recent root-mean-square magnitude of that parameter’s gradient.
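
The update described above can be sketched for a single scalar parameter as follows. This is a minimal illustration, not a reference implementation: the function name rmsprop_step, the default values for lr, decay and eps, and the toy objective f(x) = x^2 are all assumptions chosen for the example.

```python
import math

def rmsprop_step(param, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update for a single scalar parameter (illustrative sketch).

    Returns the updated parameter and the updated running average
    of squared gradients.
    """
    # Exponentially decaying average of squared gradients.
    # AdaGrad would instead accumulate: avg_sq = avg_sq + grad * grad
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    # Divide the step by the recent root-mean-square gradient magnitude;
    # eps guards against division by zero.
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq

# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
x, avg_sq = 5.0, 0.0
for _ in range(2000):
    x, avg_sq = rmsprop_step(x, 2.0 * x, avg_sq, lr=0.01)
```

In a real network the same arithmetic is applied element-wise, with one running average kept per parameter.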

The practical effect is that parameters with large, volatile gradients take smaller, steadier steps, while parameters with consistently small gradients take relatively larger ones. Unlike in AdaGrad, the effective learning rate is not forced toward zero as training proceeds, because the moving average can fall again when gradients become small. This makes RMSprop well suited to the noisy, non-stationary objectives common in deep learning, such as those that arise when training recurrent networks.
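
The contrast with AdaGrad can be made concrete with a small numerical sketch. The helper effective_rates below is hypothetical, as are its defaults; it feeds the same gradient sequence to both accumulators and records the factor that multiplies the gradient at each step.

```python
import math

def effective_rates(grads, lr=0.01, decay=0.9, eps=1e-8):
    """Per-step effective learning rates under AdaGrad and RMSprop (sketch)."""
    total_sq = 0.0  # AdaGrad: sum of all squared gradients so far
    avg_sq = 0.0    # RMSprop: decaying average of squared gradients
    adagrad, rmsprop = [], []
    for g in grads:
        total_sq += g * g
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g
        adagrad.append(lr / (math.sqrt(total_sq) + eps))
        rmsprop.append(lr / (math.sqrt(avg_sq) + eps))
    return adagrad, rmsprop

# A constant gradient of 1.0 for 1000 steps.
adagrad, rmsprop = effective_rates([1.0] * 1000)
print(adagrad[0], adagrad[-1])  # AdaGrad's rate keeps shrinking, like lr / sqrt(t)
print(rmsprop[-1])              # RMSprop's rate settles close to lr
```

With a constant gradient, AdaGrad's sum grows linearly with the step count, so its rate falls in proportion to one over the square root of the number of steps, while RMSprop's average levels off and its rate stays near the base learning rate.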

For a general reader, RMSprop is a small but telling piece of machine learning history: a widely used algorithm that entered the field not through a journal but through a lecture slide, and whose central trick, remembering recent gradient sizes rather than all of them, is one of the building blocks of the optimizers training today’s models.
