A Stochastic Approximation Method

“A Stochastic Approximation Method” by Herbert Robbins and Sutton Monro appeared in The Annals of Mathematical Statistics, volume 22, issue 3, in September 1951 (pages 400-407). It founded the field of stochastic approximation and is the direct mathematical ancestor of stochastic gradient descent (SGD), the family of optimization methods used to train nearly every modern neural network.

The paper asks how to find the root of a function when the function cannot be measured exactly and only noisy samples of it are available. Robbins and Monro showed that you can still converge to the right answer by taking small steps in the direction the noisy measurements suggest, provided the step sizes shrink over time at the right rate: slowly enough that the steps can still travel any needed distance (the step sizes sum to infinity), but fast enough that the noise averages out (their squares sum to a finite value). These conditions on the step sizes are still cited today.
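
As a rough illustration of the idea, the sketch below runs a Robbins-Monro style iteration on a toy problem. It is not code from the paper: the function names (`robbins_monro`, `noisy_line`), the test function 2x - 6, the unit Gaussian noise, and the step-size choice a_n = 1/n are all assumptions made for the example. The 1/n schedule is one common choice whose terms sum to infinity while their squares sum to a finite value.

```python
import random


def robbins_monro(noisy_measurement, x0, target=0.0, steps=20000, a=1.0):
    """Estimate x such that the expected measurement at x equals `target`.

    Assumes the expected measurement is increasing in x.
    Step sizes a / n sum to infinity while their squares sum to a finite value.
    """
    x = x0
    for n in range(1, steps + 1):
        step = a / n
        # Move against the observed error; the noise averages out as steps shrink.
        x -= step * (noisy_measurement(x) - target)
    return x


rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy_line(x):
    # Hypothetical example: the true function is 2x - 6 (root at x = 3),
    # but every measurement is corrupted by unit-variance Gaussian noise.
    return 2.0 * x - 6.0 + rng.gauss(0.0, 1.0)


estimate = robbins_monro(noisy_line, x0=0.0)
print(estimate)  # should land close to the true root at 3
```

No single measurement here is trustworthy, yet the running estimate settles near the root because later, smaller steps average the noise away.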

Machine learning faces the same situation. Training looks for a point where the gradient of the loss is zero, which is a root-finding problem, but the true gradient over all the data is too expensive to compute, so each step uses a noisy estimate from a small batch of examples. The learning-rate schedules that practitioners tune are descendants of the Robbins-Monro conditions.
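
The connection can be sketched with a toy mini-batch SGD loop. Everything here is an assumption chosen for illustration: the synthetic dataset (y = 3x plus noise), the one-parameter linear model, the batch size, and the decaying schedule 1 / (1 + 0.01 n), which is just one schedule that satisfies the Robbins-Monro conditions. Real training setups often use other schedules.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical dataset: y = 3x plus Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(10000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.5)) for x in xs]


def minibatch_gradient(w, batch):
    # Gradient of the loss 0.5 * (w * x - y)^2, averaged over the batch only.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, w0=0.0, steps=5000, batch_size=32):
    w = w0
    for n in range(1, steps + 1):
        batch = rng.sample(data, batch_size)  # small random batch, not the full data
        lr = 1.0 / (1.0 + 0.01 * n)           # learning rate decays roughly like 1 / n
        w -= lr * minibatch_gradient(w, batch)
    return w


w_hat = sgd(data)
print(w_hat)  # should land close to the true slope of 3
```

Each step sees only 32 of the 10,000 examples, so every gradient is a noisy estimate, but the shrinking learning rate plays the same role as the shrinking step size in the root-finding setting.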

Why business readers should care: the theory that justifies training models on small, noisy batches of data instead of the entire dataset at once, which is what makes training large AI models feasible at all, traces back to this short 1951 paper.
