Gradient Descent

Gradient descent is the workhorse optimization method behind training most modern machine learning models. It works by measuring the slope (gradient) of the error with respect to each model setting (parameter), then nudging every setting a small step against that slope, the direction in which the error falls. Repeating this many times gradually moves the model toward better settings. The numerical computation and optimization chapters of Goodfellow, Bengio, and Courville’s “Deep Learning” present gradient descent and its stochastic variant as the central tools for training deep networks.
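To make the loop concrete, here is a minimal sketch in Python. It assumes a made-up, one-setting error curve, error(w) = (w - 3)^2, whose slope is 2 * (w - 3); the function names, the starting point, and the step size of 0.1 are illustrative choices, not taken from any particular library or from the book cited above.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        # Move a small step in the direction that lowers the error.
        w = w - learning_rate * gradient(w)
    return w


# Hypothetical error curve: error(w) = (w - 3)^2, lowest at w = 3.
def error_gradient(w):
    return 2 * (w - 3)


best_w = gradient_descent(error_gradient, start=0.0)
print(best_w)  # ends up very close to 3
```

A real model has millions or billions of settings rather than one, but the update is the same idea applied to all of them at once.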

In practice, most models are trained with stochastic gradient descent or one of its variants, which estimate the gradient from small random batches of data rather than the whole dataset, making each step fast. The statistical roots of this idea go back to Herbert Robbins and Sutton Monro’s 1951 paper “A Stochastic Approximation Method” (Annals of Mathematical Statistics, DOI 10.1214/aoms/1177729586), which established that iterative updates from noisy samples can converge to a solution, provided the step sizes shrink at a suitable rate.
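The batching idea can be sketched as follows. This is an illustrative toy, not the procedure from the 1951 paper: it assumes synthetic data scattered around the line y = 2x + 1, a fixed learning rate, and hypothetical helper names (make_toy_data, minibatch_sgd). Each update looks at only 16 points rather than all 200.

```python
import random


def make_toy_data(n=200, seed=1):
    """Synthetic points scattered around the line y = 2x + 1 (an assumption)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        points.append((x, y))
    return points


def minibatch_sgd(data, learning_rate=0.1, batch_size=16, epochs=50, seed=0):
    """Fit y = slope * x + intercept using gradients from small random batches."""
    rng = random.Random(seed)
    data = list(data)
    slope, intercept = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # a fresh random order on every pass over the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_slope = 0.0
            grad_intercept = 0.0
            for x, y in batch:
                error = (slope * x + intercept) - y
                # Gradient of the squared error for this one point.
                grad_slope += 2 * error * x
                grad_intercept += 2 * error
            # Average over the batch: a noisy estimate of the full gradient.
            slope -= learning_rate * grad_slope / len(batch)
            intercept -= learning_rate * grad_intercept / len(batch)
    return slope, intercept


slope, intercept = minibatch_sgd(make_toy_data())
print(slope, intercept)  # should land near 2 and 1
```

Because every step uses a different random batch, the estimates jitter slightly instead of settling exactly, which is the price paid for much cheaper steps.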

The size of each step, called the learning rate, matters a great deal. Too large, and each update overshoots the best settings and the error can grow instead of shrinking; too small, and training crawls.
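The effect of the learning rate is easy to see on a toy problem. The sketch below assumes the same kind of made-up error curve, error(w) = (w - 3)^2, starting from w = 0; the three learning rates are arbitrary values picked to show each regime, and the function name is hypothetical.

```python
def error_after_training(learning_rate, steps=50, start=0.0, target=3.0):
    """Distance from the best setting after `steps` gradient descent updates."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - target)  # slope of (w - target)^2
        w -= learning_rate * gradient
    return abs(w - target)


too_large = error_after_training(1.1)     # overshoots further on every step
too_small = error_after_training(0.001)   # barely moves in 50 steps
reasonable = error_after_training(0.1)    # closes in on the target

print(too_large, too_small, reasonable)
```

On this curve the starting distance from the target is 3. The largest rate ends far further away than it began, the smallest has closed only a small part of the gap, and the middle value lands very close to the target.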

Why business readers should care: Gradient descent is why training large AI models is computationally expensive but feasible. It explains the heavy reliance on GPUs and large compute budgets, and why tuning training settings is part science, part craft.