Automatic Differentiation

Automatic differentiation (often shortened to autodiff or AD) is a technique for computing the exact derivative of a function expressed as a computer program. Rather than working out a formula by hand or estimating a slope with finite differences, autodiff treats the program as a composition of elementary operations - additions, multiplications, exponentials, and so on - each of whose derivative is known, and mechanically applies the chain rule of calculus to combine them. The result is a derivative that is correct to machine precision and computed automatically, no matter how the original program was written.

Autodiff sits between two older approaches and avoids the weaknesses of both. The survey “Automatic Differentiation in Machine Learning: a Survey” by Atilim Gunes Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind (JMLR, 2018) sets out exactly this distinction: numerical differentiation by finite differences suffers from rounding and truncation error, while symbolic differentiation can produce enormous, redundant expressions. Autodiff instead evaluates derivatives numerically while still applying the exact rules of differentiation, giving accurate results without expression blowup.

There are two main modes. Forward mode propagates derivatives alongside values as the program runs from inputs to outputs, and is efficient when there are few inputs. Reverse mode runs the program forward, records the operations, and then sweeps backward to accumulate derivatives, which is efficient when there are many inputs and few outputs. Training a neural network is exactly that case - millions of parameters, one scalar loss - so reverse-mode autodiff is the foundation of the technique machine learning calls backpropagation.

Every modern deep-learning framework is, at heart, an autodiff engine wrapped around array operations. PyTorch’s autograd documentation describes how it records the operations performed into a graph so that “by tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.” The programmer writes only the forward computation; the framework builds the derivative computation for free. This is what made it practical to experiment with arbitrary network architectures: change the forward code, and the gradients update themselves.

The software significance is hard to overstate. Autodiff turned model training from a hand-derivation chore into an automatic byproduct of writing ordinary array code. It is the connective tissue between the tensor data structure, the computational graph, and the optimization loop, and it is the single capability that distinguishes a deep-learning framework from a plain numerical library.

Sources

Related