Gradient Checkpointing

Gradient checkpointing, also called activation recomputation or re-materialization, is a memory-saving technique for training deep neural networks. During a normal forward pass, the network stores the intermediate outputs (activations) of every layer because they are needed again when computing gradients in the backward pass. For very deep models this stored activation memory can dwarf the memory used by the model’s own parameters.

The idea, analyzed in Tianqi Chen and colleagues’ 2016 paper “Training Deep Nets with Sublinear Memory Cost,” is to keep only a small subset of activations, the checkpoints, and discard the rest. When the backward pass needs a discarded activation, it is recomputed by re-running the forward computation from the nearest preceding checkpoint. This trades extra computation for a large reduction in memory: the paper shows that training can be done with activation memory growing only as the square root of the number of layers, at the cost of roughly one extra forward pass per training step. The technique is now built into major frameworks and is routinely combined with parallelism strategies (GPipe, for instance, relies on it) to fit larger models or larger batches onto fixed hardware.
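The trade-off described above can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation or any framework's API: it assumes a chain of scalar layers of the form tanh(w * x), and the names `layer`, `layer_grad`, `backward_full`, `backward_checkpointed`, and `segment` are hypothetical, chosen only for illustration. It computes the gradient of the final output with respect to the input twice, once storing every layer input and once keeping only one checkpoint per segment, and counts how many activations each approach holds at its peak.

```python
import math


def layer(w, x):
    """One toy layer (an assumption for illustration): y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of the toy layer with respect to its input x."""
    y = math.tanh(w * x)
    return w * (1.0 - y * y)


def backward_full(weights, x0):
    """Baseline: store the input of every layer, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= layer_grad(weights[i], inputs[i])
    return grad, len(inputs)  # gradient, number of stored activations


def backward_checkpointed(weights, x0, segment):
    """Keep one checkpoint per `segment` layers; recompute the rest on demand."""
    if segment < 1:
        raise ValueError("segment must be at least 1")
    n = len(weights)

    # Forward pass: remember only the input to the first layer of each segment.
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)

    grad = 1.0
    peak = 0           # most activations held at any one time
    extra_forward = 0  # layer evaluations repeated during the backward pass

    # Backward pass: walk the segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Re-run the forward computation from the checkpoint to rebuild
        # the inputs of the layers in this segment.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(weights[i], inputs[-1]))
            extra_forward += 1
        # All checkpoints are held throughout; inputs[0] is itself a
        # checkpoint, so it is not counted twice.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad *= layer_grad(weights[i], inputs[i - start])
        # The rebuilt inputs for this segment are dropped here.

    return grad, peak, extra_forward


if __name__ == "__main__":
    weights = [0.9 + 0.002 * i for i in range(100)]  # 100 toy layers
    g_full, stored_full = backward_full(weights, 0.5)
    g_ckpt, peak, extra = backward_checkpointed(weights, 0.5, segment=10)
    print("gradients match:", math.isclose(g_full, g_ckpt, rel_tol=1e-12))
    print("activations stored without checkpointing:", stored_full)
    print("peak activations with checkpointing:", peak)
    print("recomputed layer evaluations:", extra)
```

With 100 layers and segments of 10, the sketch holds at most 19 activations instead of 100 and repeats 90 layer evaluations, slightly less than one extra forward pass. In general, n layers split into segments of k layers need about n/k checkpoints plus k recomputed activations, a total that is smallest when k is near the square root of n. Real frameworks apply the same idea to tensors and whole sub-networks rather than scalars.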

Why a business reader should care: gradient checkpointing is a common way teams train bigger models on the GPUs they already have, spending a bit more compute time to overcome a hard memory limit.
