Mixed Precision Training

Published in 2017 by Paulius Micikevicius and colleagues at NVIDIA and Baidu, this paper established a now-standard technique for training large neural networks faster and with less memory by using half-precision (16-bit) floating-point numbers instead of full 32-bit precision. The challenge is that 16-bit floats cover a much narrower numerical range: positive values smaller than roughly 6 × 10^-8 cannot be represented at all, so small gradient values can round to zero during training.
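The underflow problem can be reproduced with nothing but the Python standard library. The sketch below is illustrative only: the helper name `to_fp16` and the sample gradient magnitudes are assumptions made for this example, not values from the paper. It rounds ordinary floats to IEEE 754 half precision using the `struct` module's `e` format.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Smallest positive value half precision can hold (a subnormal), about 5.96e-8.
FP16_MIN_SUBNORMAL = 2.0 ** -24

# Hypothetical gradient magnitudes, chosen only to show the effect.
for grad in (1e-3, 1e-6, 1e-8):
    stored = to_fp16(grad)
    status = "lost (rounded to zero)" if stored == 0.0 else "kept (with rounding error)"
    print(f"gradient {grad:.0e} -> {stored!r}  {status}")
```

In this sketch the 1e-8 gradient comes back as exactly 0.0, while the two larger values survive with some rounding error.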

The authors propose two fixes that together make 16-bit training reliable. First, they keep a single master copy of the model weights in full 32-bit precision that accumulates the tiny updates from each optimizer step, so small changes are not lost; a 16-bit copy of those weights is used for the forward and backward passes. Second, they apply loss scaling: the loss is multiplied by a constant before gradients are computed, which pushes small gradient values into the representable range, and the gradients are divided by the same constant before the weight update. With these techniques the method matches full-precision accuracy across image, speech, language, and generative models while roughly halving memory consumption.
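Both fixes can be illustrated on a single scalar weight. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the loss scale of 1024, the update size, and the step count are all hypothetical, and the 16-bit backward pass is modeled simply as rounding the scaled gradient to half precision (scaling the loss by a constant scales every gradient by the same constant, by the chain rule).

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


LOSS_SCALE = 1024.0  # hypothetical constant, chosen for illustration


def scaled_fp16_gradient(true_grad: float, scale: float) -> float:
    """Model a 16-bit backward pass on a loss that was multiplied by `scale`."""
    return to_fp16(true_grad * scale)


def unscale(grad16: float, scale: float) -> float:
    """Divide the scaled gradient by `scale` in 32-bit precision."""
    return to_fp32(grad16 / scale)


def train(steps: int, update: float) -> tuple[float, float]:
    """Apply the same small update to a 16-bit-only weight and a 32-bit master weight."""
    w16_only = 1.0
    master = 1.0
    for _ in range(steps):
        w16_only = to_fp16(w16_only - update)  # update is rounded away every step
        master = to_fp32(master - update)      # update accumulates in 32-bit
    return w16_only, master


# Loss scaling: a gradient that underflows on its own survives once scaled.
print(to_fp16(1e-8))  # 0.0
recovered = unscale(scaled_fp16_gradient(1e-8, LOSS_SCALE), LOSS_SCALE)
print(recovered)

# Master weights: 100 updates of 1e-4 each.
w16_only, master = train(steps=100, update=1e-4)
print(w16_only, master)
```

In this toy setting the 16-bit-only weight never moves from 1.0, because each update is smaller than half the spacing between neighboring half-precision values near 1, while the 32-bit master weight ends close to 0.99. In a real training loop the master weight would be rounded to 16 bits for each forward and backward pass.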

The timing was deliberate: NVIDIA’s Volta GPUs had just introduced Tensor Cores that ran 16-bit matrix math far faster than 32-bit. Mixed precision turned that hardware capability into a practical training recipe. For a business reader, this paper is a key reason modern AI models can be trained at all within reasonable cost and memory budgets, and the approach remains the default for nearly every large model trained today.
