FP8 Formats for Deep Learning

This 2022 paper, jointly authored by researchers at NVIDIA, Arm, and Intel, proposes a common pair of 8-bit floating-point formats for deep learning, an effort to standardize precision below the 16-bit formats that had become widespread. Relative to 16-bit formats, going to 8 bits halves memory and bandwidth needs and can roughly double arithmetic throughput on hardware that supports it. With only 8 bits, however, every bit given to the exponent (range) is taken from the mantissa (precision), so the tradeoff becomes acute, and the authors define two formats tuned for different uses.

The first format, E4M3, uses 1 sign bit, 4 exponent bits, and 3 mantissa bits for more precision and is recommended mainly for weights and activations in the forward pass. The second, E5M2, uses 1 sign bit, 5 exponent bits, and 2 mantissa bits for a wider dynamic range and is aimed at gradients in the backward pass. The paper reports that training with these formats matches 16-bit accuracy across convolutional networks, recurrent networks, and Transformers, including language models up to 175 billion parameters, without changing the training recipe or hyperparameters.
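To make the range-versus-precision split concrete, the following is a minimal Python sketch that decodes a raw 8-bit pattern under each format. It is an illustration, not code from the paper. The function names (`decode_fp8`, `e4m3`, `e5m2`) are hypothetical. The exponent biases (7 and 15) and the special-value rules are taken from the paper's format definitions rather than from the summary above: E5M2 follows IEEE 754 conventions for infinities and NaNs, while E4M3 has no infinities and reserves only the all-ones exponent and mantissa pattern for NaN.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern (0..255) into a Python float.

    ieee_specials=True  : all-ones exponent encodes inf/NaN (E5M2 convention).
    ieee_specials=False : only all-ones exponent AND mantissa is NaN,
                          and there are no infinities (E4M3 convention).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_specials and exp == exp_max:
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        return math.nan

    fraction = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + fraction) * 2.0 ** (exp - bias)


def e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


print(e4m3(0x7E))  # largest finite E4M3 value: 448.0
print(e5m2(0x7B))  # largest finite E5M2 value: 57344.0
```

Under these assumptions the largest finite E5M2 value is 128 times the largest E4M3 value, while E4M3 resolves eight mantissa steps per power of two against four for E5M2, which is the tradeoff that motivates using one format for weights and activations and the other for gradients.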

The proposal mattered because it was a cross-vendor agreement, and FP8 quickly became a hardware feature: NVIDIA’s Hopper GPUs shipped a Transformer Engine built around FP8, and Blackwell extended the idea to even lower precision. For a general reader, this paper marks the point where rival chipmakers agreed on a shared low-precision number format, with the aim that models trained on one vendor’s hardware behave predictably on another’s.

