bfloat16

bfloat16, short for brain floating point, is a 16-bit number format created by the Google Brain team to make neural network training faster and cheaper without sacrificing stability. A bfloat16 value uses one sign bit, eight exponent bits, and seven mantissa bits. The crucial design choice is keeping eight exponent bits, the same as standard 32-bit float, so bfloat16 covers essentially the same dynamic range as full precision: its largest finite value is about 3.39 × 10^38, against about 3.40 × 10^38 for 32-bit float. It simply carries fewer fraction digits (seven mantissa bits instead of 23), trading precision for range.
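The bit layout described above can be made concrete with a small sketch. It assumes round-to-nearest-even when dropping the low 16 bits of a 32-bit float (real hardware may round differently or simply truncate), and the helper names `to_bfloat16_bits`, `from_bfloat16_bits`, and `split_fields` are invented here for illustration.

```python
import struct


def to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (assumed round-to-nearest-even)."""
    if x != x:
        return 0x7FC0  # map any NaN to one quiet-NaN pattern (an assumption)
    # Reinterpret x as the 32 bits of an IEEE single-precision float.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that dropping the low 16 bits rounds to nearest,
    # with ties going to the pattern whose last kept bit is 0.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b: int) -> float:
    """Widen a bfloat16 pattern to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def split_fields(b: int) -> tuple[int, int, int]:
    """Split a bfloat16 pattern into (sign, exponent, mantissa) fields."""
    sign = b >> 15              # 1 bit
    exponent = (b >> 7) & 0xFF  # 8 bits, biased by 127 as in 32-bit float
    mantissa = b & 0x7F         # 7 bits
    return sign, exponent, mantissa


b = to_bfloat16_bits(3.14159)
print(hex(b), split_fields(b), from_bfloat16_bits(b))
# → 0x4049 (0, 128, 73) 3.140625
```

The round trip shows the trade in miniature: the exponent field survives unchanged, while the value keeps only about two to three significant decimal digits.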

This contrasts with IEEE half precision, FP16, which spends five bits on the exponent and ten on the mantissa. FP16 is more precise but has a much narrower range: its largest finite value is 65,504 and its smallest normal value is about 6.1 × 10^-5. That makes it prone to overflow and underflow during training and often requires careful loss scaling. Because neural networks turn out to be far more sensitive to the size of the exponent than to the number of mantissa bits, bfloat16 behaves almost like a drop-in replacement for 32-bit float, while FP16 needs more care.
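The range and precision gap can be worked out directly from the bit counts. The sketch below applies the usual IEEE-style formulas for a binary format with an implicit leading bit; the function names and the `FORMATS` table are illustrative, not part of any library.

```python
def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value: all-ones mantissa at the largest normal exponent."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias  # the top exponent code is reserved for inf/NaN
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp


def min_normal(exp_bits: int) -> float:
    """Smallest positive normal value (subnormals are ignored here)."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)


def epsilon(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


# (exponent bits, mantissa bits) for the three formats discussed in the text.
FORMATS = {"FP16": (5, 10), "bfloat16": (8, 7), "FP32": (8, 23)}

for name, (e, m) in FORMATS.items():
    print(f"{name:9s} max={max_finite(e, m):.4e} "
          f"min_normal={min_normal(e):.4e} eps={epsilon(m):.4e}")
```

The output puts FP16 at a maximum of 65,504 with a step of about 0.001 near 1.0, while bfloat16 reaches roughly 3.39 × 10^38 with a coarser step of about 0.008.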

There is a hardware payoff too: a multiplier’s silicon area grows roughly with the square of the mantissa width, so bfloat16 multipliers are roughly half the size of FP16 multipliers and about eight times smaller than 32-bit ones. Google’s TPUs perform matrix multiplication on bfloat16 inputs while accumulating results in 32-bit float, capturing most of the speed benefit with little loss of accuracy. bfloat16 is one of the quiet engineering decisions that make training today’s largest models affordable, and it has since been adopted across GPUs and other AI chips.
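Why the accumulator stays in 32-bit float can be seen in a small software simulation. This is only a sketch of the idea, not a model of any particular chip: inputs are rounded to bfloat16 (round-to-nearest-even assumed, NaN not handled), and the running sum is rounded after each addition to either 32-bit float or bfloat16. All function names are made up for the example.

```python
import struct


def round_to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bfloat16(x: float) -> float:
    """Round to the nearest bfloat16 value, ties to even (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot(a, b, round_accumulator):
    """Dot product with bfloat16 inputs and a caller-chosen accumulator width."""
    total = 0.0
    for x, y in zip(a, b):
        # The product of two 8-bit significands fits exactly in a wider float.
        product = round_to_bfloat16(x) * round_to_bfloat16(y)
        total = round_accumulator(total + product)
    return total


a = [1.0] * 1024
b = [2.0 ** -8] * 1024  # exact result is 1024 / 256 = 4.0

print(dot(a, b, round_to_float32))   # → 4.0
print(dot(a, b, round_to_bfloat16))  # → 1.0
```

With a bfloat16 accumulator the sum stalls at 1.0, because each further term of 2^-8 is only half the spacing between neighboring bfloat16 values there and is rounded away. The 32-bit accumulator keeps every term.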
