ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

“ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” was submitted to arXiv on October 4, 2019 by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He of Microsoft. It became the core of DeepSpeed, Microsoft’s open-source library for large-scale model training.

The paper attacks a memory problem. In standard data-parallel training, every GPU holds a full copy of the model weights, gradients, and optimizer state, which wastes memory and caps model size. ZeRO, the Zero Redundancy Optimizer, partitions these across GPUs so each device stores only a slice, eliminating the redundancy while keeping communication low. The authors report this allowed an 8x increase in trainable model size and a 10x performance gain over the state of the art, with a path toward trillion-parameter models, and they used it to train Turing-NLG, then the world’s largest language model at 17 billion parameters.
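To make the savings concrete, here is a rough sketch of the per-GPU memory arithmetic. It assumes mixed-precision Adam training, where each parameter costs about 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of fp32 optimizer state (a weight copy, momentum, and variance), which is the accounting the paper uses. The function name and the `stage` argument are illustrative labels for this example, not DeepSpeed API. Activations, buffers, and fragmentation are ignored, so real usage will be higher.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage, k=12):
    """Approximate per-GPU bytes for weights, gradients, and optimizer state.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state partitioned across GPUs
    stage 2: optimizer state and gradients partitioned
    stage 3: optimizer state, gradients, and weights partitioned
    k is the assumed optimizer-state bytes per parameter (12 for Adam).
    """
    params = 2 * num_params  # fp16 weights
    grads = 2 * num_params   # fp16 gradients
    optim = k * num_params   # fp32 weight copy, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


# Example: a 7.5-billion-parameter model on 64 data-parallel GPUs.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the replicated baseline needs about 120 GB per GPU for model states alone, while full partitioning brings that below 2 GB, which is why sharding lets much larger models fit on the same hardware.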

ZeRO and DeepSpeed, alongside NVIDIA’s Megatron-LM tensor parallelism, became foundational plumbing for the large-model era. Many open and closed models with tens of billions of parameters or more were trained with some combination of these techniques.

