Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

“Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism” was submitted to arXiv in September 2019 by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro of NVIDIA. It tackled a practical limit: once a model is too big to fit in a single GPU’s memory, training it requires splitting the model itself across devices.

Megatron-LM introduced an efficient intra-layer model-parallel approach, now usually called tensor parallelism. It splits the matrix multiplications inside each transformer layer across GPUs and needs only a few communication operations, which are inserted in native PyTorch. Because it required no new compiler or custom library, it was straightforward to adopt. The team trained transformer models of up to 8.3 billion parameters on 512 GPUs, sustaining 76 percent scaling efficiency relative to a strong single-GPU baseline, and set state-of-the-art results on language-modeling benchmarks such as WikiText103 and LAMBADA.
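
The split described above can be illustrated with a small simulation of a transformer MLP block. This is a minimal sketch in plain Python, not the Megatron-LM code: the "GPUs" are just list entries, and the helper names (split_columns, split_rows, all_reduce_sum, mlp_tensor_parallel) are hypothetical. It assumes the scheme in which the first weight matrix is split by columns and the second by rows, so each device runs its share independently and a single sum across devices (the role an all-reduce plays in a real multi-GPU run) recovers the full output. Only the forward pass is shown.

```python
import math

def gelu(x):
    # Exact GeLU via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(a, b):
    # Plain nested-list matrix product: (n x k) times (k x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_gelu(m):
    return [[gelu(v) for v in row] for row in m]

def split_columns(m, parts):
    # Hypothetical helper: cut a matrix into `parts` equal column blocks.
    assert len(m[0]) % parts == 0, "column count must divide evenly"
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    # Hypothetical helper: cut a matrix into `parts` equal row blocks.
    assert len(m) % parts == 0, "row count must divide evenly"
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def all_reduce_sum(partials):
    # Stand-in for a collective all-reduce: element-wise sum of the
    # partial outputs held by each simulated device.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]

def mlp_reference(x, a, b):
    # Unsplit MLP block: GeLU(x @ a) @ b.
    return matmul(apply_gelu(matmul(x, a)), b)

def mlp_tensor_parallel(x, a, b, world_size):
    # Each simulated device r holds a column block of `a` and the matching
    # row block of `b`. GeLU is element-wise, so it can be applied to each
    # column block without any communication.
    a_shards = split_columns(a, world_size)
    b_shards = split_rows(b, world_size)
    partials = [matmul(apply_gelu(matmul(x, a_r)), b_r)
                for a_r, b_r in zip(a_shards, b_shards)]
    # One reduction at the end combines the per-device partial outputs.
    return all_reduce_sum(partials)

if __name__ == "__main__":
    x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
    a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8], [-0.9, 1.0, 1.1, -1.2]]
    b = [[0.2, -0.1], [0.4, 0.3], [-0.6, 0.5], [0.8, -0.7]]
    print(mlp_reference(x, a, b))
    print(mlp_tensor_parallel(x, a, b, world_size=2))
```

The two printed matrices should agree up to floating-point rounding. Splitting the first matrix by columns is what lets the nonlinearity run locally on each device; a row split there would require a synchronization before GeLU.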

Megatron-LM became foundational infrastructure for the large-model era. Its tensor-parallel technique, combined with data and pipeline parallelism, was part of how later models with hundreds of billions of parameters were trained, and the Megatron codebase itself was reused and extended across the field.

Last verified June 7, 2026