GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

“GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism” was submitted to arXiv in November 2018 by Yanping Huang and colleagues at Google. It addressed a different memory wall than tensor parallelism does: a model built as a long sequence of layers whose combined size exceeds the memory of a single accelerator. The remedy is to place different groups of consecutive layers on different devices.

The naive version of this idea is slow: each layer group needs the output of the group before it, so while one device computes, the others sit idle. GPipe’s contribution is a batch-splitting pipelining algorithm. It divides each mini-batch into smaller micro-batches and feeds them through the chain of devices in a staggered pipeline, so that several devices are busy at once. Combined with re-materialization (recomputing activations during the backward pass instead of storing them), this lets GPipe scale almost any network that can be expressed as a sequence of layers, with close to linear speedup as accelerators are added. The authors demonstrated it by training a 557-million-parameter AmoebaNet image model to 84.4 percent top-1 accuracy on ImageNet and a 6-billion-parameter, 128-layer multilingual Transformer translation model covering over 100 languages.
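
The staggering can be illustrated with a small simulation. The sketch below is not GPipe's implementation; it assumes every stage takes one equal time step per micro-batch, models only the forward pass, and ignores communication cost. The function names gpipe_forward_schedule and bubble_fraction are hypothetical, chosen for this example.

```python
def gpipe_forward_schedule(num_stages, num_micro_batches):
    """Return one list per time step of the (stage, micro_batch) pairs running then.

    Assumption: each stage spends exactly one time step on each micro-batch,
    and micro-batch m can enter stage s only after it has left stage s - 1.
    """
    total_steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_steps):
        # At time t, stage s works on micro-batch t - s, if that index exists.
        active = [
            (s, t - s)
            for s in range(num_stages)
            if 0 <= t - s < num_micro_batches
        ]
        schedule.append(active)
    return schedule


def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of device-time slots left idle under the schedule above."""
    schedule = gpipe_forward_schedule(num_stages, num_micro_batches)
    busy_slots = sum(len(active) for active in schedule)
    total_slots = num_stages * len(schedule)
    return 1 - busy_slots / total_slots


if __name__ == "__main__":
    # One micro-batch is the naive case: only one device works at a time.
    for m in (1, 4, 16):
        print(f"4 stages, {m:2d} micro-batches: "
              f"idle fraction = {bubble_fraction(4, m):.3f}")
```

Under these assumptions the idle fraction works out to (K - 1) / (M + K - 1) for K stages and M micro-batches. With 4 stages, a single micro-batch leaves the devices idle 75 percent of the time, while 16 micro-batches cut that to about 16 percent, which is why splitting the mini-batch more finely brings the speedup closer to linear.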

Pipeline parallelism became one of the three standard axes of parallelism used to train the largest models, alongside data and tensor parallelism. Using all three together is often called 3D parallelism.

For a business reader, GPipe is part of why training models far larger than any single chip can hold became practical, which is a precondition for the frontier models in use today.
