Data Parallelism

Data parallelism is the most common way to speed up neural-network training across many devices. The model is copied onto every device, and each training batch is split so that each device computes gradients on a different subset of examples. The devices then average those gradients, typically with a collective operation such as all-reduce, so that every copy of the model applies the same update and stays identical. When the subsets are the same size, the averaged gradient equals the gradient computed on the whole batch, so the result is the same as training on the whole batch at once, but the work is spread across the hardware.
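
The equivalence between averaged per-device gradients and the whole-batch gradient can be checked with a small simulation. The sketch below is illustrative only: it uses a linear model with a mean-squared-error loss, runs on one machine with NumPy, and stands in for all-reduce with a plain average. The function names (`mse_gradient`, `data_parallel_gradient`) and the batch and device counts are assumptions made for the example, not part of any library.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def data_parallel_gradient(w, X, y, num_devices):
    """Simulate one data-parallel step on a single machine.

    The batch is split into equal shards (one per hypothetical device),
    each shard yields a local gradient, and the local gradients are
    averaged, which is what an all-reduce would produce on every device.
    """
    X_shards = np.split(X, num_devices)   # requires an evenly divisible batch
    y_shards = np.split(y, num_devices)
    local_grads = [mse_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    return np.mean(local_grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # batch of 32 examples, 4 features
y = rng.normal(size=32)
w = rng.normal(size=4)         # identical weights on every "device"

full_grad = mse_gradient(w, X, y)
dp_grad = data_parallel_gradient(w, X, y, num_devices=4)
grads_match = np.allclose(full_grad, dp_grad)
print("averaged shard gradients match full-batch gradient:", grads_match)
```

The match relies on the shards being the same size. With uneven shards, a plain average would weight examples unequally, and the local gradients would need to be weighted by shard size instead.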

This approach scales well because the only cross-device communication is the gradient exchange once per training step, which can be made efficient with patterns like ring-allreduce, as popularized by Uber’s Horovod. Its limitation is memory: every device must hold a full copy of the model, its gradients, and its optimizer states, so data parallelism by itself cannot train a model whose training state does not fit in a single device’s memory. Techniques like ZeRO and PyTorch’s Fully Sharded Data Parallel address this by sharding those redundant copies across the data-parallel workers, while tensor and pipeline parallelism split the model itself.
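
A rough back-of-the-envelope calculation shows where the memory ceiling comes from. The sketch below assumes mixed-precision training with the Adam optimizer at 16 bytes per parameter (2 for half-precision weights, 2 for half-precision gradients, 12 for single-precision master weights plus Adam's two moment estimates), ignores activations and temporary buffers, and treats sharding as an even split of all three across workers. The 7-billion-parameter model, the 8 workers, and the function name `per_device_gib` are hypothetical choices for illustration.

```python
# Assumed bytes per parameter for mixed-precision training with Adam.
BYTES_PARAMS = 2      # half-precision weights
BYTES_GRADS = 2       # half-precision gradients
BYTES_OPTIMIZER = 12  # fp32 master weights + Adam momentum + Adam variance

def per_device_gib(num_params, num_workers=1, shard=False):
    """Estimate per-device memory (GiB) for weights, gradients, and optimizer state.

    With shard=False every worker holds a full replica (plain data parallelism).
    With shard=True the state is divided evenly across workers, the idealized
    case of fully sharded approaches. Activations are not counted.
    """
    total_bytes = num_params * (BYTES_PARAMS + BYTES_GRADS + BYTES_OPTIMIZER)
    if shard:
        total_bytes /= num_workers
    return total_bytes / 2**30

num_params = 7_000_000_000   # hypothetical 7B-parameter model
num_workers = 8              # hypothetical data-parallel group size

replicated = per_device_gib(num_params, num_workers)
sharded = per_device_gib(num_params, num_workers, shard=True)
print(f"replicated: {replicated:.1f} GiB per device")
print(f"sharded:    {sharded:.1f} GiB per device")
```

Under these assumptions, adding workers to plain data parallelism does nothing to the per-device figure, while the sharded figure shrinks in proportion to the number of workers. Real systems differ in which of the three pieces they shard and in how much extra communication that costs.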

Why a business reader should care: data parallelism is the default lever for training faster on more hardware, and understanding its memory ceiling explains why the largest models need additional, more complex parallelism strategies.
