PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

“PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel” was submitted to arXiv in April 2023 by Yanli Zhao, Andrew Gu, and colleagues at Meta, and published in the Proceedings of the VLDB Endowment (PVLDB) that year. It describes the engineering of Fully Sharded Data Parallel (FSDP), the native PyTorch feature for training models too large for ordinary data parallelism.

Standard data parallelism, such as PyTorch's Distributed Data Parallel (DDP), keeps a full copy of the model parameters, gradients, and optimizer states on every worker, which duplicates memory across devices and caps the model size at what a single device can hold. FSDP, building on the ideas in the earlier ZeRO work, instead shards those tensors across the data-parallel workers so that each holds only a slice. When a layer is needed for computation, FSDP gathers its full parameters just in time, uses them, and then frees them again, trading some extra communication for a large reduction in per-device memory. The paper details how FSDP was co-designed with several PyTorch internals (the tensor implementation, the dispatcher, and the CUDA memory caching allocator) to achieve performance comparable to DDP while supporting much larger models with near-linear scaling in TFLOPS.
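The shard, gather, use, and free cycle described above can be illustrated with a toy simulation in plain Python. This is a minimal sketch and not the FSDP implementation or its API: the helper names (shard, all_gather, sharded_forward) are hypothetical, the workers are simulated in a single process, the "layer computation" is a stand-in (the input scaled by the sum of the layer's weights), and the memory figure simply counts parameter elements resident on one worker.

```python
from typing import List, Tuple


def shard(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into equal shards, zero-padding the tail."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Stand-in for the all-gather collective: concatenate shards, drop padding."""
    return [x for s in shards for x in s][:numel]


def sharded_forward(layers: List[List[float]], world_size: int,
                    x: float) -> Tuple[float, int]:
    """Run x through the layers while each worker stores only its shards.

    Returns the output and the peak number of parameter elements that one
    worker holds at any moment (its own shards plus one gathered layer).
    """
    sharded = [(shard(w, world_size), len(w)) for w in layers]
    resident = sum(len(shards[0]) for shards, _ in sharded)  # worker 0's slices
    peak = resident
    for shards, numel in sharded:
        full = all_gather(shards, numel)        # gather just in time
        peak = max(peak, resident + len(full))
        x = sum(w * x for w in full)            # toy layer computation
        del full                                # free the full parameters again
    return x, peak


layers = [[0.25] * 8, [0.125] * 8, [0.5] * 8]
output, peak = sharded_forward(layers, world_size=4, x=1.0)
replica = sum(len(w) for w in layers)  # elements held per worker without sharding
print(output, peak, replica)
```

In this toy setting each of the four workers peaks at 14 resident parameter elements, against 24 for a full replica, and the gap widens as the number of layers or workers grows because only one layer is ever materialized in full at a time.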

FSDP made memory-sharded training a first-class, built-in capability of PyTorch rather than an external add-on, lowering the barrier to training large models.

For a business reader, FSDP is part of how organizations train models that would otherwise not fit on their hardware, using the same clusters more efficiently.
