Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” was submitted to arXiv on January 11, 2021 by William Fedus, Barret Zoph, and Noam Shazeer at Google. It pushed sparse mixture-of-experts models past the trillion-parameter mark and, just as importantly, made them simpler to train.

Earlier sparse mixture-of-experts layers routed each input to several experts and combined the results. The Switch Transformer’s central simplification is to route each token to exactly one expert (the “switch”), which the authors showed cuts communication and computation costs while preserving quality. With only one expert active per token, the model holds a vast number of parameters but spends a fixed, modest amount of compute on each token. The paper demonstrated pre-training “up to trillion parameter models on the Colossal Clean Crawled Corpus” (the C4 dataset used for T5) and reported pre-training speedups of up to 7x over dense T5 baselines using the same computational resources.
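The single-expert routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`switch_route`, `switch_layer`), the toy router weights, and the toy experts are all assumptions made for illustration, and real systems batch this computation across accelerators rather than looping over tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights):
    """Pick one expert for a token (top-1 routing).

    router_weights is a hypothetical matrix with one row per expert;
    returns (expert_index, gate_value).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(tokens, router_weights, experts):
    """Send each token to its single chosen expert and scale by the gate."""
    outputs = []
    for token in tokens:
        idx, gate = switch_route(token, router_weights)
        expert_out = experts[idx](token)  # only one expert runs per token
        outputs.append([gate * v for v in expert_out])
    return outputs

# Toy setup (illustrative values only): two experts, two-dimensional tokens.
toy_router = [[1.0, 0.0], [0.0, 1.0]]
toy_experts = [
    lambda x: [2.0 * v for v in x],  # stand-in for expert 0's feed-forward net
    lambda x: [v + 1.0 for v in x],  # stand-in for expert 1's feed-forward net
]
toy_tokens = [[2.0, 0.0], [0.0, 3.0]]
toy_outputs = switch_layer(toy_tokens, toy_router, toy_experts)
```

Because each token triggers exactly one expert call, adding more experts increases the parameter count without increasing the work done per token, which is the property the paragraph above describes.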

The paper also tackled the engineering realities of sparse training: an auxiliary load-balancing loss to keep experts evenly used, selective precision for stability (the router is computed in float32 while the rest of the model trains in bfloat16), and distillation of the giant sparse model back down into a smaller dense one for deployment. These techniques became reference points for the wave of mixture-of-experts models that followed.
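As a rough illustration of the load-balancing idea, the sketch below computes an auxiliary loss of the form alpha * N * sum_i(f_i * P_i), where N is the number of experts, f_i is the fraction of tokens whose top choice is expert i, and P_i is the mean router probability assigned to expert i. The function name and the toy probability tables are assumptions for illustration; alpha = 0.01 is used here as a default coefficient.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss that is smallest when experts are used evenly.

    router_probs: one list of per-expert probabilities for each token.
    alpha: scaling coefficient for the auxiliary term (assumed default).
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    frac_tokens = [0.0] * num_experts  # f_i: share of tokens routed to expert i
    mean_prob = [0.0] * num_experts    # P_i: average router probability for expert i
    for probs in router_probs:
        top = max(range(num_experts), key=probs.__getitem__)
        frac_tokens[top] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return alpha * num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Toy router outputs (illustrative values only).
balanced = [[0.9, 0.1], [0.1, 0.9]]   # tokens split evenly across two experts
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token prefers expert 0
loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

With these toy inputs the balanced case yields approximately 0.01 (equal to alpha) and the collapsed case approximately 0.018, so routing that concentrates tokens on one expert is penalized more heavily.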

Switch Transformers, together with the 2017 sparsely-gated mixture-of-experts work, established the recipe that frontier labs now use to grow model capacity without growing per-token cost in lockstep.

