Mixture-of-Depths: Dynamically Allocating Compute in Transformers

“Mixture-of-Depths: Dynamically allocating compute in transformer-based language models,” submitted to arXiv on April 2, 2024 by David Raposo, Sam Ritter, and colleagues at Google DeepMind, addressed a simple inefficiency in standard Transformers: every token is processed by every layer with the same amount of compute, even though not all tokens are equally hard to predict.

The method, abbreviated MoD, adds a routing mechanism to a Transformer layer. A learned router scores every token and selects the top-k to be processed by that layer’s self-attention and feed-forward computation, while the remaining tokens skip the layer via the residual connection. Because k is fixed in advance, the total compute budget is known ahead of time, unlike conditional-computation methods whose cost varies with the input. The result is a model that allocates more computation to the positions that need it and less to the rest, all learned end to end.
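The routing step can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the names mod_layer, toy_block, and router_w are hypothetical, the toy block transforms each token independently (a real block would run self-attention jointly over the selected tokens), and scaling the block output by the router score is assumed here as the way to keep the router on the gradient path.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def toy_block(x):
    # Hypothetical stand-in for a layer's attention + feed-forward computation.
    return [math.tanh(v) for v in x]


def mod_layer(tokens, router_w, k, block=toy_block):
    """Send the k highest-scoring tokens through `block`; the rest skip it."""
    # One scalar router score per token.
    scores = [dot(router_w, x) for x in tokens]
    # Indices of the k tokens with the highest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(top)
    out = []
    for i, x in enumerate(tokens):
        if i in chosen:
            # Routed token: residual plus the score-weighted block output (assumed form).
            update = block(x)
            out.append([xi + scores[i] * ui for xi, ui in zip(x, update)])
        else:
            # Skipped token: passes through unchanged on the residual path.
            out.append(list(x))
    return out, sorted(chosen)


# Usage: 8 tokens of width 4, with a budget of k = 2 tokens for this layer.
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
router_w = [random.gauss(0, 1) for _ in range(4)]
out, routed = mod_layer(tokens, router_w, k=2)
```

Whatever the input, exactly k tokens pass through the block, which is what makes the per-layer cost fixed in advance.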

The authors reported that MoD models match baseline Transformer performance for equivalent training FLOPs and wall-clock time while requiring only a fraction of the floating-point operations per forward pass, and that they can be upwards of 50 percent faster to step during post-training sampling.
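To see where a per-forward-pass saving could come from, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions, not a figure from the paper: it uses common approximations for matrix-multiply FLOPs, assumes routed tokens attend only among themselves, and the function names, sequence length, capacity, and model width are all hypothetical.

```python
def block_flops(tokens, d_model, mlp_ratio=4):
    """Approximate FLOPs for one Transformer block over `tokens` positions."""
    proj = 8 * tokens * d_model ** 2              # Q, K, V and output projections
    attn = 4 * tokens ** 2 * d_model              # QK^T scores and the weighted sum over V
    mlp = 4 * mlp_ratio * tokens * d_model ** 2   # the two feed-forward matmuls
    return proj + attn + mlp


def mod_block_flops(seq_len, k, d_model, mlp_ratio=4):
    """Same block when only k of seq_len tokens are routed through it."""
    router = 2 * seq_len * d_model                # one scalar score for every token
    return router + block_flops(k, d_model, mlp_ratio)


# Hypothetical sizes chosen only for illustration.
seq_len, k, d_model = 2048, 256, 1024
dense = block_flops(seq_len, d_model)
routed = mod_block_flops(seq_len, k, d_model)
print(f"routed block uses {routed / dense:.1%} of the dense block's FLOPs")
```

Under these assumptions the projection and feed-forward terms shrink linearly with k and the attention term shrinks quadratically, while the router adds only a small cost that scales with the full sequence.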

The idea is closely related to mixture-of-experts, which routes tokens across width; mixture-of-depths instead routes across depth. For anyone running large models, conditional computation like this is a direct lever on inference cost, letting a model spend its compute where it actually matters.
