“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” was submitted to arXiv on January 23, 2017 by Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. It is the paper that turned mixture-of-experts from an old idea into a practical way to build enormously large models.
The architecture inserts a layer made of “up to thousands of feed-forward sub-networks,” the experts. For each input, “a trainable gating network determines a sparse combination of these experts to use,” so only a handful of experts actually run on any given example, even though the model as a whole contains far more. This is the crucial move: capacity (total parameters) is decoupled from the compute spent per example. The paper reported “greater than 1000x improvements in model capacity with only minor losses in computational efficiency,” with models scaling to as many as 137 billion parameters, and demonstrated better results on language modeling and machine translation at lower computational cost than the previous state of the art.
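The decoupling of capacity from per-example compute can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the names `make_expert`, `top_k_gate`, and `moe_layer`, the layer sizes, and the random weights are all hypothetical, and the tunable noise that the paper adds to the gate logits before the top-k step is omitted here to keep the example deterministic.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, hidden, rng):
    """Build one tiny feed-forward expert: dim -> hidden (ReLU) -> dim.
    Sizes and initialization are illustrative assumptions."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

    return expert

def top_k_gate(logits, k):
    """Keep the k largest logits and softmax over them.
    Returns {expert_index: weight}; every other expert has weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_layer(x, gate_w, experts, k):
    """Sparse mixture-of-experts forward pass for a single input vector."""
    # One gate logit per expert, linear in the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    # Only the k selected experts are evaluated; the rest cost nothing.
    for idx, g in gates.items():
        y = experts[idx](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

rng = random.Random(0)
dim, num_experts, k = 4, 16, 2
experts = [make_expert(dim, 8, rng) for _ in range(num_experts)]
gate_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
out, gates = moe_layer([1.0, -0.5, 0.25, 2.0], gate_w, experts, k)
print(sorted(gates))  # indices of the two experts that actually ran
```

Raising `num_experts` in this sketch adds parameters but leaves the work per input unchanged apart from the cheap gate logits, since the loop still evaluates only `k` experts.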
The work solved practical problems that had kept conditional computation from working at scale: keeping the experts balanced so the gating network does not collapse onto a few favorites, and making the sparse routing efficient on modern hardware. These lessons fed into later sparse models, most directly the Switch Transformer, and sparse mixture-of-experts is now a widely used design in frontier models.
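One of the paper's balancing devices is an auxiliary loss on each expert's "importance," the sum of its gate values over a batch, penalizing the squared coefficient of variation of those sums. The sketch below shows that idea in plain Python under stated assumptions: the function name `importance_loss`, the default `weight` of 0.1, and the use of population variance are illustrative choices, and the paper's separate load-balancing term is not shown.

```python
def importance_loss(gate_batch, num_experts, weight=0.1):
    """Squared coefficient of variation of per-expert importance.

    gate_batch: list of {expert_index: gate_weight} dicts, one per example.
    weight: hypothetical scaling factor (a hand-tuned hyperparameter).
    """
    # Importance of an expert = sum of its gate values across the batch.
    importance = [0.0] * num_experts
    for gates in gate_batch:
        for idx, g in gates.items():
            importance[idx] += g

    mean = sum(importance) / num_experts
    if mean == 0.0:
        return 0.0  # empty batch: nothing to balance
    # Population variance is an assumption of this sketch.
    var = sum((v - mean) ** 2 for v in importance) / num_experts
    return weight * var / (mean ** 2)

# Gate mass spread evenly over four experts versus piled onto two of them.
balanced = [{0: 0.5, 1: 0.5}, {2: 0.5, 3: 0.5}]
collapsed = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
print(importance_loss(balanced, 4), importance_loss(collapsed, 4))
```

The loss is zero when every expert receives the same total gate weight and grows as the gating network concentrates on a few experts, which is the collapse the penalty is meant to discourage.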
The paper is also a milestone in Noam Shazeer’s career, pairing with his work on the Transformer the same year: one paper gave the field its dominant architecture, the other the efficiency technique used to scale it.