“Mamba: Linear-Time Sequence Modeling with Selective State Spaces” was submitted to arXiv on December 1, 2023 by Albert Gu of Carnegie Mellon University and Tri Dao of Princeton University. It is the most prominent of the state-space model architectures, a family that is the main serious challenger to the Transformer’s dominance in sequence modeling.
The Transformer’s attention compares every token to every other token, so its compute and memory costs grow with the square of the sequence length, which is why long context is expensive. State-space models take a different route: they process a sequence in a single recurrent sweep whose cost grows only linearly with length, much like a classical RNN but built on the mathematics of continuous-time linear systems. The catch with earlier state-space models, such as S4, was that their dynamics were fixed regardless of the input, so they could not selectively focus on or ignore specific tokens the way attention can. Mamba’s key move was to make the state-space parameters depend on the input: a “selective” mechanism that lets the model choose, token by token, what to remember and what to forget.
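To make the idea of input-dependent parameters concrete, here is a minimal NumPy sketch of a selective recurrence. It is an illustration under stated assumptions, not the authors' implementation: the function name `selective_scan`, the projection matrices `W_delta`, `W_B`, and `W_C`, the diagonal state matrix `A`, and the simple discretization are all chosen here for clarity, and the plain Python loop ignores the hardware-aware tricks a real implementation relies on.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return np.logaddexp(0.0, z)


def selective_scan(x, A, W_delta, W_B, W_C):
    """Illustrative selective state-space recurrence (hypothetical names).

    x:       (L, D) input sequence of L tokens with D channels
    A:       (D, N) diagonal state matrix, assumed negative for stability
    W_delta: (D, D) projection producing a per-channel step size
    W_B:     (D, N) projection producing the input-to-state vector
    W_C:     (D, N) projection producing the state-to-output vector
    Returns y of shape (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))          # hidden state, one N-vector per channel
    y = np.empty((L, D))
    for t in range(L):            # one pass over the sequence: O(L) steps
        # These three quantities depend on the current token. In a
        # fixed-dynamics model they would be the same at every step.
        delta = softplus(x[t] @ W_delta)     # (D,)
        B = x[t] @ W_B                       # (N,)
        C = x[t] @ W_C                       # (N,)
        # Assumed discretization: exponential decay for A, step-scaled B.
        A_bar = np.exp(delta[:, None] * A)   # (D, N), entries in (0, 1)
        B_bar = delta[:, None] * B[None, :]  # (D, N)
        # Small delta keeps the old state (ignore the token);
        # large delta overwrites it (remember the token).
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C                         # (D, N) @ (N,) -> (D,)
    return y
```

In this sketch the step size `delta` acts as the gate: when it is near zero, `A_bar` is close to one and the token barely changes the state, and when it is large, the old state decays and the current token dominates. Replacing `delta`, `B`, and `C` with constants that do not depend on `x[t]` would give the fixed-dynamics behavior described above for earlier models.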
The authors paired this with a hardware-aware implementation and reported up to 5x higher inference throughput than Transformers of comparable size, with a 3-billion-parameter Mamba matching Transformers twice its size on language tasks, plus strong results in audio and genomics. Because its cost is linear, Mamba is especially attractive for very long sequences.
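A rough back-of-the-envelope count shows why linear cost matters at long lengths. The two helper functions below are hypothetical and count only token-to-token comparisons for attention and recurrent steps for a scan; they ignore constant factors, hidden dimensions, and hardware effects, so they are not a prediction of real throughput.

```python
def attention_pairs(n: int) -> int:
    # Every token is compared with every token: n * n pairs.
    return n * n


def scan_steps(n: int) -> int:
    # A recurrent sweep visits each token once: n steps.
    return n


for n in (1_000, 10_000, 100_000):
    # The gap between the two counts is itself proportional to n.
    print(n, attention_pairs(n), scan_steps(n))
```

Going from 1,000 to 10,000 tokens multiplies the pairwise count by 100 but the step count by only 10, which is the sense in which the advantage grows with sequence length.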
Mamba did not dethrone the Transformer, which remains the default for frontier models, but it kept the question of alternative architectures alive and spawned a wave of state-space and hybrid models. It is the clearest recent reminder that attention, for all its success, is not the only way to model a sequence.