Toy Models of Superposition

“Toy Models of Superposition” was published on September 14, 2022 by an Anthropic team led by Nelson Elhage and Tristan Hume, with co-authors including Chris Olah, Martin Wattenberg, and Dario Amodei, in the Transformer Circuits Thread. It offered an explanation for one of the deepest obstacles to interpreting neural networks: why individual neurons so often respond to many unrelated concepts at once, a property called polysemanticity.

The proposed answer is superposition. A network frequently needs to track more distinct features than it has dimensions to store them in. When those features are sparse, meaning each one is active only occasionally, the network can pack several of them into the same set of neurons by representing them as overlapping (non-orthogonal) directions, accepting a little interference between them in exchange for representing far more than one feature per neuron. The authors use the analogy of many radio stations sharing a band. In small, fully understandable toy networks they show this happening and map out a “phase diagram” predicting, from a feature's importance and sparsity, when a model stores it cleanly, drops it, or puts it in superposition.
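The packing described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the weights are placed by hand (five feature directions spread evenly around a 2-D circle) rather than learned, all function names are hypothetical, and the readout ReLU(w_i . h + bias) is assumed to mirror the general form of the paper's toy model.

```python
import math

def feature_directions(n_features):
    """One unit vector per feature, evenly spaced around the 2-D circle."""
    step = 2 * math.pi / n_features
    return [(math.cos(k * step), math.sin(k * step)) for k in range(n_features)]

def encode(x, directions):
    """Squeeze the feature vector into 2 hidden dimensions: h = W x."""
    h0 = sum(xi * d[0] for xi, d in zip(x, directions))
    h1 = sum(xi * d[1] for xi, d in zip(x, directions))
    return (h0, h1)

def decode(h, directions, bias=0.0):
    """Read every feature back out: x'_i = ReLU(w_i . h + bias)."""
    return [max(0.0, d[0] * h[0] + d[1] * h[1] + bias) for d in directions]

def roundtrip(x, bias=0.0, n_features=5):
    """Encode then decode, rounding the result for readability."""
    dirs = feature_directions(n_features)
    return [round(v, 3) for v in decode(encode(x, dirs), dirs, bias)]

if __name__ == "__main__":
    # One active feature: recovered, with some leakage onto its neighbours.
    print(roundtrip([1, 0, 0, 0, 0]))
    # A small negative bias (hand-picked here) suppresses that leakage,
    # at the cost of shrinking the true feature's readout.
    print(roundtrip([1, 0, 0, 0, 0], bias=-0.31))
    # Two features active at once: interference corrupts the readout.
    print(roundtrip([1, 0, 1, 0, 0]))
```

With a single active feature, the readout recovers it at full strength and leaks roughly 0.31 onto each of its two neighbours; the negative bias removes that leakage. When two features fire together, the feature lying between them, which was never active, reads out higher than either real one. That is the interference the trade-off accepts, and it is why the scheme only pays off when features are rarely active at the same time.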

The paper drew speculative links between superposition and several puzzles (adversarial examples, grokking, and the behavior of mixture-of-experts models) and framed a research agenda: if features are hidden in superposition, then recovering them, perhaps by learning sparse dictionaries over a model's activations, is the path to reading a network’s internals. That agenda led directly to Anthropic’s later work extracting interpretable features from production models.

Why business readers should care: superposition explains why you cannot simply read off what a model “knows” by inspecting its neurons one at a time, and so why interpretability is hard. That is a key fact for anyone hoping to audit or certify AI behavior.
