A Mathematical Framework for Transformer Circuits

“A Mathematical Framework for Transformer Circuits” was published on December 22, 2021 by an Anthropic team including Nelson Elhage, Neel Nanda, Catherine Olsson, Chris Olah, and colleagues, as the founding article of the Transformer Circuits Thread. It set out a way to read a transformer not as an opaque pile of weights but as a set of interpretable computational paths.

The framework starts with the simplest cases and builds up. A zero-layer transformer can only model bigram statistics: which token tends to follow which. A one-layer attention-only model can be decomposed into a sum of independent paths from input tokens to output logits. The decisive insight comes with two layers, where attention heads can compose: the output of a first-layer head feeds the queries, keys, or values of a second-layer head, enabling algorithms the single-layer model cannot express. From this composition the authors identified the induction head, a circuit that finds an earlier occurrence of the current token and copies what followed it, giving the model a mechanism for in-context pattern completion.
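The behavior attributed to an induction head can be illustrated with a few lines of plain Python. This is only a sketch of the input-output rule described above, not of the attention mechanism that implements it in a real model; the function name `induction_prediction` and the choice to use the most recent earlier match are assumptions made for illustration.

```python
from typing import Optional, Sequence


def induction_prediction(tokens: Sequence[str]) -> Optional[str]:
    """Illustrative rule: find an earlier occurrence of the current (last)
    token and return the token that followed it.

    Hypothetical helper for illustration only. A real induction head does
    this softly, through attention weights, rather than by exact lookup.
    """
    if len(tokens) < 2:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest, skipping the
    # final position itself (it has no successor yet).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy what followed the earlier match
    return None  # no earlier occurrence, so nothing to copy
```

For example, `induction_prediction(["A", "B", "C", "A"])` returns `"B"`: the earlier `A` was followed by `B`, so `B` is the predicted continuation.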

Methodologically, the paper showed that attention can be rewritten in terms of query-key and output-value circuits, exposing a large amount of clean linear structure that can be analyzed directly rather than probed statistically. This reframing made it possible to treat interpretability as reverse-engineering, with specific, testable claims about what each component computes.
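The query-key and output-value rewriting can be checked numerically. The NumPy sketch below builds one attention head with random weights and confirms that computing it the usual way (separate queries, keys, and values) gives the same result as using the two combined matrices. All names, shapes, and the random weights are assumptions for illustration; layer norm, biases, and multiple heads are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 2  # illustrative sizes, not from the paper

x = rng.normal(size=(n_tokens, d_model))   # residual stream, one row per token
W_Q = rng.normal(size=(d_head, d_model))   # query projection
W_K = rng.normal(size=(d_head, d_model))   # key projection
W_V = rng.normal(size=(d_head, d_model))   # value projection
W_O = rng.normal(size=(d_model, d_head))   # output projection


def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax where each position attends only to itself and earlier positions."""
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Standard formulation: explicit queries, keys, and values.
q, k, v = x @ W_Q.T, x @ W_K.T, x @ W_V.T
attn = causal_softmax(q @ k.T / np.sqrt(d_head))
out_standard = attn @ v @ W_O.T

# Factored formulation: two combined low-rank matrices.
W_QK = W_Q.T @ W_K   # (d_model, d_model): decides where to attend
W_OV = W_O @ W_V     # (d_model, d_model): decides what is moved once attended
attn_factored = causal_softmax(x @ W_QK @ x.T / np.sqrt(d_head))
out_factored = attn_factored @ x @ W_OV.T

same_output = np.allclose(out_standard, out_factored)
```

Here `same_output` is `True`: apart from the softmax, the head is linear, and its behavior is determined by `W_QK` and `W_OV` rather than by the four projection matrices individually.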

For a business reader, this is the paper that turned “we cannot understand what is inside these models” into a research program with results. The induction heads it described and the methods it introduced underpin much of the interpretability work that followed at Anthropic and across the field.

