This is the companion methods paper to “On the Biology of a Large Language Model,” published by Anthropic in March 2025. It explains the technique used to produce the attribution graphs in that study: a way to trace, step by step, how features inside a language model combine to produce a given output.
The core idea is to build a “replacement model” in which the dense, hard-to-interpret MLP layers are swapped for cross-layer transcoders (CLTs). A CLT decomposes the model’s activations into sparsely active, largely interpretable features. Each feature reads from the residual stream at one layer and can write to the MLP outputs of that layer and every later one. Because the substituted components interact more simply, the authors can construct an attribution graph whose nodes are features and whose edges show how one feature drives another toward the final answer.
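As a rough illustration of the cross-layer idea, the sketch below has each feature read from the residual stream at a single layer and write to the MLP output of that layer and every later one. This is a toy under stated assumptions, not the paper's implementation: the sizes, the random weights, the plain ReLU, and the names `W_enc`, `W_dec`, and `clt_forward` are all made up for the example, and the residual-stream inputs are simply given, not computed by a surrounding model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_feats = 3, 8, 16  # toy sizes, chosen only for illustration

# One encoder per layer: features at layer l read only from the residual stream at layer l.
W_enc = [rng.normal(size=(n_feats, d_model)) for _ in range(n_layers)]

# One decoder per (source layer, target layer) pair with target >= source:
# features at layer s may write to the MLP output of layer s and of every later layer.
W_dec = {
    (s, t): 0.1 * rng.normal(size=(d_model, n_feats))
    for s in range(n_layers)
    for t in range(s, n_layers)
}

def relu(x):
    # Assumption: a plain ReLU stands in for whatever nonlinearity the real CLT uses.
    return np.maximum(x, 0.0)

def clt_forward(resid):
    """resid: list of residual-stream vectors, one per layer.

    Returns (feature activations per layer, reconstructed MLP output per layer).
    """
    acts = [relu(W_enc[l] @ resid[l]) for l in range(n_layers)]
    mlp_out = []
    for t in range(n_layers):
        # The output at layer t sums contributions from features at layers 0..t.
        mlp_out.append(sum(W_dec[(s, t)] @ acts[s] for s in range(t + 1)))
    return acts, mlp_out

# Usage: random residual-stream inputs, one vector per layer.
resid = [rng.normal(size=d_model) for _ in range(n_layers)]
acts, mlp_out = clt_forward(resid)
```

The cross-layer structure lives entirely in the keys of `W_dec`: a layer-0 feature influences the reconstructed output at layer 2 directly, without being routed through intermediate features.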
To build a graph for a given prompt, the method freezes attention patterns and normalization, so that feature interactions become linear. To check that the resulting graphs reflect the real model and not just the replacement, the authors run perturbation experiments: intervening on a feature should change the output in the way the graph predicts. The paper includes case studies on acronym generation, factual recall, and arithmetic, with quantitative measures of how faithful and interpretable the resulting graphs are.
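To make the linearity point concrete, here is a minimal sketch, assuming everything between the source features and one target feature has been frozen into a simple sum on the residual stream. Under that assumption, the edge from a source feature to the target is the source activation times the dot product of its write vector with the target's read vector. All names (`x`, `a`, `D`, `w_enc`, `target_preact`) and values are hypothetical. In this toy the perturbation matches the edge weight exactly; in the paper, perturbations are run against the real model, where agreement is only approximate.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_src = 6, 4  # toy sizes, chosen only for illustration

x = rng.normal(size=d_model)            # direct input on the residual stream (e.g. an embedding)
a = np.array([0.0, 1.5, 0.7, 2.0])      # source-feature activations on this prompt
D = rng.normal(size=(n_src, d_model))   # row s: write (decoder) vector of source feature s
w_enc = rng.normal(size=d_model)        # read (encoder) vector of one target feature

def target_preact(acts):
    # With nonlinear pieces frozen, the target reads a linear sum of what was written upstream.
    resid = x + acts @ D
    return w_enc @ resid

# Edge weight s -> target: activation times (write vector . read vector).
edges = a * (D @ w_enc)
input_edge = w_enc @ x

# Linearity means the edges account for the target pre-activation exactly.
total_from_edges = edges.sum() + input_edge

# Perturbation check: ablating source feature s should lower the target by edges[s].
s = 3
ablated = a.copy()
ablated[s] = 0.0
observed_drop = target_preact(a) - target_preact(ablated)
```

An inactive source feature (index 0 here) gets an edge weight of zero, which is why only features that are active on the prompt appear in the graph.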
For practitioners, this is the plumbing behind modern circuit analysis. It turns “why did the model say that” from a guessing game into a traceable computation, which matters for anyone who needs to audit or certify model behavior.