Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” was published by Anthropic on May 21, 2024. It took the sparse-autoencoder approach that had been demonstrated on a toy one-layer transformer in 2023 and applied it, for the first time, to a modern production large language model, using activations taken from the middle layer of Claude 3 Sonnet.

Using the same dictionary-learning technique at much larger scale, the team extracted millions of features, many of them interpretable. These ranged from concrete entities (specific cities, people, and chemical elements) to abstract notions such as bugs in code, sycophancy, deception, and gender bias. Anthropic described the result as the first detailed look inside a modern, production-grade large language model.
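To make the dictionary-learning idea concrete, here is a minimal sketch of a sparse autoencoder in plain Python. It is an illustration, not the paper's implementation: the class name `SparseAutoencoder`, the tiny dimensions, the random untrained weights, and the plain L1 sparsity penalty are all assumptions chosen for readability, and no training loop is included. The sketch shows only the core shape of the method: an activation vector is expanded into many non-negative feature activations, and those features are then used to reconstruct the original vector.

```python
import random


def relu(value):
    """Zero out negative values so feature activations stay non-negative."""
    return value if value > 0.0 else 0.0


class SparseAutoencoder:
    """Toy sparse autoencoder (hypothetical, untrained, for illustration only)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder: one row of weights per feature.
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder: one row of weights per activation dimension.
        self.W_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map a model activation vector to feature activations."""
        return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W_enc, self.b_enc)]

    def decode(self, features):
        """Reconstruct the activation vector from feature activations."""
        return [sum(w * fi for w, fi in zip(row, features)) + b
                for row, b in zip(self.W_dec, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction error plus an L1 penalty that encourages sparsity."""
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(features)  # already non-negative after ReLU
        return reconstruction + l1_coeff * sparsity


# Usage: a 4-dimensional "activation" expanded into 16 candidate features.
sae = SparseAutoencoder(d_model=4, n_features=16)
activation = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(activation)
print(len(features), "features,", sum(1 for f in features if f > 0.0), "active")
```

In a trained autoencoder, the sparsity penalty pushes most feature activations to zero for any given input, which is what makes each remaining active feature a candidate for a single, human-readable concept.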

The most important demonstration was causality. Artificially amplifying or suppressing a feature reliably changed the model’s behavior in the expected way: clamping a “Golden Gate Bridge” feature to a high value, for example, made the model fixate on the bridge across unrelated conversations. This showed that the features were not merely correlated with the model’s behavior but were components the model actually uses to compute its outputs, including features tied to safety-relevant behaviors.
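The clamping intervention can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: the function name `clamp_feature`, the three-dimensional activation, and the two hand-written decoder directions are hypothetical, and the real experiments operate on a full model during generation. The idea it shows is that forcing one feature to a target value amounts to shifting the activation along that feature's decoder direction by the difference between the target and the current value, leaving everything else untouched.

```python
def clamp_feature(activation, feature_acts, decoder_directions, index, target):
    """Return the activation with feature `index` forced to `target`.

    activation:          the model's activation vector at one position
    feature_acts:        current feature activations from the autoencoder
    decoder_directions:  one direction per feature, in activation space
    """
    delta = target - feature_acts[index]
    direction = decoder_directions[index]
    # Shift only along the chosen feature's direction; other content is kept.
    return [a + delta * d for a, d in zip(activation, direction)]


# Hypothetical toy values: two features in a 3-dimensional activation space.
activation = [1.0, 2.0, 3.0]
feature_acts = [0.5, 0.0]
decoder_directions = [
    [1.0, 0.0, 0.0],  # feature 0
    [0.0, 1.0, 0.0],  # feature 1, standing in for a "Golden Gate Bridge" feature
]

# Amplify feature 1 from 0.0 to 10.0.
print(clamp_feature(activation, feature_acts, decoder_directions, 1, 10.0))
# -> [1.0, 12.0, 3.0]

# Suppress feature 0 from 0.5 to 0.0.
print(clamp_feature(activation, feature_acts, decoder_directions, 0, 0.0))
# -> [0.5, 2.0, 3.0]
```

Feeding the shifted activation back into the rest of the model is what lets researchers test whether a feature causes a behavior rather than merely accompanying it.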

For a business reader, this is the moment interpretability moved from small lab demos toward something usable on real systems. The ability to find and adjust the internal concepts of a shipped model points toward concrete safety and auditing tools: ways to detect when a model is, for instance, being deceptive or biased by inspecting its internal activity rather than only its words.
