Mapping the Mind of a Large Language Model

On 21 May 2024 Anthropic published “Mapping the Mind of a Large Language Model,” reporting that its interpretability team had extracted millions of interpretable features from Claude 3 Sonnet, one of the company’s deployed production models. The work, detailed in the accompanying paper “Scaling Monosemanticity,” used a sparse autoencoder (a form of dictionary learning) to decompose the model’s internal activations into features that correspond far more closely to single human-interpretable concepts than the raw neurons do. This was the first time the method, developed earlier on small toy models, had been shown to scale to a model at the frontier of commercial deployment.
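The decomposition described above can be illustrated with a toy sparse autoencoder. This is a minimal sketch, not Anthropic's implementation: the class name, the tiny dimensions, the random untrained weights, and the plain L1 sparsity penalty are all assumptions chosen for readability. It shows the general shape of the technique: an activation vector is encoded into a wider, non-negative feature vector, decoded back as a weighted sum of per-feature directions, and scored by reconstruction error plus a penalty that favors few active features.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySparseAutoencoder:
    """Toy SAE: activation -> wide non-negative feature code -> reconstruction."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Hypothetical random weights; a real SAE learns these by training.
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Each decoder row is one dictionary direction, i.e. one feature.
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        """Linear map plus ReLU, so feature activations are never negative."""
        pre = matvec(self.w_enc, activation)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Rebuild the activation as a weighted sum of feature directions."""
        out = list(self.b_dec)
        for value, direction in zip(features, self.w_dec):
            for i, d in enumerate(direction):
                out[i] += value * d
        return out

    def loss(self, activation, l1_coeff=0.1):
        """Squared reconstruction error plus an L1 sparsity penalty (assumed form)."""
        features = self.encode(activation)
        recon = self.decode(features)
        error = sum((a - r) ** 2 for a, r in zip(activation, recon))
        return error + l1_coeff * sum(features)


# Illustrative input: a made-up 4-dimensional "activation".
sae = ToySparseAutoencoder(d_model=4, n_features=16)
activation = [0.3, -1.2, 0.8, 0.1]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

Training would adjust the weights to minimize this loss over many activations; the sparsity term is what pushes each input to be explained by a small number of features.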

The extracted features ranged from concrete entities - cities, people, chemical elements - to abstract notions such as bugs in code, gender bias, and the idea of keeping a secret. Anthropic reported that features activated on the same concept across different languages and even across images, and that the internal organization of concepts corresponded, at least somewhat, to human notions of similarity. The team also located features tied to potentially harmful behavior, such as scams, software backdoors, and sycophancy, while emphasizing that the work surfaced capabilities the model already had rather than adding new ones.

The most vivid demonstration was “Golden Gate Claude.” The researchers found a feature that fires on mentions of the Golden Gate Bridge, then artificially amplified it. The altered model began to fixate on the bridge, even answering a question about its physical form by claiming to be the Golden Gate Bridge itself. The demonstration made concrete a longstanding goal of interpretability: not just observing which feature corresponds to a concept, but intervening on it to change behavior.
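The intervention can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the researchers' code: it assumes a feature corresponds to a direction in activation space and that amplifying it means clamping its activation to a chosen value and shifting the model's activation along that direction by the difference. The function name, the vectors, and the numbers are hypothetical.

```python
def steer(activation, direction, current_value, target_value):
    """Shift an activation so that one feature reads as target_value.

    activation:    internal activation vector (list of floats)
    direction:     the feature's direction in activation space (same length)
    current_value: the feature's activation before the intervention
    target_value:  the value the feature is clamped to
    """
    delta = target_value - current_value
    return [a + delta * d for a, d in zip(activation, direction)]


# Hypothetical 4-dimensional example.
activation = [0.5, -0.25, 1.0, 0.0]
bridge_direction = [0.0, 1.0, 0.0, 0.0]  # stand-in for a learned feature direction
steered = steer(activation, bridge_direction, current_value=0.5, target_value=8.0)
```

Only the component along the chosen direction changes; the rest of the activation is left as it was, which is why such an edit can alter one aspect of behavior while the model remains otherwise functional.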

The result is widely regarded as an empirical landmark for interpretability, showing that the approach can move from laboratory models to systems people actually use. It strengthened the case that understanding a model’s internals could become a practical tool for auditing and steering behavior - a prerequisite for safety in high-stakes settings - rather than remaining a purely academic exercise.

Note on sourcing: this summary draws on the Anthropic announcement page. The “Scaling Monosemanticity” paper at transformer-circuits.pub, a single very large interactive page (about 16 MB), is the canonical primary record of the technical work.