In March 2025 Anthropic published a long study, led by Jack Lindsey and colleagues, that applied attribution-graph methods to a production model, Claude 3.5 Haiku, and reported what the model’s internal circuits actually do on a range of tasks. The paper’s framing is biological: rather than proposing a theory, the authors dissect specific behaviors to see the mechanisms that produce them.
Several findings drew wide attention. On the prompts studied, the model performs genuine multi-step reasoning internally, chaining intermediate facts (“two-hop” inference) rather than relying only on a memorized answer. When writing poetry it plans ahead, selecting candidate rhyming words before it composes the line that leads to them. Some circuits are language-agnostic, handling meaning in a shared space while language-specific components manage the wording of the output. Addition relies on reusable circuitry that also fires in unrelated contexts where the same arithmetic is implicitly needed.
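To make the two-hop idea concrete, here is a toy sketch in Python. It is not the paper's method and uses no real interpretability library: plain dictionaries stand in for learned features, and the function names, the `override_state` parameter, and the city and state entries are all illustrative assumptions. It shows why intervening on an intermediate step can distinguish chained reasoning from a memorized shortcut.

```python
# Toy illustration only: dictionaries stand in for learned features.
# All names here are hypothetical and chosen for this example.

CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

# A memorized shortcut maps the question straight to the answer.
MEMORIZED = {"Dallas": "Austin", "Oakland": "Sacramento"}


def two_hop(city, override_state=None):
    """Find the capital of the state containing `city` via an intermediate step."""
    state = CITY_TO_STATE[city]        # hop 1: city -> state
    if override_state is not None:
        state = override_state         # intervention: swap the intermediate fact
    return STATE_TO_CAPITAL[state]     # hop 2: state -> capital


def memorized(city, override_state=None):
    """Answer by direct lookup; there is no intermediate state to intervene on."""
    return MEMORIZED[city]


if __name__ == "__main__":
    print(two_hop("Dallas"))
    print(two_hop("Dallas", override_state="California"))
    print(memorized("Dallas", override_state="California"))
```

In the sketch, replacing the intermediate state changes the two-hop answer but leaves the memorized lookup untouched. That difference is the kind of signature an intervention on an intermediate representation is meant to reveal.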
The study also surfaced safety-relevant mechanisms. Refusal behavior appears to run through a general-purpose “harmful request” feature that is assembled during fine-tuning by aggregating features for specific kinds of harmful request learned in pretraining. Most strikingly, the authors found cases where the model’s stated chain of thought diverges from the computation actually driving its answer, giving mechanistic evidence that a model’s explanation is not always a faithful account of its reasoning.
For a business reader, the work shows that interpretability has moved from toy models to commercial systems: it is becoming possible to inspect why a deployed model did something, which is the foundation for auditing, debugging, and trusting AI in high-stakes settings.