TCAV (Testing with Concept Activation Vectors)

“Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)” was submitted to arXiv on November 30, 2017 by a Google team led by Been Kim, with co-authors including Martin Wattenberg, Fernanda Viégas, and James Wexler, and was published at ICML 2018. It addressed a limitation of earlier explanation methods: they attribute a prediction to individual pixels or input features rather than to concepts a person can name.

TCAV’s core building block is the Concept Activation Vector, or CAV. To test a concept such as “stripes,” a user supplies a handful of example images of stripes and some random non-stripe images. A linear classifier is trained on the network’s internal activations at a chosen layer to separate the two sets; the direction perpendicular to its decision boundary, oriented toward the concept examples, is the CAV: a vector in the model’s hidden space that points toward the concept. TCAV then uses directional derivatives to measure how sensitive a class prediction is to moving the activations in that concept direction, and reports the fraction of a class’s inputs whose prediction increases along the CAV. The headline example asks how much the concept “stripes” matters to the model’s prediction of “zebra.”
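
The two steps above (fit a linear classifier on activations to get a CAV, then check the sign of the class logit’s directional derivative along it) can be sketched on toy data. This is a minimal illustration under loud assumptions, not the authors’ implementation: the 3-dimensional “activations,” the sample() helper, the hand-written zebra_logit function, and the contrasting “dots” concept are all hypothetical stand-ins, and finite differences replace the gradients a real framework would compute.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy data is reproducible


def sample(center, n=30, noise=0.3):
    """Draw n toy 'activation' vectors scattered around a center (hypothetical data)."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


def train_linear_classifier(pos, neg, steps=500, lr=0.1):
    """Fit logistic regression by batch gradient descent; return the weight vector."""
    dim = len(pos[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w


def concept_activation_vector(concept_acts, random_acts):
    """CAV: unit normal of the linear boundary, pointing toward the concept side."""
    w = train_linear_classifier(concept_acts, random_acts)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def zebra_logit(h):
    """Hypothetical stand-in for the layers that map activations to a class logit."""
    return 2.0 * h[0] - 2.0 * h[2] + 0.1 * h[0] * h[1]


def directional_derivative(logit_fn, act, direction, eps=1e-4):
    """Finite-difference estimate of how the logit changes along `direction`."""
    moved = [a + eps * d for a, d in zip(act, direction)]
    return (logit_fn(moved) - logit_fn(act)) / eps


def tcav_score(logit_fn, class_acts, cav):
    """Fraction of class examples whose logit rises when moved along the CAV."""
    rising = sum(directional_derivative(logit_fn, a, cav) > 0 for a in class_acts)
    return rising / len(class_acts)


# Toy setup: axis 0 plays the role of "stripes", axis 2 the role of "dots".
random_acts = sample([0.0, 0.0, 0.0])
stripes_cav = concept_activation_vector(sample([3.0, 0.0, 0.0]), random_acts)
dots_cav = concept_activation_vector(sample([0.0, 0.0, 3.0]), random_acts)

zebra_acts = sample([2.5, 0.0, 0.5])  # activations of hypothetical zebra inputs
stripes_score = tcav_score(zebra_logit, zebra_acts, stripes_cav)
dots_score = tcav_score(zebra_logit, zebra_acts, dots_cav)
print("stripes:", stripes_score, "dots:", dots_score)
```

In this toy, a score near 1 for “stripes” and near 0 for “dots” reflects only how zebra_logit was written. In practice the activations would come from a forward pass through a chosen layer of the trained network, and the derivatives from automatic differentiation rather than finite differences.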

The payoff is explanations in human terms (“this model relies heavily on the striped-texture concept for zebras”) and the ability to surface unintended dependencies, such as a model leaning on gender or race concepts for an unrelated task. Because the user defines the concept after training with just a few examples, TCAV requires neither concept labels throughout the dataset nor retraining of the model.

Why business readers should care: TCAV lets domain experts ask, in language a non-engineer can state, whether a model uses the factors they care about and whether it secretly relies on ones it should not, such as a protected attribute.

Last verified June 7, 2026