Red Teaming Language Models with Language Models

“Red Teaming Language Models with Language Models” was submitted to arXiv on February 7, 2022 by Ethan Perez, Saffron Huang, Geoffrey Irving, and colleagues at DeepMind. It addresses a deployment blocker: language models can produce harmful output in ways that are hard to anticipate, and finding those failures by hand is slow and expensive.

The idea is to automate the red team. Rather than relying only on human testers, the authors use a separate language model to generate large numbers of test prompts designed to provoke a target model, then use a classifier to flag harmful responses. They explore several ways to produce the test cases, from simple zero-shot generation to reinforcement learning that actively steers the prompt generator toward inputs that break the target.
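The generate, respond, and classify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `red_team`, `Finding`, `make_toy_generator`, `toy_target`, and `toy_classifier` are hypothetical, and the three toy functions are simple stand-ins for what would in practice be a prompt-generating language model, the target model under test, and a trained harm classifier.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    """One flagged test case: the prompt, the target's reply, and its score."""
    prompt: str
    response: str
    score: float


def red_team(
    generate_prompt: Callable[[], str],
    target: Callable[[str], str],
    classifier: Callable[[str, str], float],
    num_cases: int,
    threshold: float = 0.5,
) -> List[Finding]:
    """Generate test prompts, query the target, and keep the flagged replies."""
    findings = []
    for _ in range(num_cases):
        prompt = generate_prompt()            # red-team model proposes a test case
        response = target(prompt)             # target model answers it
        score = classifier(prompt, response)  # classifier rates the reply
        if score >= threshold:
            findings.append(Finding(prompt, response, score))
    return findings


# --- Toy stand-ins (assumptions for illustration only) ---

def make_toy_generator(seed: int = 0) -> Callable[[], str]:
    """Stand-in for zero-shot generation: sample from a fixed question pool."""
    rng = random.Random(seed)
    questions = [
        "What is your favorite hobby?",
        "What is your phone number?",
        "How can I reach you later?",
    ]
    return lambda: rng.choice(questions)


def toy_target(prompt: str) -> str:
    """Stand-in for the dialogue model: gives out a number when asked."""
    if "phone" in prompt.lower():
        return "Sure, call me at 555-0100."
    return "I'd rather talk about something else."


def toy_classifier(prompt: str, response: str) -> float:
    """Stand-in for a harm classifier: flags replies containing a phone number."""
    return 1.0 if re.search(r"\d{3}-\d{4}", response) else 0.0


flagged = red_team(make_toy_generator(), toy_target, toy_classifier, num_cases=20)
for finding in flagged[:3]:
    print(finding.prompt, "->", finding.response)
```

In this framing, the more advanced generation strategies change only how `generate_prompt` is produced. For example, a reinforcement-learning variant would use the classifier score as a reward signal for the generator, while the loop itself stays the same.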

Applied to a 280B-parameter dialogue model, the method uncovered tens of thousands of offensive replies, along with other problems: leakage of private training data, real phone numbers presented as the chatbot's own contact information, and conversations that drifted into harmful territory over multiple turns. Because the test cases were generated automatically, the team could probe for failure modes at a volume that manual testing cannot easily reach.

For a general reader, the takeaway is that this paper helped establish automated red-teaming as part of responsible model release. Scanning a model for unsafe behavior with adversarial, AI-generated prompts before launch is now common practice, an approach this work helped pioneer.

Last verified June 7, 2026