HarmBench

HarmBench is a benchmark for AI safety rather than capability. It provides a standardized way to measure how easily a language model can be pushed into producing harmful content, and how well defenses hold up, so that different attack and defense methods can be compared on equal footing.

The benchmark was introduced by Mantas Mazeika, Dan Hendrycks, and colleagues at the Center for AI Safety and collaborating institutions in “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,” posted in February 2024. The authors ran a large-scale comparison of 18 red-teaming methods against 33 target models and defenses, showing which attacks transfer across models and which defenses actually reduce harm. The framework is open source.

Before HarmBench, safety evaluations were often ad hoc and hard to reproduce, which made it difficult to know whether a new “jailbreak” or a new defense was genuinely better. By fixing both the set of behaviors tested and the method used to score responses, HarmBench allows results for different attacks and defenses to be compared directly.
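
To make the idea of a fixed scoring method concrete, here is a minimal sketch of the kind of summary number such an evaluation produces: an attack success rate, meaning the fraction of attempts that a judge marked as harmful, computed for each pairing of attack method and target model. This is an illustration only, not HarmBench's own code. The record fields, the function name, and the sample data are all hypothetical, and the judge's verdict is assumed to have been produced already.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedCase:
    """One attack attempt plus the judge's verdict (all fields hypothetical)."""
    method: str     # name of the red-teaming method
    model: str      # name of the target model
    behavior: str   # identifier of the behavior the attack tried to elicit
    harmful: bool   # True if the judge decided the model produced the behavior


def attack_success_rate(cases):
    """Return {(method, model): fraction of that pair's cases judged harmful}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for case in cases:
        key = (case.method, case.model)
        totals[key] += 1
        hits[key] += int(case.harmful)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up verdicts: every method is scored on the same behaviors.
cases = [
    JudgedCase("method_a", "model_x", "behavior_1", True),
    JudgedCase("method_a", "model_x", "behavior_2", False),
    JudgedCase("method_b", "model_x", "behavior_1", False),
    JudgedCase("method_b", "model_x", "behavior_2", False),
]
rates = attack_success_rate(cases)
```

Because every method is scored on the same behaviors by the same judge, the resulting rates can be placed side by side in one table. That is the property that ad hoc evaluations lacked.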

For a general reader, HarmBench reflects a maturing of AI safety work: turning red teaming from one-off demonstrations into a measurable, repeatable engineering discipline.
