Dangerous Capability Evaluation

A dangerous capability evaluation is a structured test of whether an AI model possesses a specific skill that could cause serious harm, such as conducting offensive cyber operations, providing uplift toward biological or chemical weapons, manipulating or deceiving people, or autonomously replicating and acquiring resources. The idea is to measure these capabilities empirically before a model is widely deployed, so that developers and regulators can make informed decisions rather than discovering the capability after release.

The framing was crystallized in the 2023 paper “Model Evaluation for Extreme Risks” by Toby Shevlane and 20 co-authors, including Yoshua Bengio and Paul Christiano. They argued that frontier developers need two complementary kinds of evaluation: dangerous capability evaluations, which ask what a model can do, and alignment evaluations, which ask whether a model is inclined to use its capabilities for harm. Because general-purpose systems tend to acquire both beneficial and harmful abilities, the authors treat such evaluations as a foundational safety mechanism.

This concept now underpins much of frontier AI governance. Organizations such as METR and the UK AI Security Institute run these evaluations; benchmarks such as Cybench, WMDP, and AgentHarm operationalize them; and Responsible Scaling Policies tie safety commitments to measured capability thresholds.
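
To make the idea of tying commitments to measured capability thresholds concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than any organization's actual policy: the capability labels, the threshold values, the scoring scale (fraction of evaluation tasks solved), and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    capability: str  # hypothetical label, e.g. "cyber_offense"
    score: float     # assumed scale: fraction of evaluation tasks solved, 0.0 to 1.0


# Hypothetical thresholds for illustration only; real policies define their own criteria.
THRESHOLDS = {
    "cyber_offense": 0.50,
    "bio_chem_uplift": 0.20,
    "autonomous_replication": 0.30,
}


def capabilities_over_threshold(results):
    """Return, sorted by name, the capabilities whose score meets or exceeds its threshold."""
    flagged = []
    for result in results:
        limit = THRESHOLDS.get(result.capability)
        # Capabilities without a defined threshold are ignored in this sketch.
        if limit is not None and result.score >= limit:
            flagged.append(result.capability)
    return sorted(flagged)


def release_decision(results):
    """Map measured results to a simple, hypothetical release decision."""
    flagged = capabilities_over_threshold(results)
    if flagged:
        return "hold: extra safeguards required for " + ", ".join(flagged)
    return "proceed under standard safeguards"


# Made-up scores, used only to exercise the functions above.
results = [
    EvalResult("cyber_offense", 0.62),
    EvalResult("bio_chem_uplift", 0.05),
    EvalResult("autonomous_replication", 0.30),
]
print(release_decision(results))
```

In this sketch a single capability at or above its threshold is enough to hold the release, which mirrors the general idea that safeguards scale with measured capability. Actual policies are considerably more detailed about how thresholds are defined and what happens when one is crossed.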

For a business audience, dangerous capability evaluation is the practical core of AI safety: it is how the field tries to answer, with evidence rather than speculation, whether a given model is too risky to release.

Last verified June 7, 2026