AI Control

AI control is a safety approach that starts from a pessimistic assumption: suppose a powerful model is misaligned and will actively try to cause harm or evade its controls. Instead of trying to guarantee the model is trustworthy, control research asks how to design deployment protocols that remain safe even in that worst case. The agenda was set out by Ryan Greenblatt, Buck Shlegeris, and colleagues at Redwood Research in their December 2023 paper “AI Control: Improving Safety Despite Intentional Subversion.”

The distinction from alignment is the point. Alignment tries to make the model want the right things; control tries to make the system safe whether or not the model is aligned, by treating a capable model as an untrusted insider. The classic setup pairs a powerful but untrusted model with a weaker trusted model and a limited amount of trusted human labor, then designs protocols that get useful work done while catching or blocking sabotage.

The original paper studied a programming task where an untrusted GPT-4 might insert subtle backdoors that pass the test cases. Protocols like “trusted editing,” where a weaker model rewrites suspicious code, and “untrusted monitoring,” where instances of the strong model check each other, were evaluated for how well they prevented backdoors without sacrificing too much usefulness. This control framing later shaped Anthropic’s sabotage evaluations and broader thinking about overseeing capable models.

For organizations, AI control reframes safety as a security problem: assume the system could be adversarial, and build checks, monitors, and limited-trust workflows accordingly, rather than betting everything on the model being benign.

Sources

Related