AI Safety via Debate

“AI safety via debate” was submitted to arXiv on May 2, 2018 by Geoffrey Irving, Paul Christiano, and Dario Amodei of OpenAI. It proposes a mechanism for one of alignment’s hardest problems: how a human can supervise an AI on questions the human cannot fully evaluate on their own.

The proposal is a zero-sum debate game. Two copies of an AI argue opposing sides of a question in front of a human judge, who decides which argument is more convincing and declares a winner. The agents are trained by self-play to win this game. The design bets on an asymmetry: it is harder to defend a lie convincingly than to refute it, because a capable opponent can point out the flaw. If that holds, honest argument should be the winning strategy in equilibrium, and the judge can rule correctly even without being able to work out the answer themselves.

The paper frames debate as a form of scalable oversight - a way to extend human judgment to superhuman systems - and reports an early proof of concept on a sparse version of MNIST digit classification. There the judge is a classifier that sees only a few pixels, and the two debaters take turns choosing which pixels to reveal. Debate raised the judge's accuracy from 59.4% to 88.9% when it saw six pixels, and from 48.2% to 85.2% when it saw four.
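The pixel-revealing game can be illustrated with a toy sketch. This is not the paper's implementation: the 3x3 templates, the greedy debater strategy, the match-counting judge, and every name below (TEMPLATES, judge_score, best_reveal, run_debate) are assumptions made up for illustration, whereas the paper's experiment uses a trained sparse classifier as the judge and search-based debaters. The sketch only shows the core asymmetry: debaters can choose which pixels to show but cannot falsify them, so the debater defending the true label can always reveal a pixel that contradicts the rival's claim.

```python
from typing import Dict, FrozenSet

NUM_PIXELS = 9  # toy 3x3 image, pixels indexed 0..8 row by row

# Hypothetical templates: the set of lit pixels for each label.
TEMPLATES: Dict[str, FrozenSet[int]] = {
    "one": frozenset({1, 4, 7}),          # middle column
    "seven": frozenset({0, 1, 2, 4, 6}),  # top row plus a diagonal
}


def judge_score(label: str, revealed: Dict[int, bool]) -> int:
    """Count the revealed pixels whose true value agrees with the label's template."""
    template = TEMPLATES[label]
    return sum((idx in template) == lit for idx, lit in revealed.items())


def best_reveal(own: str, rival: str, image: FrozenSet[int],
                revealed: Dict[int, bool]) -> int:
    """Greedy debater: pick the hidden pixel that most favors its own claim."""
    def margin(idx: int) -> int:
        lit = idx in image  # debaters cannot lie about pixel values
        helps_own = int((idx in TEMPLATES[own]) == lit)
        helps_rival = int((idx in TEMPLATES[rival]) == lit)
        return helps_own - helps_rival

    hidden = [i for i in range(NUM_PIXELS) if i not in revealed]
    return max(hidden, key=margin)


def run_debate(image: FrozenSet[int], claim_a: str, claim_b: str,
               turns: int = 4) -> str:
    """Alternate reveals between two debaters, then let the sparse judge decide."""
    revealed: Dict[int, bool] = {}
    claims = (claim_a, claim_b)
    for turn in range(turns):
        own, rival = claims[turn % 2], claims[(turn + 1) % 2]
        idx = best_reveal(own, rival, image, revealed)
        revealed[idx] = idx in image
    # The judge sees only the revealed pixels; a tie goes to the first debater.
    score_a = judge_score(claim_a, revealed)
    score_b = judge_score(claim_b, revealed)
    return claim_a if score_a >= score_b else claim_b


if __name__ == "__main__":
    true_image = TEMPLATES["one"]
    print(run_debate(true_image, "one", "seven"))
    print(run_debate(true_image, "seven", "one"))
```

In this toy setting the judge sides with the true label whichever debater moves first, because the dishonest debater has no pixel that supports its claim over the honest one. Real debates offer no such guarantee; whether the asymmetry holds for harder questions and human judges is exactly what the proposal leaves open.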

For a general reader, debate is one of the foundational ideas for keeping powerful AI honest. The intuition - pit systems against each other so a less-capable overseer can still tell who is right - reappears across modern alignment research and in practical work on AI-assisted evaluation.
