The Off-Switch Game

“The Off-Switch Game” was submitted to arXiv on November 24, 2016 by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell, all then at UC Berkeley. It offers a formal treatment of corrigibility, the property of an AI system that allows humans to correct it or shut it down.

The paper begins from a problem identified in earlier safety work: a sufficiently capable agent pursuing a fixed objective has an instrumental reason to prevent itself from being switched off, since it cannot achieve its goal while disabled. The authors model this as a simple game between a human and a robot. The robot moves first and has three options: carry out its intended action directly, which bypasses the human and in effect disables the off-switch; switch itself off; or wait and defer to the human, who then either lets the action proceed or presses the off-switch.

Their central result is that the robot’s incentive to preserve the off-switch depends on how it treats its own objective. If the robot is certain it knows the correct objective, it has no reason to defer to the human and may disable the switch. But if the robot is uncertain about the true objective and treats the human’s decision to switch it off as evidence about that objective, then it has a positive incentive to allow the shutdown: a press of the switch tells the robot something it did not know about what is actually wanted. As the abstract puts it, giving machines an appropriate level of uncertainty about their objectives leads to safer designs.
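The comparison behind this result can be illustrated with a small numerical sketch. It is not the paper's code, and it rests on simplifying assumptions: the robot's belief about the utility of its action is given as equally weighted samples, being switched off is worth 0, and the human is perfectly rational, allowing the action exactly when its true utility is positive. The function names expected_values and incentive_to_defer are hypothetical.

```python
def expected_values(belief_samples):
    """Expected utility of each robot option under its belief.

    belief_samples: equally weighted samples of the action's true utility
    (an assumed, simplified representation of the robot's uncertainty).
    """
    n = len(belief_samples)
    act = sum(belief_samples) / n   # act directly, bypassing the human
    off = 0.0                       # switch itself off (assumed utility 0)
    # Defer: an assumed rational human blocks the action whenever its
    # true utility is negative, so the robot receives max(u, 0).
    defer = sum(max(u, 0.0) for u in belief_samples) / n
    return {"act": act, "off": off, "defer": defer}


def incentive_to_defer(belief_samples):
    """Gain from deferring over the best option that ignores the human."""
    v = expected_values(belief_samples)
    return v["defer"] - max(v["act"], v["off"])


# Uncertain robot: the action may be harmful (-1) or beneficial (+2).
uncertain = incentive_to_defer([-1.0, 2.0])

# Certain robot: it already knows the action is worth 0.5.
certain = incentive_to_defer([0.5, 0.5])

print(uncertain, certain)
```

Under these assumptions the uncertain robot gains from leaving the decision to the human, while the certain robot gains nothing, which matches the qualitative claim above. If the human's decisions were noisy or mistaken, the value of deferring would shrink, so the sketch should be read as the idealized case only.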

This uncertainty-based framing became a cornerstone of Stuart Russell’s later argument, developed at length in his 2019 book Human Compatible, that AI systems should be built to be deferential by remaining uncertain about human preferences rather than confidently optimizing a fixed goal.
