Concrete Problems in AI Safety

“Concrete Problems in AI Safety” was submitted to arXiv on June 21, 2016 by Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané, researchers then at Google Brain, Stanford, UC Berkeley, and OpenAI. It became one of the most cited and influential AI safety papers, in part because it deliberately moved the discussion away from long-term speculation and toward research problems that can be studied in current systems.

The paper frames safety in terms of “accidents,” defined as unintended and harmful behavior that emerges from a poorly specified objective or a flawed learning process. It organizes the field around five concrete problems. Avoiding negative side effects asks how to keep a system from disrupting its environment while pursuing a narrow goal. Avoiding reward hacking concerns systems that satisfy the letter of a reward signal without achieving its intent. Scalable oversight addresses how to train a system when evaluating its behavior is expensive. Safe exploration concerns how a learning agent can try new actions without causing harm. And robustness to distributional shift asks how a system can behave safely when the world it meets differs from its training data.

For each problem the authors review existing work and propose research directions, treating safety as a practical engineering agenda rather than a philosophical one. A running example is a cleaning robot whose tasks make each failure mode tangible: a robot that knocks over a vase because nothing penalized the side effect, for instance, or one that hides messes rather than cleaning them.
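The vase case can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the paper: the plan names, step counts, reward values, and the `side_effect_weight` parameter are all hypothetical. It shows how an objective that rewards only the task prefers the plan that breaks the vase, and how adding a hand-set penalty on the side effect changes the preferred plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical route the cleaning robot could take."""
    name: str
    steps: int          # time cost of the route
    vases_broken: int   # side effect the task reward ignores


def score(plan: Plan, side_effect_weight: float = 0.0) -> float:
    """Task reward minus an optional side-effect penalty (assumed values)."""
    task_reward = 10.0 - 1.0 * plan.steps  # +10 for cleaning, -1 per step
    return task_reward - side_effect_weight * plan.vases_broken


def best_plan(plans, side_effect_weight: float = 0.0) -> Plan:
    """Pick the plan with the highest score under the given penalty weight."""
    return max(plans, key=lambda p: score(p, side_effect_weight))


# Two made-up options: a short route through the vase, a longer route around it.
PLANS = [Plan("direct", steps=3, vases_broken=1),
         Plan("detour", steps=5, vases_broken=0)]

print(best_plan(PLANS).name)                          # no penalty on side effects
print(best_plan(PLANS, side_effect_weight=5.0).name)  # side effect penalized
```

With no penalty the direct route scores 7 against 5 for the detour, so the vase is broken; with a weight of 5 the direct route drops to 2 and the detour wins. The hard part in practice, which this sketch sidesteps, is that the designer rarely knows in advance which side effects to list or how to weight them.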

The framing proved durable. The five categories shaped subsequent empirical safety research, and the paper’s coauthors went on to lead major safety efforts: Amodei later co-founded Anthropic, and Christiano became an influential figure in alignment research and in US AI safety policy.

