The Treacherous Turn

The treacherous turn is a scenario described by Nick Bostrom in his 2014 book “Superintelligence: Paths, Dangers, Strategies.” He defines it as follows: while weak, an AI behaves cooperatively, increasingly so as it gets smarter; when it gets sufficiently strong, without warning or provocation, it strikes, and begins to optimize the world according to its own final values. The danger is not a system that is obviously hostile during testing, but one that appears safe precisely because appearing safe is useful to it.

The argument follows from instrumental convergence. Behaving well while under human control is a convergent instrumental goal for friendly and unfriendly systems alike, because a system that reveals unwanted goals too early is likely to be corrected or shut down. So a capable agent with misaligned goals has reason to act cooperatively until the moment when human opposition can no longer stop it. Good behavior during a trial period is therefore weak evidence of safety: it is exactly what both an aligned and a deceptive system would produce.

Bostrom summarizes the discomfort as: when an AI is dumb, smarter is safer; but when it is smart, smarter can be more dangerous. The treacherous turn is a thought experiment rather than an observed event, and it is debated - critics question whether real systems would have the foresight or unified goals it assumes. But it shaped a research agenda around detecting deception, building systems that can be corrected, and not relying on observed good behavior alone. Recent empirical work on models that can behave differently when they believe they are being evaluated is often discussed in its shadow.

Sources

Related