“The Illusion of Thinking” is a June 2025 paper by Apple researchers including Iman Mirzadeh, Mehrdad Farajtabar, and Samy Bengio, accepted to NeurIPS 2025. It studies large reasoning models, the class of systems that generate long chains of thought before answering, by testing them on controllable puzzle environments such as Tower of Hanoi and river-crossing problems, where difficulty can be raised in controlled steps (for example, by adding disks). Using puzzles rather than standard benchmarks lets the authors reduce the risk of training-data contamination and measure exactly how performance changes as a problem gets harder.
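To make the idea of a controllable puzzle environment concrete, here is a minimal Python sketch of Tower of Hanoi with a single difficulty knob, the number of disks. It is an illustration only, not the paper's evaluation code; the names `solve_hanoi` and `is_valid_solution` and the peg numbering are assumptions made for this example.

```python
from typing import List, Tuple

# Illustrative convention (an assumption): pegs are numbered 0, 1, 2 and a
# move is a (source peg, destination peg) pair.
Move = Tuple[int, int]


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return a shortest move list for n disks using the classic recursion."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to dst
        + solve_hanoi(n - 1, aux, dst, src)  # bring the smaller disks on top
    )


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Simulate the moves, enforcing the rules and checking the final state."""
    # Peg 0 starts with disks n..1, largest at the bottom of the list.
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False  # unknown peg, or nothing to move
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up on peg 2 in the right order.
    return pegs[2] == list(range(n, 0, -1))
```

Because the shortest solution for n disks has 2^n - 1 moves, each added disk roughly doubles the length of a correct answer, and a simulator like `is_valid_solution` can check a proposed answer move by move rather than only comparing a final result.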
The central finding is a complete accuracy collapse beyond a certain complexity threshold: as puzzles grow harder, reasoning models do well, then degrade, then fall to essentially zero. More surprisingly, the models’ reasoning effort, measured in tokens spent thinking, first rises with difficulty and then declines near the collapse point even though token budget remains, as if the models give up. The paper identifies three regimes: on easy problems, standard models without extended reasoning can match reasoning models; on medium problems, reasoning helps; and on hard problems, both fail. It also shows that the models struggle with exact step-by-step computation even when handed the algorithm.
The paper provoked vigorous debate, including responses arguing that some failures reflected output-length limits rather than reasoning limits. Either way, it matters because it tempers the narrative that chain-of-thought models simply “think.” For businesses, it is a reminder to test reasoning systems at the complexity their real tasks demand, since impressive performance on moderate problems does not guarantee that performance holds as tasks get harder.
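One way to act on that reminder is to sweep over difficulty levels and record accuracy at each one, rather than reporting a single aggregate score. The Python sketch below shows the shape of such a sweep. Everything in it is hypothetical: `accuracy_by_complexity`, `first_level_below`, and `toy_attempt` are names invented here, and `toy_attempt` is a deterministic placeholder with no empirical meaning. In practice it would be replaced by a function that calls the system under test and verifies its answer.

```python
from typing import Callable, Dict, Iterable, Optional


def accuracy_by_complexity(
    attempt: Callable[[int, int], bool],
    levels: Iterable[int],
    trials: int = 5,
) -> Dict[int, float]:
    """Map each complexity level to the fraction of successful trials."""
    results: Dict[int, float] = {}
    for level in levels:
        successes = sum(1 for trial in range(trials) if attempt(level, trial))
        results[level] = successes / trials
    return results


def first_level_below(curve: Dict[int, float], threshold: float) -> Optional[int]:
    """Return the lowest level whose accuracy is under threshold, if any."""
    for level in sorted(curve):
        if curve[level] < threshold:
            return level
    return None


def toy_attempt(level: int, trial: int) -> bool:
    # Placeholder only (an assumption for illustration): pretends the system
    # succeeds up to level 4 and fails beyond it. Not real measurement data.
    return level <= 4


curve = accuracy_by_complexity(toy_attempt, levels=range(1, 8))
breaking_point = first_level_below(curve, threshold=0.9)
```

Reporting the whole curve, and the level at which it first drops below an acceptable threshold, makes it clear whether the complexity of real workloads falls before or after the point where accuracy falls off.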