The Self-Correction Illusion: LLMs Correct Others but Not Themselves

This paper revisits the well known finding that large language models struggle to fix errors in their own reasoning traces, yet correct the same errors more readily when those errors appear to come from an outside source. The authors argue the gap is not a cognitive limitation but an artifact of how errors are presented in chat templates.

To test this, the researchers kept an erroneous claim word-for-word identical and varied only its role label: presenting it as the model’s own prior thought, as a user message, as a tool response, or as system memory. Relabeling a claim from an internal source to an external one increased explicit-correction rates by 23 to 93 percentage points across the models tested. The effect held across seven model families and three task domains, with 10 of 13 test conditions reaching statistical significance at p below 0.001.

Building on the finding, the team designed a prompt-structure intervention that requires no model retraining. It exploits the same labeling artifact, and its effectiveness varied by domain: framing errors as memory blocks worked best for math tasks, while presenting them as plain user messages worked best for logical deduction.

The result matters because self-correction is a common assumption behind agentic and multi-step reasoning systems. If a model’s willingness to correct an error depends on the role label attached to that error rather than on the error itself, then reported self-correction abilities may be overstated, and simple prompt-structure changes may recover correction behavior without any retraining.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Sources

Related