Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” was submitted to arXiv on January 28, 2022 by Jason Wei, Xuezhi Wang, Dale Schuurmans, Quoc Le, Denny Zhou, and colleagues at Google Research. It introduced one of the simplest and most consequential prompting techniques in the history of large language models.

The idea is almost trivial to state. Instead of asking a model for a final answer directly, you include a few worked examples in the prompt that show the intermediate reasoning steps (the chain of thought) leading to each answer. The model then imitates that pattern, writing out its own step-by-step reasoning before committing to an answer. On multi-step problems in arithmetic, commonsense, and symbolic reasoning, this produced large accuracy gains over standard few-shot prompting. The headline result was that a 540-billion-parameter model (PaLM), prompted with just eight chain-of-thought examples, set a new state of the art on the GSM8K grade-school math benchmark, beating a fine-tuned GPT-3 that had a separate answer verifier.
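To make the mechanics concrete, here is a minimal sketch of how such a prompt can be assembled. It is an illustration, not the paper's code: the `Exemplar` class, the `build_prompt` and `extract_answer` helpers, and the "Q: / A: ... The answer is N." layout are assumptions chosen for this example, and the single worked problem stands in for the eight used in the headline experiment.

```python
import re
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example (hypothetical structure, for illustration only)."""
    question: str
    reasoning: str  # intermediate steps written out in plain language
    answer: str


def build_prompt(exemplars, question, with_reasoning=True):
    """Assemble a few-shot prompt.

    with_reasoning=True  -> chain-of-thought prompt (steps shown before each answer)
    with_reasoning=False -> standard few-shot prompt (answers only), for comparison
    """
    blocks = []
    for ex in exemplars:
        if with_reasoning:
            blocks.append(f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}.")
        else:
            blocks.append(f"Q: {ex.question}\nA: The answer is {ex.answer}.")
    # The new question ends with an open "A:" so the model continues the pattern.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion):
    """Pull the number that follows 'The answer is' out of a model completion."""
    match = re.search(r"The answer is\s*(-?[\d,]*\.?\d+)", completion)
    return match.group(1).replace(",", "") if match else None


# Illustrative exemplar; a real prompt would contain several of these.
EXEMPLARS = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        reasoning=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    ),
]

prompt = build_prompt(
    EXEMPLARS,
    "The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?",
)
print(prompt)
```

The resulting string would be sent to whatever language model is in use (no model call is shown here), and `extract_answer` would then be applied to the completion. Passing `with_reasoning=False` yields the answer-only baseline, so the two prompt styles can be compared on the same questions.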

The paper also observed that the benefit was concentrated in large models: gains appeared only at roughly 100 billion parameters and above, while smaller models prompted the same way got little or no lift and sometimes wrote fluent but wrong reasoning. This tied chain-of-thought to the contemporaneous debate about emergent abilities, those that appear only at scale.

Chain-of-thought reshaped how people use and build language models. It became a default prompting move, spawned variants like zero-shot “let’s think step by step” and self-consistency, and is a conceptual ancestor of the reasoning models (o1, DeepSeek-R1, and their successors) that were later trained to generate long internal chains of thought automatically rather than being coaxed into it by examples.

Sources

Last verified June 7, 2026