Reasoning models are language models built to deliberate before they answer. Instead of producing a reply in a single pass, they first generate a long internal chain of intermediate steps (working through the problem, checking themselves, and revising) and only then give a final answer. This practice is often described as scaling test-time compute: the model spends more computation at the moment you ask a question, in exchange for higher accuracy on hard tasks such as competition mathematics, coding, and multi-step logic. The lineage of these models runs from OpenAI’s o1 (2024) through systems like DeepSeek-R1 (2025).
A landmark technical account is the DeepSeek-R1 paper, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” posted as a preprint in early 2025 and published in Nature later that year. Its central finding is that reasoning behavior can be developed “through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories.” In other words, rather than being shown worked examples of good reasoning, the model is rewarded for reaching correct answers and learns on its own to produce useful intermediate steps. The authors report that sophisticated patterns, including self-reflection, verification, and changing strategy mid-problem, emerge spontaneously from this training.
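To make "rewarded for reaching correct answers" concrete, here is a minimal Python sketch of an outcome-only reward. It is an illustration under stated assumptions, not the paper's implementation: the <answer> tag format, the exact-match comparison, and the names extract_final_answer and outcome_reward are all hypothetical. The point it shows is that the reward looks only at the final answer, so nothing in it grades the intermediate reasoning.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None.

    The tag format is an assumption made for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Score 1.0 if the final answer matches the reference exactly, else 0.0.

    The reasoning text before the answer is never inspected, so the model
    gets no direct signal about how to reason, only about whether it was right.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if answer == reference.strip() else 0.0


# Two hypothetical completions for the question "What is 6 * 7?"
good = "<think>6 * 7... is it 41? Check: 6 * 7 = 42.</think><answer>42</answer>"
bad = "<think>6 * 7 is 41.</think><answer>41</answer>"

print(outcome_reward(good, "42"))
print(outcome_reward(bad, "42"))
```

In a real training setup, a score like this would feed a reinforcement learning algorithm that makes high-reward completions more likely; that update step is omitted here.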
This approach builds directly on chain-of-thought prompting, which showed that writing out intermediate steps improves accuracy. Reasoning models internalize that habit through training rather than relying on the user to prompt for it, and they typically generate far longer reasoning traces than a person would write by hand.
Why business readers should care: reasoning models are usually slower and more expensive per query because they generate many hidden “thinking” tokens before the visible answer. They are worth the cost for genuinely hard analytical work where being right matters more than being fast, and wasteful for simple tasks that a standard model handles instantly. Knowing which kind of task you have is the key to controlling cost.
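The cost difference can be sketched with simple arithmetic. The Python snippet below is illustrative only: the per-token prices and token counts are made-up numbers, not any provider's actual rates, and it assumes hidden thinking tokens are billed at the same rate as visible output tokens. The function name query_cost is hypothetical.

```python
def query_cost(
    input_tokens: int,
    output_tokens: int,
    thinking_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate the cost of one query in dollars.

    Assumption for this sketch: hidden thinking tokens are billed
    at the output-token rate, on top of the visible answer.
    """
    billed_output = output_tokens + thinking_tokens
    total = input_tokens * price_in_per_million + billed_output * price_out_per_million
    return total / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
PRICE_IN, PRICE_OUT = 2.0, 8.0

# Same question and same visible answer length; only the hidden thinking differs.
standard = query_cost(1_000, 300, 0, PRICE_IN, PRICE_OUT)
reasoning = query_cost(1_000, 300, 5_000, PRICE_IN, PRICE_OUT)

print(f"standard model:  ${standard:.4f}")
print(f"reasoning model: ${reasoning:.4f}")
print(f"ratio: {reasoning / standard:.1f}x")
```

With these made-up numbers the reasoning query costs roughly ten times as much as the standard one, and almost all of the difference comes from tokens the user never sees. Multiplying the per-query figure by expected query volume gives a rough first estimate of what routing simple tasks to a standard model could save.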