Test-time compute, also called inference-time compute, is the practice of getting better results from a language model by letting it do more work when answering a question, rather than by making the model larger or training it longer. The model’s weights stay fixed; what changes is how much computation is spent per query.
There are several ways to spend it. Sampling many candidate answers and taking a majority vote (self-consistency) is one. Searching over branching reasoning paths with lookahead and backtracking (tree of thoughts) is another. Generating a long internal chain of reasoning before answering, as modern reasoning models do, is a third. A common pattern pairs generation with a verifier or process reward model that scores candidate solutions and steers the search toward better ones.
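To make the first and last of these patterns concrete, here is a minimal Python sketch of majority voting (self-consistency) and of verifier-guided selection (often called best-of-N). It is an illustration under stated assumptions, not any particular system's implementation: `replay_sampler` is a hypothetical stand-in for a stochastic model call, and the `score` function stands in for a verifier or reward model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable


def replay_sampler(answers: Iterable[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: replays canned answers in order."""
    it = cycle(list(answers))
    return lambda: next(it)


def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most frequent one (majority vote)."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(
    sample_answer: Callable[[], str],
    score: Callable[[str], float],
    n_samples: int,
) -> str:
    """Draw n_samples candidates and return the one the verifier scores highest."""
    candidates = [sample_answer() for _ in range(n_samples)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # Five sampled answers to the same question; "42" appears three times.
    sampler = replay_sampler(["41", "42", "42", "43", "42"])
    print(self_consistency(sampler, n_samples=5))  # prints 42

    # Toy verifier (an assumption for illustration): prefers answers close to 42.
    toy_score = lambda a: -abs(int(a) - 42)
    print(best_of_n(replay_sampler(["41", "42", "43"]), toy_score, n_samples=3))  # prints 42
```

In both functions the only knob is `n_samples`: raising it spends more compute per query in exchange for a better chance of a correct final answer, which is the trade-off described above.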
The strategic importance of this idea grew sharply in 2024. The Snell et al. paper “Scaling LLM Test-Time Compute Optimally” showed that allocating inference budget according to how hard each prompt is can be more than four times as efficient as a best-of-N sampling baseline, and that on problems where a smaller model already has some chance of success, extra thinking time can let it match or beat a model fourteen times its size at equal total compute. This reframed inference as a tunable resource, not a fixed cost, and underpins the “reasoning model” generation of systems.
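The idea of spending more on harder prompts can be sketched in a few lines of Python. This is a deliberately simple heuristic, not the method from the paper: it assumes that disagreement among a small pilot batch of answers signals a hard prompt, and the names `adaptive_vote`, `pilot`, `max_samples`, and `agree_threshold` are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def adaptive_vote(
    sample_answer: Callable[[], str],
    pilot: int = 4,
    max_samples: int = 16,
    agree_threshold: float = 0.75,
) -> Tuple[str, int]:
    """Majority vote that only spends the full budget when the pilot batch disagrees.

    Returns the chosen answer and the number of samples actually drawn.
    """
    answers = [sample_answer() for _ in range(pilot)]
    top, count = Counter(answers).most_common(1)[0]

    # Strong agreement in the pilot batch: treat the prompt as easy and stop early.
    if count / pilot >= agree_threshold:
        return top, pilot

    # Otherwise treat the prompt as hard and spend the remaining budget.
    answers += [sample_answer() for _ in range(max_samples - pilot)]
    top, _ = Counter(answers).most_common(1)[0]
    return top, len(answers)


if __name__ == "__main__":
    # Hypothetical stand-ins for model calls on an easy and a hard prompt.
    easy = cycle(["7"])
    hard = cycle(["a", "b", "b", "c"])

    print(adaptive_vote(lambda: next(easy)))  # prints ('7', 4)
    print(adaptive_vote(lambda: next(hard)))  # prints ('b', 16)
```

The easy prompt costs four samples and the hard one sixteen, so the average cost per query falls without cutting the budget where it matters.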
Why business readers should care: test-time compute is a direct dial between cost and quality. For hard tasks, paying more per query can be cheaper and more reliable than buying a bigger model. That trade-off is worth understanding when budgeting AI workloads.