“Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters,” submitted to arXiv on August 6, 2024 by Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar, gave a careful empirical account of a shift that the o1-style reasoning models would soon make famous: getting a model to think harder at inference rather than making it bigger.
The paper studies two ways to spend extra compute at inference time. One is searching against a process-based verifier reward model, which scores intermediate reasoning steps and guides the search toward better answers. The other is adaptively revising the model’s own answer distribution for a given prompt. The central finding is that which method works best depends on how hard the prompt is, so the smart move is a “compute-optimal” strategy that allocates the inference budget per prompt according to difficulty.
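To make the per-prompt idea concrete, here is a minimal Python sketch of difficulty-based routing. It is an illustration under stated assumptions, not the authors' implementation: the strategy names, the three difficulty bins, and the validation accuracies are all made up for the example, and the difficulty estimate is assumed to come from somewhere else (for instance, a verifier's scores on a few probe samples).

```python
from typing import Dict, List

# Made-up validation accuracies per difficulty bin (index 0 = easiest).
# Strategy names are placeholders, not the paper's exact configurations.
VALIDATION_ACCURACY: Dict[str, List[float]] = {
    "revise_sequentially": [0.90, 0.70, 0.40],
    "search_with_verifier": [0.85, 0.75, 0.50],
}


def fit_policy(accuracy_by_strategy: Dict[str, List[float]]) -> List[str]:
    """For each difficulty bin, keep the strategy with the best validation accuracy."""
    n_bins = len(next(iter(accuracy_by_strategy.values())))
    policy = []
    for b in range(n_bins):
        best = max(accuracy_by_strategy, key=lambda s: accuracy_by_strategy[s][b])
        policy.append(best)
    return policy


def difficulty_bin(estimated_success_rate: float, n_bins: int) -> int:
    """Map an estimated success rate in [0, 1] to a bin; low success means a harder bin."""
    difficulty = 1.0 - estimated_success_rate
    return min(int(difficulty * n_bins), n_bins - 1)


def route(estimated_success_rate: float, policy: List[str]) -> str:
    """Pick the strategy the fitted policy assigns to this prompt's difficulty bin."""
    return policy[difficulty_bin(estimated_success_rate, len(policy))]


policy = fit_policy(VALIDATION_ACCURACY)
easy_choice = route(0.95, policy)   # prompt the model usually gets right
hard_choice = route(0.10, policy)   # prompt the model usually gets wrong
```

The design choice worth noting is that the policy is a lookup table fit on held-out data rather than a hand-written rule, so the same routing code works whichever strategy turns out to win in each bin.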
The numbers are striking. The compute-optimal approach improved the efficiency of test-time compute scaling by more than 4x over a standard best-of-N baseline. And in a FLOPs-matched comparison, on problems where the smaller model already had a non-trivial success rate, that smaller model with extra test-time compute outperformed a model 14 times its size. That reframes the cost equation: for some workloads, paying at inference is cheaper than paying for a larger model.
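A rough way to see what a compute-matched comparison involves is the back-of-the-envelope sketch below. It assumes the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token; those approximations, the function name, and the token counts are assumptions for illustration, not figures taken from this article. Setting the larger model's total FLOPs equal to the smaller model's and solving for the smaller model's inference multiplier k gives k = M + 3(M - 1) * D_pretrain / D_inference, where M is the size ratio.

```python
def matched_inference_multiplier(
    size_ratio: float,
    pretrain_tokens: float,
    inference_tokens: float,
) -> float:
    """How many times more inference compute a small model can spend per query
    while matching the total FLOPs of a model `size_ratio` times larger.

    Assumes training costs ~6 * params * tokens and inference ~2 * params * tokens.
    Both models are assumed to be pretrained on the same number of tokens.
    """
    if inference_tokens <= 0:
        raise ValueError("inference_tokens must be positive")
    return size_ratio + 3.0 * (size_ratio - 1.0) * pretrain_tokens / inference_tokens


# Illustrative inputs only: a 14x size gap, with lifetime inference tokens
# equal to pretraining tokens in one case and 10x larger in the other.
equal_workload = matched_inference_multiplier(14, 1e12, 1e12)
heavy_inference = matched_inference_multiplier(14, 1e12, 1e13)
```

Under these assumptions, the budget the smaller model can spend per query shrinks toward the plain size ratio as the expected inference volume grows relative to pretraining, which is why the trade-off depends on the workload.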
Why business readers should care: this work is part of the foundation for the “reasoning model” era. It shows that inference-time compute is a real, tunable resource, and that for tasks where a smaller model already has some chance of success, letting it think longer can be more economical than buying a bigger one.