Fast Inference from Transformers via Speculative Decoding

“Fast Inference from Transformers via Speculative Decoding” was submitted to arXiv on November 30, 2022 by Yaniv Leviathan, Matan Kalman, and Yossi Matias of Google Research. It introduced a technique for making large language models generate text faster without changing the model, retraining it, or altering what it produces.

The slowness it attacks is structural. Autoregressive models generate one token at a time, and each token requires a full forward pass through the network, so producing a hundred tokens means a hundred serial runs of an enormous model. Speculative decoding exploits the observation that much of generation is easy (common words, predictable continuations) and only occasionally hard. A small, cheap “draft” model proposes several tokens ahead in one quick burst. The large target model then scores all of those guesses in a single parallel forward pass, keeps the accepted prefix, and replaces the first rejected guess with a token of its own. Under greedy decoding, a guess is accepted exactly when it matches the token the target model would have chosen; under sampling, each guess is accepted with a probability based on the ratio of the target model probability to the draft model probability for that token. This acceptance rule guarantees that the output distribution is identical to that of running the large model alone: no quality loss, just speed.
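The accept-or-correct step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `speculative_step` and its arguments are hypothetical, distributions are plain lists of probabilities over a tiny vocabulary, and the draft and target models are assumed to have already been run to produce them.

```python
import random

def speculative_step(p_dists, q_dists, draft_tokens, rng):
    """Verify a run of draft tokens against the target model (illustrative sketch).

    q_dists[i] is the draft distribution that draft_tokens[i] was sampled from.
    p_dists[i] is the target distribution at the same position; p_dists has one
    extra entry, for the position right after the last draft token.
    Returns the tokens emitted by this step (always at least one).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), which
        # rng.choices normalizes for us, then stop at this position.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: the same target pass also gives the
    # distribution for the next position, so emit one extra token from it.
    bonus = p_dists[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

Each call emits between one token and the number of draft tokens plus one, which is where the speedup comes from: a single pass of the large model can advance generation by several positions. The residual resampling on rejection is what keeps the overall output distribution equal to the target model's.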

The authors reported a 2x to 3x speedup on T5-XXL relative to the standard T5X implementation, with identical outputs and no retraining. Because the guarantee is exact and the method needs no model modification, it was adopted broadly: many widely used inference stacks now ship some form of speculative decoding, and later variants drop the separate draft model in favor of the model drafting for itself.
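A rough sense of where a figure like 2x to 3x comes from can be had from a back-of-the-envelope model. The sketch below assumes each draft token is accepted independently with a fixed probability `alpha`, that the draft model proposes `gamma` tokens per step, and that one draft pass costs a fraction `c` of one target pass. These symbols and function names are assumptions introduced here for illustration; real acceptance rates vary by task and by model pair.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target-model pass.

    Assumes each of the gamma draft tokens is accepted independently with
    probability alpha; the step always emits one extra token on top of the
    accepted prefix, giving the sum 1 + alpha + ... + alpha**gamma.
    """
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def expected_speedup(alpha, gamma, c):
    """Estimated wall-time speedup over plain autoregressive decoding.

    One step costs gamma draft passes (each c target-pass units) plus one
    target pass, and emits expected_tokens_per_pass(alpha, gamma) tokens.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * c + 1)
```

With illustrative values (not figures from the paper) of `alpha = 0.8`, `gamma = 4`, and `c = 0.05`, the estimate is about 3.4 tokens per target pass and a speedup of roughly 2.8x. The model also shows the trade-off in choosing `gamma`: longer drafts emit more tokens per pass only while the acceptance rate stays high, and every extra draft token adds cost.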

Why business readers should care: inference is where deployed models spend most of their cost, and latency shapes user experience. Speculative decoding is one of the quiet engineering wins that made serving large models cheaper and chat assistants feel faster, with no trade-off in answer quality.
