Autoregressive Model

An autoregressive model generates a sequence one piece at a time, predicting each new piece from all the pieces that came before it. For a language model this means producing text token by token: given everything written so far, the model outputs a probability distribution over the next token, one token is chosen and appended to the running text, and the process repeats. GPT-style models are autoregressive in exactly this way: the GPT-3 paper, “Language Models are Few-Shot Learners,” describes GPT-3 as an autoregressive language model, trained to predict the next token.
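The predict, choose, append, repeat loop described above can be sketched in a few lines. This is a toy illustration only, not how any real model is implemented: the probability table, the `<s>` and `</s>` markers, and the function names are all hypothetical, and the stand-in "model" looks only at the last token, whereas a real language model conditions on the whole prefix.

```python
import random

# Hypothetical stand-in for a trained model: a hand-written table of
# next-token probabilities keyed by the most recent token.
TOY_TABLE = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Return a dict mapping each candidate next token to its probability."""
    return TOY_TABLE[prefix[-1]]


def generate(max_new_tokens=10, seed=0):
    """Run the autoregressive loop: predict, choose, append, repeat."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)       # distribution over next token
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]  # pick one token
        tokens.append(chosen)                        # append to the running text
        if chosen == "</s>":                         # stop at the end marker
            break
    return tokens


print(generate())
```

Each pass through the loop depends on the token chosen in the previous pass, which is the defining property of autoregressive generation.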

The word comes from statistics, where “autoregression” means regressing a variable on its own past values. The same idea drives image generators that emit pixels in order, audio models like WaveNet that produce samples one at a time, and code models that write programs token by token. Training is efficient because the correct “next token” at every position is simply the actual next token in the training data, so every position in a sequence supplies a training signal at once. Generation, by contrast, is inherently sequential: each token must be produced before the next can be predicted. That is why generating a long output is slower than processing an input of the same length, and why techniques like speculative decoding exist to speed generation up.
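To make the training point concrete, here is a minimal sketch of how a single sequence yields one prediction target per position. The function name and the example sentence are illustrative assumptions; real training pipelines work on batches of token IDs and compute the loss for all positions in one parallel pass rather than building an explicit list like this.

```python
def training_examples(tokens):
    """Return (prefix, target) pairs: one next-token target per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sequence = ["the", "cat", "sat", "down"]

for prefix, target in training_examples(sequence):
    print(prefix, "->", target)

# The same pairs viewed as a shift by one position: the target at each
# position is simply the token that follows it in the data.
inputs = sequence[:-1]
targets = sequence[1:]
```

A sequence of four tokens yields three training examples, and no labeling is needed beyond the text itself.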

Because the model only ever produces a distribution over the next token, how that token is actually chosen matters: greedy selection, beam search, or sampling methods like temperature and nucleus sampling all sit on top of the same autoregressive core.
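The sketch below shows how three of these strategies operate on the same next-token scores. It is an illustration under stated assumptions rather than any library's implementation: the logits are made up, the function names are hypothetical, and beam search is omitted because it tracks several candidate sequences at once instead of choosing a single token per step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy selection: always take the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose total probability
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample(probs, rng):
    """Draw one token index at random according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]               # made-up scores for four tokens
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)     # more peaked than probs
flat = softmax(logits, temperature=2.0)      # more uniform than probs
nucleus = nucleus_filter(probs, top_p=0.9)   # least likely token is dropped

rng = random.Random(0)
print("greedy pick:", greedy(probs))
print("nucleus sample:", sample(nucleus, rng))
```

All three functions consume the same distribution, so the choice of strategy can be changed without touching the model that produced the scores.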

Why business readers should care: “autoregressive” is the technical word for the left-to-right, one-token-at-a-time way chatbots write. It explains why responses stream in gradually, why output length drives cost and latency, and why the model cannot revise text it has already committed to without starting over.

Sources

Last verified June 7, 2026