Nucleus sampling, also called top-p sampling, is a way of choosing the next token that balances variety against coherence. It comes from the 2020 paper “The Curious Case of Neural Text Degeneration” by Holtzman, Buys, Du, Forbes, and Choi, which started from a puzzle: maximizing likelihood, as beam search does, produced text the authors found “bland and strangely repetitive,” while naively sampling from the full distribution occasionally picked a bizarre low-probability token and derailed the text. Their fix was to sample only from what they called the dynamic nucleus: “the smallest set of tokens whose cumulative probability reaches a threshold p.”
In practice, at each step the candidate tokens are sorted from most to least probable, and the decoding procedure adds them to a pool one by one until their cumulative probability reaches or exceeds p (say 0.9). The probabilities in the pool are then rescaled to sum to 1, and the next token is sampled from just that pool. The size of the pool changes from step to step: when the model is confident, the nucleus might hold only a token or two; when many continuations are reasonable, it widens. This adapts better than top-k sampling, which always keeps a fixed number of candidates regardless of how peaked or flat the distribution is.
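The procedure above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the paper's reference implementation; the function names `nucleus` and `sample_top_p` and the two toy distributions are made up for the example.

```python
import random

def nucleus(probs, p=0.9):
    """Return the smallest set of top-ranked tokens whose cumulative
    probability reaches p, with probabilities rescaled to sum to 1.
    `probs` maps token -> probability (hypothetical input format)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:  # threshold reached: stop growing the pool
            break
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token from the nucleus of `probs`."""
    pool = nucleus(probs, p)
    return rng.choices(list(pool), weights=list(pool.values()), k=1)[0]

# Toy distributions (invented for illustration only).
confident = {"Paris": 0.92, "Lyon": 0.04, "France": 0.03, "the": 0.01}
uncertain = {"red": 0.26, "blue": 0.22, "green": 0.18,
             "old": 0.14, "big": 0.12, "zebra": 0.08}

print(len(nucleus(confident)), len(nucleus(uncertain)))
print(sample_top_p(uncertain))
```

With p = 0.9, the peaked distribution yields a one-token nucleus, while the flatter one keeps five of its six tokens and drops only the least likely. A top-k rule with a fixed k would keep the same number of candidates in both cases.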
Nucleus sampling is often combined with a temperature setting, which rescales the distribution before sampling: a temperature below 1 sharpens it toward the most likely tokens, and a temperature above 1 flattens it. Together, top-p and temperature are the dials that chat and writing tools commonly expose to trade off predictability against creativity.
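To make the interaction concrete, here is a small sketch of how temperature changes the nucleus. It assumes the common convention of dividing the model's raw scores (logits) by the temperature before the softmax, and that the temperature is greater than zero; the logit values and the helper names `softmax` and `nucleus_size` are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by temperature.
    Assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p=0.9):
    """Count how many top-ranked tokens are needed to reach cumulative p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for five tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"temperature={t}: nucleus size {nucleus_size(probs)}")
```

For these example logits and p = 0.9, the nucleus grows from two tokens at temperature 0.5 to three at 1.0 and four at 2.0: a higher temperature flattens the distribution, so more tokens are needed to reach the same cumulative probability.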
Why business readers should care: top-p and temperature are the knobs that make an AI’s output more focused or more freewheeling. Lower settings favor safe, repeatable answers; higher ones favor variety, at some risk of stranger output. Understanding that trade-off is directly useful when tuning a product’s tone or reliability.