
Top-p (nucleus) sampling

A sampling method that picks each next token from the smallest set whose cumulative probability is at least p, adapting to how confident the model is.

Top-p sampling (also called nucleus sampling) restricts the model's choices to the smallest set of tokens whose cumulative probability reaches at least p (typically 0.9 or 0.95), then samples from that set. When the model is confident, the "nucleus" might contain only 2-3 tokens; when it is uncertain, the nucleus can expand to include 50 or more.

It matters because top-p adapts to context in a way that fixed-k methods cannot. Top-k 50 always considers 50 tokens, even when the model already strongly prefers one, which can introduce noise. Top-p only widens its consideration when the model itself is unsure, leading to more natural and consistent output.

A concrete example: completing "The capital of France is", the model puts 99%+ probability on "Paris", so with top-p 0.9 the nucleus is effectively just "Paris". Completing "My favorite color is", probability is spread across red, blue, green, and so on, so top-p 0.9 keeps all the common color words. Top-k 50 would include obscure tokens with 0.0001% probability in both cases, which is wasteful.

Most APIs let you set both temperature and top-p. Common defaults: temperature 0.7-1.0, top-p 0.9-0.95. For deterministic output (code, JSON), use temperature 0; top-p then becomes irrelevant, since the highest-probability token is always chosen.

Related: temperature, top-k, sampling, decoding.
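To make the mechanics concrete, here is a minimal sketch of nucleus sampling in Python with NumPy. The function name, signature, and defaults are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token index via top-p (nucleus) sampling.

    Illustrative sketch: `logits` is a 1-D array of raw scores over
    the vocabulary; names and defaults are assumed, not a real API.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Temperature-scale, then softmax (subtract max for numerical stability).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Sort descending; the nucleus is the smallest prefix whose
    # cumulative probability reaches p (always at least one token).
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and sample from it.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# A peaked distribution: the nucleus collapses to the single top token.
confident = np.array([10.0, 1.0, 0.5, 0.1])
# A flatter distribution: the nucleus keeps several candidates.
uncertain = np.array([1.2, 1.1, 1.0, 0.9])
print(top_p_sample(confident), top_p_sample(uncertain))
```

With a peaked distribution the cutoff lands after the first token, while a flat distribution keeps many candidates: exactly the adaptive behavior described above.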

Last updated: 2026-04-29
