Top-k sampling

A sampling method that restricts each next-token choice to the k highest-probability tokens — simpler but less adaptive than top-p.

Top-k sampling restricts the model's next-token choices to the k tokens with the highest probability, renormalizes their probabilities, and samples from that subset. With k=1 you always pick the most likely token, which is equivalent to greedy decoding / temperature 0. With k=50 you consider the 50 most likely options regardless of how uneven the probability distribution is.

It matters because it is the simplest cap on output randomness. Without any cap, even a token with 0.0001% probability can occasionally be sampled, producing weird tokens or outright breakdowns. Top-k filters those out.

A concrete example: completing "once upon a time, there was a". With temperature 1 and no top-k, you might rarely sample an obscure or syntactically wrong token. Top-k 40 ensures you only choose from the 40 most likely continuations: "king", "princess", "young", "little", and so on, all reasonable.

The weakness versus top-p is that top-k considers the same number of candidates regardless of how confident the model is. When the model strongly knows the next word ("the capital of France is..."), top-k 50 wastes consideration on unlikely tokens. When the model is genuinely uncertain, as in open-ended creative writing, top-k 50 might be too restrictive. Most modern APIs default to top-p instead, though some expose both.

Related: top-p, temperature, sampling, decoding.
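The mechanics are easy to see in code. Below is a minimal sketch in Python using NumPy; the function name top_k_sample and the toy logits are illustrative assumptions, not from any particular library.

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, temperature: float = 1.0,
                 rng: np.random.Generator | None = None) -> int:
    """Sample a token id from the k highest-probability tokens."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    # Keep only the indices of the k largest logits.
    top_idx = np.argpartition(scaled, -k)[-k:]
    # Softmax over the kept logits; every other token gets probability 0.
    kept = scaled[top_idx]
    probs = np.exp(kept - kept.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))

# Toy vocabulary of 5 tokens: token 0 is very likely, token 4 is a
# tiny-probability outlier of the kind top-k is meant to filter out.
logits = np.array([5.0, 3.0, 2.5, 1.0, -9.0])
print(top_k_sample(logits, k=2))   # only token 0 or 1 can ever be drawn
print(top_k_sample(logits, k=1))   # always token 0: greedy decoding
```

With k=1 the subset collapses to the single most likely token, matching the greedy / temperature-0 case above. A top-p implementation would instead keep the smallest set of tokens whose cumulative probability exceeds p, so its candidate count grows and shrinks with the model's uncertainty, which is exactly the adaptivity top-k lacks.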

Last updated: 2026-04-29
