Terminology · 7 min read

Temperature, top-p, top-k: sampling parameters explained

Three knobs that control how creative or boring your model output is. Most people leave them at default; the defaults are usually wrong.

Every LLM API exposes some combination of temperature, top_p, and top_k as sampling parameters. They control how the model picks the next token from a probability distribution. Most builders leave them at the defaults forever, even though for specific tasks the defaults can produce significantly worse output than better-chosen values.

How models pick the next token

At every step, an LLM produces a probability distribution over its entire vocabulary (50k-200k tokens depending on the model). "Given the context so far, the next token is 'banana' with probability 0.4, 'apple' with 0.3, 'orange' with 0.1, ... ."

Sampling parameters decide which token to actually pick from this distribution. Different parameters produce different output styles:

  • Always pick the highest probability → deterministic, often boring
  • Random sample weighted by probability → varied, sometimes weird
  • Some middle ground → creative but coherent
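
To make the three styles concrete, here is a tiny Python sketch over a made-up four-token distribution (the probabilities are invented; a real vocabulary has tens of thousands of entries):

    import random

    # Toy distribution over four tokens; real models produce one over 50k-200k tokens.
    probs = {"banana": 0.4, "apple": 0.3, "kiwi": 0.2, "orange": 0.1}

    # 1. Always pick the highest probability: deterministic, often boring.
    greedy = max(probs, key=probs.get)

    # 2. Random sample weighted by probability: varied, sometimes weird.
    tokens, weights = zip(*probs.items())
    sampled = random.choices(tokens, weights=weights, k=1)[0]

    print(greedy, sampled)

The "middle ground" options are exactly what temperature, top-p, and top-k tune, covered below.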

Temperature

Temperature controls how "flat" or "peaky" the distribution becomes before sampling. Range typically 0 to 2.

  • Temperature = 0 — the model always picks the highest-probability token. Output is deterministic: the same input always produces the same output. Useful for tasks where consistency matters (extraction, classification, structured output).
  • Temperature = 0.3-0.5 — slight variation but still focused. Good for code generation, factual Q&A, technical writing.
  • Temperature = 0.7 — common default. Balanced creativity and coherence. Good for chat, summarization, general tasks.
  • Temperature = 1.0 — full "natural" sampling per the model's distribution. Good for creative writing, brainstorming.
  • Temperature > 1.0 — increasingly random. Useful for breaking out of patterns; quickly becomes incoherent above 1.5.

The trap: most APIs default to 0.7 or 1.0. For technical tasks, this is too high. Lower it.
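
As a rough sketch of the mechanics, assuming the usual softmax-over-logits formulation (the logit values below are invented, and temperature 0 is treated as a separate greedy case rather than a division by zero), temperature divides the logits before they are turned into probabilities:

    import math

    def softmax_with_temperature(logits, temperature):
        # Divide logits by temperature, then renormalize.
        # Low temperature sharpens the peak; high temperature flattens it.
        scaled = [l / temperature for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.5, 0.1]
    print(softmax_with_temperature(logits, 0.3))  # peaky: the top token dominates
    print(softmax_with_temperature(logits, 1.0))  # the model's "natural" distribution
    print(softmax_with_temperature(logits, 1.5))  # flatter: more randomness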

Top-p (nucleus sampling)

Top-p limits sampling to the smallest set of tokens whose cumulative probability is at least p.

  • Top-p = 1.0 — disabled, considers all tokens.
  • Top-p = 0.9 — common useful value. Considers only the tokens that together make up 90% of the probability mass.
  • Top-p = 0.5 — much more constrained; only the few highest-probability tokens survive the cut.

Top-p is dynamic: when the model is confident (one token holds most of the probability mass), the nucleus shrinks to just a token or two. When it is uncertain (probability spread across many candidates), many more tokens stay in play.
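
A minimal sketch of the filtering step, assuming the common "sort, accumulate until p is reached, renormalize" formulation; the toy distributions are invented:

    def top_p_filter(probs, p):
        # Keep the smallest set of highest-probability tokens whose
        # cumulative probability reaches p, then renormalize.
        kept, cumulative = {}, 0.0
        for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            kept[token] = prob
            cumulative += prob
            if cumulative >= p:
                break
        total = sum(kept.values())
        return {t: pr / total for t, pr in kept.items()}

    # Confident: one token carries most of the mass, so the nucleus is tiny.
    print(top_p_filter({"the": 0.92, "a": 0.04, "an": 0.02, "this": 0.02}, 0.9))
    # Uncertain: mass is spread out, so more tokens survive the cut.
    print(top_p_filter({"red": 0.3, "blue": 0.25, "green": 0.2, "tan": 0.15, "gray": 0.1}, 0.9))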

Top-k

Top-k limits sampling to the k highest-probability tokens.

  • Top-k = 1 — same as temperature 0 (always pick the top).
  • Top-k = 40 — common default. Considers top 40 tokens.
  • Top-k = 0 — disabled (sample from full distribution).

Top-k is less commonly exposed in modern APIs (OpenAI's API doesn't offer it; Anthropic and Gemini do).
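
The filtering itself is simpler than top-p; a sketch on the same invented toy distribution as earlier:

    def top_k_filter(probs, k):
        # Keep only the k most likely tokens and renormalize; k = 0 means disabled.
        if k <= 0:
            return dict(probs)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        total = sum(p for _, p in top)
        return {t: p / total for t, p in top}

    probs = {"banana": 0.4, "apple": 0.3, "orange": 0.2, "kiwi": 0.1}
    print(top_k_filter(probs, 1))   # greedy: same effect as temperature 0
    print(top_k_filter(probs, 2))   # only the top two candidates remain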

How they combine

In practice, when multiple parameters are set:

  1. Apply temperature scaling first (reshapes the distribution)
  2. Apply top-k filtering (keep only top k tokens)
  3. Apply top-p filtering (keep only tokens cumulatively at p)
  4. Sample from what remains

Most people use temperature alone or temperature + top-p. Using all three is overkill and can produce confusing interactions.
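
A sketch of that order of operations end to end, on invented logits over a toy vocabulary; real implementations work over the full vocabulary and providers may differ in details:

    import math
    import random

    def sample_next_token(logits, temperature=0.7, top_k=0, top_p=1.0):
        # 1. Temperature scaling reshapes the distribution.
        scaled = {t: v / temperature for t, v in logits.items()}
        m = max(scaled.values())
        probs = {t: math.exp(v - m) for t, v in scaled.items()}
        total = sum(probs.values())
        ranked = sorted(((t, p / total) for t, p in probs.items()),
                        key=lambda kv: kv[1], reverse=True)

        # 2. Top-k filtering: keep only the k most likely tokens (0 = disabled).
        if top_k > 0:
            ranked = ranked[:top_k]

        # 3. Top-p filtering: keep the smallest prefix reaching cumulative p.
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break

        # 4. Sample from what remains (random.choices renormalizes the weights).
        tokens, weights = zip(*kept)
        return random.choices(tokens, weights=weights, k=1)[0]

    logits = {"banana": 2.0, "apple": 1.5, "orange": 0.5, "kiwi": 0.1}
    print(sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9))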

Pragmatic settings by task

Structured output (JSON, code, classification):

  • Temperature: 0
  • Top-p: 1.0 (irrelevant when temp=0)
  • Why: you want consistent, parseable output

Technical Q&A:

  • Temperature: 0.2-0.4
  • Top-p: 0.9
  • Why: factual but allows minor variation in phrasing

Creative writing:

  • Temperature: 0.7-1.0
  • Top-p: 0.9-0.95
  • Why: variety while staying coherent

Brainstorming / ideation:

  • Temperature: 1.0-1.2
  • Top-p: 0.95
  • Why: explore the distribution, get unusual ideas

Chat (default):

  • Temperature: 0.5-0.7
  • Top-p: 0.9
  • Why: balanced; what most APIs ship as default for general use
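
One way to keep these recommendations next to the code: a small preset table (the task names and values below simply mirror the lists above) that you unpack into whatever client call you use:

    # Preset sampling parameters per task type, mirroring the recommendations above.
    SAMPLING_PRESETS = {
        "structured_output": {"temperature": 0.0, "top_p": 1.0},
        "technical_qa":      {"temperature": 0.3, "top_p": 0.9},
        "creative_writing":  {"temperature": 0.9, "top_p": 0.95},
        "brainstorming":     {"temperature": 1.1, "top_p": 0.95},
        "chat":              {"temperature": 0.6, "top_p": 0.9},
    }

    params = SAMPLING_PRESETS["technical_qa"]
    # e.g. client.chat.completions.create(model=..., messages=..., **params)
    print(params)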

Common mistakes

Setting temperature too high for structured tasks. "Extract the email address from this text" with temperature 0.7 sometimes returns the right email, sometimes returns invented variations. Use temperature 0 for extraction.

Setting temperature too low for creative tasks. "Generate 10 product name ideas" with temperature 0.2 returns 10 nearly-identical names. Use temperature 0.9-1.2.

Mixing top-p and top-k unnecessarily. Pick one or the other plus temperature. The interaction effects of all three are subtle and rarely worth the cognitive overhead.

Treating temperature like volume. Higher isn't "more creative," it's "more random." There's a quality cliff above ~1.2 where output becomes incoherent.

Reasoning models are different

For models like o3, DeepSeek R1, and Claude with extended thinking, sampling parameters affect the reasoning process differently. Many APIs:

  • Don't expose temperature for reasoning models
  • Recommend temperature 1.0 by default (since the model does its own structured exploration)
  • Publish best-practice guidance that differs from the advice for standard models

If you're using a reasoning model, read the provider's specific guidance.

When NOT to think about these

For casual chat usage. The defaults are usually fine.

For production at scale. Set temperature once after testing, leave it alone.

For reasoning models. Trust the provider's recommendations.

For first-pass prototyping. Get the prompt working before tuning sampling.

When tuning matters

  • Building structured output pipelines (extraction, classification): low temperature is critical
  • Creative tools where output diversity is the product: high temperature matters
  • Evaluation: testing model behavior at multiple temperatures reveals capability vs random luck
  • Cost optimization: lower temperature can reduce token usage if you're hitting max_tokens unnecessarily

Decision tree

  • Need consistent JSON / structured output: temperature = 0
  • Factual Q&A or technical content: temperature = 0.2-0.4
  • General chat: temperature = 0.5-0.7 (default OK)
  • Creative writing or brainstorming: temperature = 0.9-1.2
  • Don't have time to think: leave defaults; come back when output is wrong

Next steps

  • Test your specific task at temperature 0, 0.5, 1.0 — pick what works (see the sketch after this list)
  • For production tasks, document the sampling parameters next to the prompt
  • Read about sampling for reasoning models specifically
  • Read about beam search and contrastive search (less common but useful for some tasks)
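
For the first bullet, a sketch of a quick temperature sweep; call_model here is a hypothetical stand-in, not a real library function, so swap the stub for your provider's request code:

    def call_model(prompt: str, temperature: float) -> str:
        # Stub: replace with your provider's API call, passing temperature through.
        return f"(stub response at temperature {temperature})"

    prompt = "Summarize this changelog in three bullet points."
    for temperature in (0.0, 0.5, 1.0):
        print(f"--- temperature={temperature} ---")
        print(call_model(prompt, temperature=temperature))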

Last updated: 2026-04-29
