Terminology · 7 min read

Temperature, top-p, top-k: sampling parameters explained

Three knobs that control how creative or boring your model output is. Most people leave them at default; the defaults are usually wrong.

Every LLM API exposes some combination of temperature, top_p, and top_k as sampling parameters. They control how the model picks the next token from a probability distribution. Most builders leave them at the defaults forever, even though for specific tasks the defaults can produce significantly worse output than better-chosen values.

How models pick the next token

At every step, an LLM produces a probability distribution over its entire vocabulary (50k-200k tokens depending on the model). "Given the context so far, the next token is 'banana' with probability 0.4, 'apple' with 0.3, 'orange' with 0.1, ... ."

Sampling parameters decide which token to actually pick from this distribution. Different parameters produce different output styles:

  • Always pick the highest probability → deterministic, often boring
  • Random sample weighted by probability → varied, sometimes weird
  • Some middle ground → creative but coherent
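
To make the three styles concrete, here is a tiny Python sketch over a made-up four-token distribution (the probabilities are invented; a real vocabulary has tens of thousands of entries):

    import random

    # Toy distribution over four tokens; real models produce one over 50k-200k tokens.
    probs = {"banana": 0.4, "apple": 0.3, "kiwi": 0.2, "orange": 0.1}

    # 1. Always pick the highest probability: deterministic, often boring.
    greedy = max(probs, key=probs.get)

    # 2. Random sample weighted by probability: varied, sometimes weird.
    tokens, weights = zip(*probs.items())
    sampled = random.choices(tokens, weights=weights, k=1)[0]

    print(greedy, sampled)

The "middle ground" options are exactly what temperature, top-p, and top-k tune, covered below.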

Temperature

Temperature controls how "flat" or "peaky" the distribution becomes before sampling. Range typically 0 to 2.

  • Temperature = 0 — the model always picks the highest-probability token. Output is deterministic: the same input always produces the same output. Useful for tasks where consistency matters (extraction, classification, structured output).
  • Temperature = 0.3-0.5 — slight variation but still focused. Good for code generation, factual Q&A, technical writing.
  • Temperature = 0.7 — common default. Balanced creativity and coherence. Good for chat, summarization, general tasks.
  • Temperature = 1.0 — full "natural" sampling per the model's distribution. Good for creative writing, brainstorming.
  • Temperature > 1.0 — increasingly random. Useful for breaking out of patterns; quickly becomes incoherent above 1.5.

The trap: most APIs default to 0.7 or 1.0. For technical tasks, this is too high. Lower it.
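
As a rough sketch of the mechanics, assuming the usual softmax-over-logits formulation (the logit values below are invented, and temperature 0 is treated as a separate greedy case rather than a division by zero), temperature divides the logits before they are turned into probabilities:

    import math

    def softmax_with_temperature(logits, temperature):
        # Divide logits by temperature, then renormalize.
        # Low temperature sharpens the peak; high temperature flattens it.
        scaled = [l / temperature for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.5, 0.1]
    print(softmax_with_temperature(logits, 0.3))  # peaky: the top token dominates
    print(softmax_with_temperature(logits, 1.0))  # the model's "natural" distribution
    print(softmax_with_temperature(logits, 1.5))  # flatter: more randomness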

Top-p (nucleus sampling)

Top-p limits sampling to the smallest set of tokens whose cumulative probability is at least p.

  • Top-p = 1.0 — disabled, considers all tokens.
  • Top-p = 0.9 — common useful value. Considers only the tokens that together make up 90% of the probability mass.
  • Top-p = 0.5 — much more constrained; only the few highest-probability tokens survive the cut.

Top-p is dynamic: when the model is confident (one token holds most of the probability mass), the nucleus shrinks to just a token or two. When it is uncertain (probability spread across many candidates), many more tokens stay in play.
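
A minimal sketch of the filtering step, assuming the common "sort, accumulate until p is reached, renormalize" formulation; the toy distributions are invented:

    def top_p_filter(probs, p):
        # Keep the smallest set of highest-probability tokens whose
        # cumulative probability reaches p, then renormalize.
        kept, cumulative = {}, 0.0
        for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            kept[token] = prob
            cumulative += prob
            if cumulative >= p:
                break
        total = sum(kept.values())
        return {t: pr / total for t, pr in kept.items()}

    # Confident: one token carries most of the mass, so the nucleus is tiny.
    print(top_p_filter({"the": 0.92, "a": 0.04, "an": 0.02, "this": 0.02}, 0.9))
    # Uncertain: mass is spread out, so more tokens survive the cut.
    print(top_p_filter({"red": 0.3, "blue": 0.25, "green": 0.2, "tan": 0.15, "gray": 0.1}, 0.9))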

Top-k

Top-k limits sampling to the k highest-probability tokens.

  • Top-k = 1 — same as temperature 0 (always pick the top).
  • Top-k = 40 — common default. Considers top 40 tokens.
  • Top-k = 0 — disabled (sample from full distribution).

Top-k is less commonly exposed in modern APIs (OpenAI's API doesn't offer it; Anthropic and Gemini do).
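
The filtering itself is simpler than top-p; a sketch on the same invented toy distribution as earlier:

    def top_k_filter(probs, k):
        # Keep only the k most likely tokens and renormalize; k = 0 means disabled.
        if k <= 0:
            return dict(probs)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        total = sum(p for _, p in top)
        return {t: p / total for t, p in top}

    probs = {"banana": 0.4, "apple": 0.3, "orange": 0.2, "kiwi": 0.1}
    print(top_k_filter(probs, 1))   # greedy: same effect as temperature 0
    print(top_k_filter(probs, 2))   # only the top two candidates remain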

How they combine

In practice, when multiple parameters are set:

  1. Apply temperature scaling first (reshapes the distribution)
  2. Apply top-k filtering (keep only top k tokens)
  3. Apply top-p filtering (keep only tokens cumulatively at p)
  4. Sample from what remains

Most people use temperature alone or temperature + top-p. Using all three is overkill and can produce confusing interactions.
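
A sketch of that order of operations end to end, on invented logits over a toy vocabulary; real implementations work over the full vocabulary and providers may differ in details:

    import math
    import random

    def sample_next_token(logits, temperature=0.7, top_k=0, top_p=1.0):
        # 1. Temperature scaling reshapes the distribution.
        scaled = {t: v / temperature for t, v in logits.items()}
        m = max(scaled.values())
        probs = {t: math.exp(v - m) for t, v in scaled.items()}
        total = sum(probs.values())
        ranked = sorted(((t, p / total) for t, p in probs.items()),
                        key=lambda kv: kv[1], reverse=True)

        # 2. Top-k filtering: keep only the k most likely tokens (0 = disabled).
        if top_k > 0:
            ranked = ranked[:top_k]

        # 3. Top-p filtering: keep the smallest prefix reaching cumulative p.
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break

        # 4. Sample from what remains (random.choices renormalizes the weights).
        tokens, weights = zip(*kept)
        return random.choices(tokens, weights=weights, k=1)[0]

    logits = {"banana": 2.0, "apple": 1.5, "orange": 0.5, "kiwi": 0.1}
    print(sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9))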

Pragmatic settings by task

Structured output (JSON, code, classification):

  • Temperature: 0
  • Top-p: 1.0 (irrelevant when temp=0)
  • Why: you want consistent, parseable output

Technical Q&A:

  • Temperature: 0.2-0.4
  • Top-p: 0.9
  • Why: factual but allows minor variation in phrasing

Creative writing:

  • Temperature: 0.7-1.0
  • Top-p: 0.9-0.95
  • Why: variety while staying coherent

Brainstorming / ideation:

  • Temperature: 1.0-1.2
  • Top-p: 0.95
  • Why: explore the distribution, get unusual ideas

Chat (default):

  • Temperature: 0.5-0.7
  • Top-p: 0.9
  • Why: balanced; what most APIs ship as default for general use
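
One way to keep these recommendations next to the code: a small preset table (the task names and values below simply mirror the lists above) that you unpack into whatever client call you use:

    # Preset sampling parameters per task type, mirroring the recommendations above.
    SAMPLING_PRESETS = {
        "structured_output": {"temperature": 0.0, "top_p": 1.0},
        "technical_qa":      {"temperature": 0.3, "top_p": 0.9},
        "creative_writing":  {"temperature": 0.9, "top_p": 0.95},
        "brainstorming":     {"temperature": 1.1, "top_p": 0.95},
        "chat":              {"temperature": 0.6, "top_p": 0.9},
    }

    params = SAMPLING_PRESETS["technical_qa"]
    # e.g. client.chat.completions.create(model=..., messages=..., **params)
    print(params)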

Common mistakes

Setting temperature too high for structured tasks. "Extract the email address from this text" with temperature 0.7 sometimes returns the right email, sometimes returns invented variations. Use temperature 0 for extraction.

Setting temperature too low for creative tasks. "Generate 10 product name ideas" with temperature 0.2 returns 10 nearly-identical names. Use temperature 0.9-1.2.

Mixing top-p and top-k unnecessarily. Pick one or the other plus temperature. The interaction effects of all three are subtle and rarely worth the cognitive overhead.

Treating temperature like volume. Higher isn't "more creative," it's "more random." There's a quality cliff above ~1.2 where output becomes incoherent.

Reasoning models are different

For models like o3, DeepSeek R1, and Claude with extended thinking, sampling parameters affect the reasoning process differently. Many APIs:

  • Don't expose temperature for reasoning models
  • Recommend temperature 1.0 by default (since the model does its own structured exploration)
  • Publish best-practice guidance that differs from the advice for standard models

If you're using a reasoning model, read the provider's specific guidance.

When NOT to think about these

For casual chat usage. The defaults are usually fine.

For production at scale. Set temperature once after testing, leave it alone.

For reasoning models. Trust the provider's recommendations.

For first-pass prototyping. Get the prompt working before tuning sampling.

When tuning matters

  • Building structured output pipelines (extraction, classification): low temperature is critical
  • Creative tools where output diversity is the product: high temperature matters
  • Evaluation: testing model behavior at multiple temperatures reveals capability vs random luck
  • Cost optimization: lower temperature can reduce token usage if you're hitting max_tokens unnecessarily

Decision tree

  • Need consistent JSON / structured output: temperature = 0
  • Factual Q&A or technical content: temperature = 0.2-0.4
  • General chat: temperature = 0.5-0.7 (default OK)
  • Creative writing or brainstorming: temperature = 0.9-1.2
  • Don't have time to think: leave defaults; come back when output is wrong

Next steps

  • Test your specific task at temperature 0, 0.5, 1.0 — pick what works (see the sketch after this list)
  • For production tasks, document the sampling parameters next to the prompt
  • Read about sampling for reasoning models specifically
  • Read about beam search and contrastive search (less common but useful for some tasks)
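
For the first bullet, a sketch of a quick temperature sweep; call_model here is a hypothetical stand-in, not a real library function, so swap the stub for your provider's request code:

    def call_model(prompt: str, temperature: float) -> str:
        # Stub: replace with your provider's API call, passing temperature through.
        return f"(stub response at temperature {temperature})"

    prompt = "Summarize this changelog in three bullet points."
    for temperature in (0.0, 0.5, 1.0):
        print(f"--- temperature={temperature} ---")
        print(call_model(prompt, temperature=temperature))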

Last updated: 2026-04-29
