
Why input tokens cost less than output tokens

Reading is fast, writing is slow — the technical reason your LLM bill looks the way it does.

If you've looked at any LLM API pricing page in 2026 (Anthropic, OpenAI, Google, anyone), you've noticed that input tokens cost several times less than output tokens. Claude Sonnet 4.6 charges $3/M input but $15/M output, a 5× gap. GPT-5 charges $1.25/M input but $10/M output, an 8× gap. This isn't an arbitrary pricing strategy. It reflects a fundamental asymmetry in how LLMs actually run. Once you understand it, you'll make better cost decisions and stop being confused by your bill.

The technical asymmetry

When an LLM processes your input, it does so in parallel. All N input tokens go through the model at once in a single forward pass. The math is heavy, but the work is highly parallelizable across GPUs and within each GPU's compute units. Modern inference servers like vLLM and TensorRT-LLM are optimized to crunch input as fast as possible.

When the LLM generates output, it has to do it one token at a time. Each generated token depends on the previous one. The model produces token #1, then includes it in the input to produce token #2, then both to produce token #3. This is sequential. You cannot parallelize generation of the next token until you have the previous one.

The result, in rough numbers:

  • Input: tens of thousands of tokens per second per GPU
  • Output: 50-200 tokens per second per GPU

That's a 100-1000× difference in 'tokens per second of GPU time'. The 5-8× price gap actually understates the real cost ratio; providers absorb some of the difference because output tokens are also where their margin lives.
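
To make the two phases concrete, here's a toy sketch of the loop every inference engine runs in some form. The `toy_forward` and `sample` functions are fake stand-ins for a real transformer and a real sampler; only the shape of the control flow matters.

```python
import random

def toy_forward(tokens, kv_cache):
    # Stand-in for a transformer forward pass. A real model does heavy matrix
    # math here; the point is that the call sees all `tokens` at once and the
    # KV cache grows by one entry per token processed.
    kv_cache = (kv_cache or []) + list(tokens)
    logits = [random.random() for _ in range(50)]  # fake 50-token vocabulary
    return logits, kv_cache

def sample(logits):
    return logits.index(max(logits))  # greedy "sampling" over the fake logits

def generate(prompt_tokens, max_new_tokens):
    # Prefill: ONE pass over all prompt tokens together; parallel-friendly.
    logits, kv_cache = toy_forward(prompt_tokens, kv_cache=None)
    next_token = sample(logits)

    # Decode: strictly sequential; token N+1 cannot start until token N exists.
    output = []
    for _ in range(max_new_tokens):
        output.append(next_token)
        logits, kv_cache = toy_forward([next_token], kv_cache)
        next_token = sample(logits)
    return output

print(generate(prompt_tokens=list(range(1000)), max_new_tokens=5))
```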

What this means for your bill

A typical chat message has maybe 50 input tokens (the question) and 500 output tokens (the answer). A long-context prompt has 50,000 input tokens (a document) and 1,000 output tokens (a summary). Let's calculate.

With Claude Sonnet 4.6 ($3/M input, $15/M output):

  • Chat: 50 × $0.000003 + 500 × $0.000015 = $0.00015 + $0.0075 = ~$0.0077. Output dominates 50×.
  • Document summary: 50,000 × $0.000003 + 1,000 × $0.000015 = $0.15 + $0.015 = ~$0.165. Input dominates 10×.

This explains a counterintuitive observation: the longer your prompts get, the more your bill skews toward input cost — even though each input token is cheap. RAG, agent loops with long history, and document analysis are input-heavy workloads.
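
To run the same arithmetic for your own traffic mix, a few lines of Python are enough (prices hard-coded to the Sonnet 4.6 figures quoted above):

```python
# Minimal cost calculator using the Sonnet 4.6 prices quoted above.
INPUT_PRICE_PER_M = 3.00    # $ per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens

def cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"chat:    ${cost(50, 500):.4f}")        # ~$0.0077, output-dominated
print(f"summary: ${cost(50_000, 1_000):.3f}")  # ~$0.165, input-dominated
```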

The KV cache and prompt caching

There's a subtle further wrinkle: when the model processes input, it computes a 'KV cache' (key-value cache) entry for every token. These cache entries are what allow the model to attend back to earlier tokens when generating each new output token. The cache grows linearly with input length, and building it is where the compute cost of input processing actually goes.

Providers have realized that if the same prompt prefix appears in many requests (a common system prompt, a long document used multiple times), they can cache the KV state and skip recomputing it. This is prompt caching, and it gives you a second, deeper discount on input tokens that hit the cache:

  • Anthropic: cached input is ~10% the price of fresh input ($0.30/M vs $3/M for Sonnet 4.6)
  • OpenAI: cached input is 50% off (~$0.625/M for GPT-5)
  • Gemini: roughly 75% off on cached input

If your application has a long, static prompt prefix (system prompt, examples, document context) and you call the model multiple times with that prefix, prompt caching can cut your input costs by 75-90%. For RAG and agent applications, this is the single biggest cost optimization available.
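
As a rough sketch of what this looks like in practice, here's how a reused prefix might be marked with the Anthropic Python SDK's cache_control field. The model id and file name are placeholders; check the current docs before copying this.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_document = open("contract.txt").read()  # the long, static prefix

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # illustrative id for the model discussed above
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": f"Answer questions about this document:\n\n{long_document}",
                # Everything up to and including this block is cached; later
                # calls that reuse the identical prefix pay the cached-input rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# First call writes the cache; the second (within the cache lifetime) reads it.
print(ask("What is the termination clause?"))
print(ask("Who are the parties?"))
```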

Why output is so much more expensive in practice

Generating output token by token is bottlenecked by GPU memory bandwidth, not compute. Each new token requires reading the entire model's weights from memory to compute the next token's probabilities. For a 70B parameter model that's 140GB of memory reads per token (in BF16). Modern GPUs have ~3 TB/s of memory bandwidth, so that's ~50ms per token in the best case.
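
You can reproduce that back-of-envelope bound in a couple of lines:

```python
# Back-of-envelope decode bound from memory bandwidth alone.
params = 70e9            # 70B parameters
bytes_per_param = 2      # BF16
bandwidth = 3e12         # ~3 TB/s of HBM bandwidth

weight_bytes = params * bytes_per_param       # 140 GB read per decode step
seconds_per_token = weight_bytes / bandwidth  # ~0.047 s
print(f"{seconds_per_token * 1000:.0f} ms/token, "
      f"~{1 / seconds_per_token:.0f} tokens/s at batch size 1")
# A batch of B concurrent requests shares the same weight read, which is how
# serving stacks push aggregate throughput well past this single-stream bound.
```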

This is why fancier compute hardware doesn't help output speed as much as you'd expect — you're memory-bandwidth-bound, not compute-bound. The advances in output speed in 2024-2026 have mostly come from:

  • Speculative decoding — predict multiple tokens at once with a smaller model, verify with the big one
  • Continuous batching — multiple users' generations packed onto the same GPU pass
  • Larger batch sizes — amortize the memory read cost across more concurrent users
  • Faster memory (HBM3, HBM3e) — newer GPUs, more bandwidth

None of this changes the underlying asymmetry: reading input is parallel, writing output is sequential.

Practical implications for application design

Big input, small output workloads are cost-efficient. Document analysis, classification, extraction, summarization — these all play to the strengths of LLM pricing. You can feed a 100k token document and get back a 500-token answer for under $0.50.

Big output workloads are expensive. Generating long-form content (blog posts, code, novels) puts most of your spend on the slow, expensive side of the ledger. Be explicit about output length limits in your prompts.
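
A minimal sketch of capping output both ways, again using the Anthropic SDK with a placeholder model id and file name:

```python
import anthropic

client = anthropic.Anthropic()
report_text = open("report.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-6",  # illustrative model id
    max_tokens=300,             # hard ceiling on billable output tokens
    messages=[{
        "role": "user",
        # The instruction keeps the model from rambling; max_tokens caps the
        # worst case if it ignores you.
        "content": f"Summarize this report in at most 150 words:\n\n{report_text}",
    }],
)
print(response.content[0].text)
```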

Reasoning models hide output cost. GPT-5, Claude with extended thinking, DeepSeek R1 — these models generate hidden 'thinking' tokens before responding. Those tokens count as output. A reasoning answer that looks short might have used 5,000 thinking tokens you're paying for. Check the API response for reasoning_tokens or equivalent.
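
Here's a hedged sketch of pulling those numbers out of a response, using the OpenAI SDK's usage object as I understand it; the field names are worth verifying against the current API reference.

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",  # illustrative id for a reasoning model
    messages=[{"role": "user", "content": "Is 9931 prime? Answer yes or no."}],
)

usage = response.usage
details = usage.completion_tokens_details
reasoning = getattr(details, "reasoning_tokens", 0) or 0  # 0 if not reported
print(f"visible output tokens: {usage.completion_tokens - reasoning}")
print(f"hidden reasoning tokens (also billed as output): {reasoning}")
```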

Streaming doesn't change cost, just perception. Streaming the response token by token to the user makes it feel faster, but the underlying cost and total latency are the same.

Async batch APIs offer huge discounts. Anthropic's Message Batches and OpenAI's Batch API are 50% off both input and output if you can wait up to 24 hours. For non-realtime use cases (overnight processing, eval runs, content generation pipelines), this is real money saved.
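
As an illustrative sketch, submitting a hundred overnight summarization jobs through Anthropic's Message Batches API might look like this; the request shape and model id are assumptions to verify against the current docs.

```python
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-6",  # illustrative model id
                "max_tokens": 500,
                "messages": [{
                    "role": "user",
                    "content": f"Write a one-paragraph summary of topic #{i}.",
                }],
            },
        }
        for i in range(100)
    ]
)
# Poll batch.processing_status later, then download the per-request results.
print(batch.id, batch.processing_status)
```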

When this asymmetry doesn't matter

For low-volume use (under $100/month total LLM spend) — don't optimize, just build. The complexity of prompt caching and batching only pays off at scale.

For latency-critical user-facing chat — output speed dominates user experience, not cost. Pick a fast model first (Haiku, GPT-5 Mini, Gemini Flash) and worry about cost second.

For exploration and prototyping — model choice and prompt quality matter way more than cost optimization. You'll burn $10 figuring out the prompt and save $0.10 by optimizing it.

Further reading

  • LLM cost optimization — concrete techniques to cut your bill in half
  • Tokens vs words: how LLM pricing actually works — the underlying token concept
  • Prompt caching — deeper dive into the caching mechanism if you want to use it

Last updated: 2026-04-29
