

What is a context window? The hidden ceiling of every LLM

The context window is the amount of text the model can see at once. Bigger windows enabled the long-document era, but they don't solve every problem — and they cost real money.

Every LLM has a hard limit on how much text it can process in a single request. That limit is the context window. Hit the limit and the model can't see the parts of your input that got pushed out — they don't exist for it. Knowing the size of the window and what fits inside is the difference between an app that works and one that mysteriously "forgets" things.

The unit is tokens, not words

Context windows are measured in tokens. As a rough rule of thumb in English, 1,000 tokens ≈ 750 words; in Chinese, 1,000 tokens ≈ 500 to 700 characters depending on the tokenizer. Code is usually closer to 1 token per 3-4 characters.
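
If you want a quick estimate, OpenAI's open-source tiktoken library will count tokens for you. Other providers use different tokenizers, so treat the counts below as ballpark figures rather than exact numbers for Claude or Gemini:

  # Rough token counting with tiktoken (OpenAI's open-source tokenizer).
  # Other providers tokenize differently, so treat these counts as estimates.
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  prose = "The context window is the amount of text the model can see at once."
  print(len(enc.encode(prose)), "tokens for", len(prose.split()), "words")

  snippet = "def add(a, b):\n    return a + b\n"
  print(len(enc.encode(snippet)), "tokens for", len(snippet), "characters")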

In 2026 the typical context window sizes are:

  • GPT-5 / Claude Sonnet / Gemini 2.5 Pro: 200K tokens (Claude) to 1M+ tokens (Gemini long-context). 200K tokens is roughly a 500-page book (see the quick check after this list).
  • Smaller / cheaper models: 32K to 128K (still big enough for most tasks).
  • Local / open-weight 7B-13B models: 8K to 32K, sometimes more with tricks like RoPE scaling.
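
The 500-page figure is just the words-per-token rule of thumb applied with an assumed ~300 words per printed page:

  # Quick check of "200K tokens ≈ a 500-page book", using the ~0.75 words/token
  # rule above and an assumed ~300 words per printed page.
  tokens = 200_000
  words = tokens * 0.75        # ≈ 150,000 words
  pages = words / 300          # ≈ 500 pages
  print(f"{words:,.0f} words, about {pages:,.0f} pages")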

The window has to fit everything: system prompt, conversation history, your latest message, and the model's response (yes — output tokens count against the window too in many APIs).
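
A minimal sketch of that bookkeeping, with made-up numbers and function names rather than any particular provider's API:

  # Everything, including the reserved output space, has to fit in the window.
  # Numbers and names here are illustrative, not a specific provider's API.
  def fits(system_tok: int, history_tok: int, message_tok: int,
           max_output_tok: int, context_window: int = 200_000) -> bool:
      return system_tok + history_tok + message_tok + max_output_tok <= context_window

  print(fits(2_000, 150_000, 1_000, 8_000))   # True: 161K of 200K used
  print(fits(2_000, 195_000, 1_000, 8_000))   # False: this request would be rejected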

What the window actually contains

When you call an LLM, the prompt is the literal sequence of tokens going in. For a chat app, that's:

  1. The system prompt (set by the product).
  2. Every previous user/assistant message in the conversation.
  3. Your latest message.
  4. Reserved space for the response (often 4K-8K tokens).

If that total exceeds the window, the API will reject the request, or — in chat apps — the product will silently drop earlier messages to make room. This is why long conversations "lose memory": the early messages literally aren't being sent anymore.
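
Here's a sketch of that silent truncation, roughly what many chat frontends do. count_tokens here is a crude stand-in; a real implementation would use the provider's tokenizer (or the tiktoken estimate shown earlier):

  # Sketch of how a chat app "loses memory": drop the oldest messages until
  # the history fits the remaining budget.
  def count_tokens(text: str) -> int:
      return max(1, len(text) // 4)           # rough ~4 characters per token

  def trim_history(messages: list[dict], budget: int) -> list[dict]:
      """Keep the most recent messages whose combined size fits the budget."""
      kept, used = [], 0
      for msg in reversed(messages):          # walk newest-first
          cost = count_tokens(msg["content"])
          if used + cost > budget:
              break                           # everything older gets dropped
          kept.append(msg)
          used += cost
      return list(reversed(kept))             # restore chronological order

  history = [{"role": "user", "content": f"message {i} " * 50} for i in range(100)]
  print(len(trim_history(history, budget=2_000)), "of", len(history), "messages survive")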

Bigger windows don't solve everything

A few years ago, fitting a full book into a prompt sounded like science fiction. In 2026, Gemini Pro can handle 1-2 million tokens and Claude can hit 1M with extended context. Yet long context isn't a free win.

Quality degrades with distance. Models are better at finding facts near the start or end of a long context (the so-called "lost in the middle" effect). If you paste a 500K-token document and ask a question whose answer lives in the middle, accuracy drops measurably.

It's expensive. Modern APIs charge per input token. A single 200K-token Claude Sonnet call can cost over $1, and you pay for that whole prompt every single turn unless you use prompt caching. For high-traffic apps, the bill can dwarf compute and storage costs combined.
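
A back-of-envelope cost model shows why this adds up over a conversation. The prices below are placeholder assumptions for illustration, not current list prices; check your provider's rate card:

  # Without caching you resend (and pay for) the whole prompt on every turn.
  INPUT_PRICE_PER_MTOK = 3.00      # assumed $/million input tokens (placeholder)
  CACHED_READ_FRACTION = 0.10      # assumed cache-read price as a fraction of input

  def conversation_cost(prompt_tokens: int, turns: int, cached: bool) -> float:
      per_turn = prompt_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
      if not cached:
          return turns * per_turn
      # first turn pays (roughly) full price to write the cache; later turns read it cheaply
      return per_turn + (turns - 1) * per_turn * CACHED_READ_FRACTION

  print(f"200K prompt, 10 turns, no cache: ${conversation_cost(200_000, 10, False):.2f}")
  print(f"200K prompt, 10 turns, cached:   ${conversation_cost(200_000, 10, True):.2f}")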

It's slow. A 200K-token prompt can take 15-30 seconds just to process before the first output token appears. For interactive UIs, that's deadly.

RAG often beats long context. Putting only the relevant 5,000 tokens in a context window — found via search — usually gives better answers than dumping the whole 500,000 tokens in. RAG systems exist precisely because long context isn't the right answer for most retrieval problems.

How to think about your context budget

When designing an LLM-powered feature, plan a context budget the way you'd plan a server's memory budget (the sketch after this list pulls the rules together):

  • Reserve room for output. Decide how long the model's answer can be, multiply by 1.2 for safety.
  • Cap conversation history. Truncate or summarize older messages once you exceed a threshold.
  • Inject only what's needed. Use RAG to fetch the 3-10 most relevant chunks, not the whole knowledge base.
  • Use prompt caching. Both Anthropic and OpenAI offer prompt caching, which bills reused prompt prefixes at a steep discount (Anthropic charges cache reads at roughly 10% of the normal input price; OpenAI typically discounts cached input by about half). For long system prompts and shared documents, this can be a 5-10× saver on input cost.
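
Here is a sketch that pulls the first three rules together, reusing count_tokens and trim_history from the earlier snippets. retrieve_chunks is a hypothetical search/RAG helper, not a specific library's API:

  # Assemble a prompt inside a fixed budget: reserve output space, inject only
  # the top-k retrieved chunks, and cap conversation history with what's left.
  # retrieve_chunks is a hypothetical search/RAG helper, not a real library call.
  def build_prompt(system: str, history: list[dict], user_msg: str,
                   retrieve_chunks, context_window: int = 200_000,
                   max_output: int = 8_000) -> dict:
      budget = context_window - int(max_output * 1.2)     # reserve output, +20% safety
      budget -= count_tokens(system) + count_tokens(user_msg)

      chunks = retrieve_chunks(user_msg, k=5)              # only the relevant pieces
      context = "\n\n".join(chunks)
      budget -= count_tokens(context)

      trimmed = trim_history(history, budget)              # truncate old messages to fit
      return {"system": system, "context": context, "history": trimmed, "user": user_msg}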

When the window IS the right answer

Sometimes long context beats RAG. Use the long-context approach when:

  • The document is small enough to fit and the question requires the whole thing (e.g., "summarize this contract")
  • You'd lose precision by chunking (legal, medical reasoning where context across sections matters)
  • The query is one-off and you don't want to maintain a vector store
  • You need traceable reasoning over the entire input — RAG can miss key pieces silently

When NOT to rely on a big window

  • Repetitive queries against the same knowledge base. RAG with caching is cheaper and faster.
  • Latency-sensitive UI. Stream-first products want short prompts.
  • Anything where you'd be tempted to dump 500 documents in. Don't. Search first, then prompt with the top-K.

Further reading

  • What is a token in LLM-speak
  • What is RAG (Retrieval-Augmented Generation)
  • Tokens vs words: how LLM pricing actually works
  • LoRA vs fine-tuning vs RAG: which solves which problem
  • Why input tokens cost less than output tokens

Last updated: 2026-04-29
