The maximum number of tokens an LLM can read and reason over in a single call — covering the system prompt, conversation history, and any attached documents.
The context window is the maximum number of tokens an LLM can process in a single forward pass. Anything you put in (system prompt, conversation history, retrieved documents, the user's question) counts toward this limit; exceed it and you have to truncate or summarize. Modern frontier models offer 200k-token (Claude), 1M-token (Gemini), or even 2M-token contexts; older or smaller models may top out at 4k-32k.
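As a minimal sketch of that budget accounting, assuming OpenAI's tiktoken tokenizer as a stand-in (each provider ships its own counter, and exact counts vary by model):

```python
# A minimal sketch of context-budget accounting. Assumes tiktoken as
# a stand-in tokenizer; the 200k default window is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(system_prompt: str, history: list[str],
                   question: str, window: int = 200_000) -> bool:
    """Count every token that will enter the context window."""
    total = sum(len(enc.encode(part))
                for part in [system_prompt, *history, question])
    return total <= window

def truncate_to_budget(text: str, budget: int) -> str:
    """Hard-truncate to a token budget (summarizing is the gentler
    alternative mentioned above)."""
    return enc.decode(enc.encode(text)[:budget])
```

In practice you'd usually summarize older turns rather than hard-truncate, but the accounting is the same either way.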
It matters because the window determines what kind of work the model can do without external memory. With a 4k window, you can fit a few pages of text; with 200k, a small codebase or a short book; with 1M+, multiple books, hours of meeting transcripts, or large legal corpora. RAG was invented partly because windows used to be small; longer windows reduce (but don't eliminate) the need for retrieval.
A concrete example: feeding a 100k-token codebase into Claude lets you ask "refactor the auth module to use JWT instead of sessions" and get a coherent multi-file edit plan. The same task with a 4k window would require chunking, retrieval, and orchestration code.
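A hedged sketch of that small-window fallback, where you rank files by relevance and greedily pack the best ones into the budget; `score_relevance` here is a hypothetical ranking function (embedding similarity, BM25, ...), not any particular library's API:

```python
# Greedy file packing under a token budget: the chunk-and-select step
# a 4k window forces on you. `score_relevance` is a hypothetical
# ranking function supplied by the caller.
from pathlib import Path
from typing import Callable
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pack_files(paths: list[Path], budget: int,
               score_relevance: Callable[[Path], float]) -> list[Path]:
    """Return the highest-scoring files whose combined token count
    fits in `budget`; the rest must be chunked and retrieved."""
    chosen, used = [], 0
    for p in sorted(paths, key=score_relevance, reverse=True):
        n = len(enc.encode(p.read_text(errors="ignore")))
        if used + n <= budget:
            chosen.append(p)
            used += n
    return chosen
```

Everything that doesn't fit has to go through retrieval at question time, which is exactly the orchestration overhead a larger window avoids.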
Caveats: just because a model has a 200k window doesn't mean it uses all 200k effectively. "Lost in the middle" is a known failure mode: models attend better to the start and end of the context than to the middle. Long-context evals (needle-in-a-haystack, RULER) measure this; a minimal version is sketched below. Cost also scales with input size, since you pay per input token.

Related: KV cache, attention, RAG, lost in the middle.
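Here is that needle-in-a-haystack eval as a minimal sketch: plant one fact at varying depths in filler text and check whether the model retrieves it. `ask_model` is a hypothetical wrapper around whichever chat API you use, and `enc` is any tokenizer with encode/decode (e.g. tiktoken); neither is a specific real API.

```python
# Minimal needle-in-a-haystack sketch. `ask_model` and the filler
# sentence are illustrative assumptions, not a real benchmark harness.
def build_haystack(needle: str, filler: str, depth: float,
                   target_tokens: int, enc) -> str:
    """Place `needle` at `depth` (0.0 = start, 1.0 = end) of roughly
    `target_tokens` of repeated filler."""
    filler_tokens = enc.encode(filler)
    reps = target_tokens // len(filler_tokens) + 1
    tokens = (filler_tokens * reps)[:target_tokens]
    cut = int(len(tokens) * depth)
    return enc.decode(tokens[:cut]) + "\n" + needle + "\n" + enc.decode(tokens[cut:])

def run_eval(ask_model, enc):
    needle = "The magic number is 7481."
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(needle, "The grass is green. ", depth, 50_000, enc)
        answer = ask_model(prompt + "\n\nWhat is the magic number?")
        print(f"depth={depth:.2f} found={'7481' in answer}")
```

Plotting accuracy against depth often shows the U-shape that "lost in the middle" describes: strong recall at the edges, weaker in the center.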