
Technique

KV cache

A cache of the Key and Value tensors from past tokens that lets transformers avoid recomputing them at each new generation step — the main reason long contexts use so much memory.

When a transformer generates token N+1, the attention mechanism needs the Keys and Values of all previous tokens 1..N. Recomputing them at every step would be O(N²) work. The KV cache stores these tensors after they are first computed, so each new token needs only one fresh K and V plus a lookup into the cache, turning generation from quadratic into linear work per token.

The cache is what makes generation tractable, but it is also the memory hog. For a 70B model, every token of context can take 1 MB or more of cache, so a 100k-token context needs 100+ GB of KV cache. That is why long-context inference needs so much GPU memory and why providers charge more for long inputs.

A concrete example: when you paste a 50k-token document into Claude or GPT and ask follow-up questions, the API caches the KV tensors for that document, so subsequent questions reuse the same prefill work. Anthropic's prompt caching feature essentially exposes KV cache reuse to the API user, charging less for cached prefix tokens.

Optimizations worth knowing: PagedAttention (vLLM) manages the KV cache like virtual memory, GQA (grouped-query attention) shrinks the cache by sharing K/V across groups of query heads, and MLA (DeepSeek) compresses it further into a low-rank latent.

Related: attention, context window, prefill, prompt caching.
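Here is a minimal sketch of the mechanism in a single-head NumPy decode loop. The names (KVCache, attend, d_model) and the model shapes are illustrative assumptions, not any library's API; a real model has many layers and heads, batching, and GPU kernels, but the caching idea is the same: compute K and V once per token, append them, and attend over the stored arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for trained projection weights of one attention head.
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Stores one K row and one V row per token seen so far."""
    def __init__(self):
        self.keys = []     # each entry: (d_model,) vector
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def as_arrays(self):
        return np.stack(self.keys), np.stack(self.values)  # (N, d_model) each

def attend(x_new, cache):
    """One decode step: project only the new token, reuse cached K/V for the rest."""
    q = x_new @ W_q
    k = x_new @ W_k
    v = x_new @ W_v
    cache.append(k, v)                    # O(1) fresh projection work per step
    K, V = cache.as_arrays()              # past K/V come from the cache, never recomputed
    scores = K @ q / np.sqrt(d_model)     # (N,) attention scores against all cached keys
    weights = softmax(scores)
    return weights @ V                    # attention output for the new token

cache = KVCache()
for step in range(5):
    x = rng.standard_normal(d_model)      # pretend token embedding
    out = attend(x, cache)
print("cached tokens:", len(cache.keys))  # 5: the cache grows linearly with context

# Rough per-token KV memory for assumed 70B-class shapes (80 layers, head_dim 128,
# fp16). These numbers are illustrative, not a specific model's published config.
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # x2 for K and V

print(kv_bytes_per_token(n_kv_heads=64) / 1e6, "MB/token")  # ~2.6 MB with full MHA
print(kv_bytes_per_token(n_kv_heads=8) / 1e6, "MB/token")   # ~0.33 MB with GQA
```

The loop shows both sides of the trade: each step only appends one K/V row and does a cheap lookup, but the arrays themselves grow with every token of context, which is where the 100+ GB figures for long contexts come from and why shrinking the per-token footprint (GQA, MLA) matters so much.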

Last updated: 2026-04-29
