Technique

Speculative decoding

An inference speed-up in which a small "draft" model proposes several tokens and a large model verifies them in parallel, making LLM generation 2-3× faster with no quality loss.

Speculative decoding speeds up LLM inference without changing the output. A small "draft" model (cheap to run) generates the next 4-8 tokens; the large "target" model then verifies them all in a single forward pass. Tokens the target agrees with are accepted; at the first disagreement, the target's own token is substituted and the rest of the draft is discarded. The loop then repeats from the corrected position (sketched in code below).

It matters because LLM generation is bottlenecked by the sequential nature of decoding: each token normally requires a full forward pass through the large model. By batching the target model's checks, you trade some extra compute for fewer sequential passes, typically a 2-3× wall-clock speedup with outputs mathematically identical to running the target model alone.

A concrete example: serving Llama 3 70B alone might generate 40 tokens/sec; pair it with a Llama 3 8B draft model and you might get 100 tokens/sec with the exact same output distribution. The technique works best when draft and target agree often, which is why the draft is usually a smaller model from the same family.

Most production inference engines support it: vLLM, TensorRT-LLM, and llama.cpp all ship implementations. OpenAI, Anthropic, and Google almost certainly use some form of speculative decoding in their APIs.

Related: KV cache, inference, draft model, batching.
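To make the propose-verify-correct loop concrete, here is a minimal sketch of the greedy variant in Python. The two "models" are toy logit tables standing in for real transformer forward passes, and every name (draft_logits, target_logits_batch, the draft length k) is illustrative rather than taken from any library. The full method adds a rejection-sampling acceptance rule so that sampled, not just greedy, outputs also match the target's distribution exactly.

```python
import numpy as np

VOCAB = 50  # toy vocabulary size
rng = np.random.default_rng(0)

# Toy stand-in "models": each is just a logit lookup keyed on the previous
# token. In practice these are transformer forward passes (e.g. a Llama 3 8B
# draft paired with a Llama 3 70B target).
W_draft = rng.normal(size=(VOCAB, VOCAB))
W_target = W_draft + 0.1 * rng.normal(size=(VOCAB, VOCAB))  # similar, not identical

def draft_logits(seq):
    """Next-token logits from the cheap draft model (toy: depends on last token)."""
    return W_draft[seq[-1]]

def target_logits_batch(ctx, proposed):
    """Target logits at every position covering the k proposed tokens, plus
    one extra position past the last proposal. A real transformer produces
    all of these in a single batched forward pass."""
    seq = ctx + proposed
    start = len(ctx) - 1
    return [W_target[seq[i]] for i in range(start, start + len(proposed) + 1)]

def speculative_decode_greedy(ctx, n_tokens, k=4):
    """Greedy speculative decoding: output is token-for-token identical to
    greedy decoding with the target alone, just reached in fewer target passes."""
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap, sequential).
        proposed = []
        for _ in range(k):
            proposed.append(int(np.argmax(draft_logits(out + proposed))))
        # 2. Target checks all k proposals in one batched pass.
        logits = target_logits_batch(out, proposed)
        # 3. Accept the matching prefix; correct the first disagreement.
        for i, tok in enumerate(proposed):
            target_tok = int(np.argmax(logits[i]))
            out.append(target_tok)  # accept if equal, else this is the correction
            if target_tok != tok:
                break                # discard the rest of the draft
        else:
            # Every proposal accepted: the extra logit yields a bonus token.
            out.append(int(np.argmax(logits[k])))
    return out[len(ctx):len(ctx) + n_tokens]

print(speculative_decode_greedy([3], n_tokens=12))
```

The speedup comes from step 2: one target pass can validate up to k+1 tokens. Under the simplifying assumption that each draft token is independently accepted with probability α, one target pass yields (1 − α^(k+1))/(1 − α) tokens on average, which is how a 70B target paired with an 8B draft can more than double its throughput.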

Last updated: 2026-04-29
