
Advanced · 10 min read

Speculative decoding: how to make inference 2-3× faster

A small model proposes tokens. A big model verifies them in parallel. Same output, dramatically less latency.

If you've benchmarked LLM inference and found that latency is dominated by sequential token generation — each token requires a full forward pass through the model — you've found the right opportunity for speculative decoding. It's the single most impactful inference optimization of the past three years. Used by OpenAI, Anthropic, and Google internally. Available in vLLM, SGLang, and TGI for self-hosted setups.

The insight is elegant: most tokens an LLM generates are easy. They could have been guessed by a much smaller model. So why not let a small model guess, and only spend big-model compute when verification is needed?

The fundamental bottleneck

A 70B-parameter model served on H100s generates roughly 80-100 tokens per second in single-stream serving. Why so slow?

  • A forward pass through 70B parameters reads ~140GB of weights from VRAM (in fp16).
  • HBM bandwidth on an H100 is ~3TB/s.
  • Theoretical floor on a single GPU: 140GB / 3TB/s ≈ 47ms per forward pass, and one forward pass produces exactly one new token.
  • In practice the model is sharded across GPUs (tensor parallelism) and/or quantized, so each pass streams only a fraction of those bytes; a forward pass lands around 10-12ms.
  • 1000ms / 10ms ≈ 100 tokens/sec.

Generating each token is a memory-bandwidth problem, not a compute problem. The GPU is mostly idle (compute-wise) while waiting for weights to stream from VRAM. If you could verify multiple tokens in a single forward pass, you'd amortize the memory cost.
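
To see where those numbers come from, here's the back-of-the-envelope arithmetic as a tiny Python sketch. The byte counts and bandwidth are the rough figures from the list above, not measured values:

# Decode-latency floor: every new token requires streaming the weights from HBM.
weight_bytes = 70e9 * 2            # 70B params in fp16 ≈ 140 GB
hbm_bandwidth = 3e12               # ~3 TB/s per H100

floor = weight_bytes / hbm_bandwidth               # ≈ 0.047 s on one GPU
print(f"one GPU, fp16: {floor*1e3:.0f} ms/token ≈ {1/floor:.0f} tok/s")

# Shard across 4 GPUs (tensor parallel) or quantize to ~4 bits:
# each forward pass now streams roughly a quarter of the bytes.
per_pass = weight_bytes / 4 / hbm_bandwidth        # ≈ 0.012 s
print(f"sharded or quantized: {per_pass*1e3:.0f} ms/token ≈ {1/per_pass:.0f} tok/s")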

Speculative decoding does exactly that.

The algorithm

The core idea, by Leviathan et al. (2022) and Chen et al. (2023):

  1. A draft model (small, fast) generates K candidate tokens autoregressively. Cheap.
  2. The target model (big, slow) processes the original prompt + the K draft tokens in parallel — one forward pass.
  3. Compare the target model's distributions at each position to the draft model's. Accept the longest prefix where the draft was "close enough" (probabilistically).
  4. For the first rejected position, sample from the target model's distribution.
  5. Discard the rest of the draft.

Net result: per target-model forward pass, you accept ~2-4 tokens instead of 1. Even after the cost of running the draft model, you net 2-3× speedup.

The acceptance test (Chen et al.'s lossless variant) is mathematically equivalent to sampling from the target model alone: the output distribution is identical. This is not an approximation. You get the same answer, just faster.
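
Here's a minimal sketch of one speculation round in Python, mostly to make the accept/reject rule concrete. draft_dist and target_dist are hypothetical stand-ins for the real model calls (each returns next-token probability vectors); the acceptance rule itself is the standard rejection-sampling scheme from the papers above.

import numpy as np

def speculation_round(prompt, draft_dist, target_dist, K=4, rng=None):
    """One round of lossless speculative decoding (sketch).

    draft_dist(tokens)  -> next-token probability vector under the small model
    target_dist(tokens) -> one probability vector per position, all computed
                           in a single parallel forward pass of the big model
    Both functions are hypothetical stand-ins for real model calls.
    """
    rng = rng or np.random.default_rng()

    # 1. Draft K candidate tokens autoregressively with the small model.
    tokens, drafts = list(prompt), []
    for _ in range(K):
        q = np.asarray(draft_dist(tokens), dtype=float)
        tok = rng.choice(len(q), p=q)
        drafts.append((tok, q))
        tokens.append(tok)

    # 2. One target forward pass scores prompt + all K draft tokens at once.
    #    p[i] is the target's distribution after the prompt plus i draft tokens.
    p = [np.asarray(d, dtype=float) for d in target_dist(tokens)]  # K+1 vectors

    accepted = []
    for i, (tok, q) in enumerate(drafts):
        # 3. Accept the draft token with probability min(1, p/q); this keeps
        #    the overall output distribution exactly equal to the target's.
        if rng.random() < min(1.0, p[i][tok] / q[tok]):
            accepted.append(tok)
        else:
            # 4. On the first rejection, resample from the residual
            #    max(p - q, 0), renormalized, and discard the rest of the draft.
            residual = np.clip(p[i] - q, 0.0, None)
            accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
            return accepted

    # All K drafts accepted: the same target pass yields one bonus token for free.
    accepted.append(rng.choice(len(p[K]), p=p[K]))
    return accepted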

Why it works

Most text is predictable. "The capital of France is Pa" — the next token is almost certainly "ris." A small model gets this right easily. The big model gets called only at the genuinely uncertain branches: rare names, technical terms, the first few tokens of a creative answer.

In practice, on common workloads, 60-80% of draft tokens get accepted. With 4 draft tokens per round, you average ~3 accepted per target forward pass. ~3× speedup.
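
The arithmetic behind that estimate: if each draft token is accepted independently with probability alpha (an idealization), the expected number of tokens produced per target forward pass, counting the bonus token you get when all K drafts survive, is (1 - alpha^(K+1)) / (1 - alpha). That is the formula from the Leviathan et al. analysis; plugging in the numbers above:

# Expected tokens per target forward pass, assuming each draft token is
# accepted independently with probability alpha (an idealization).
def expected_tokens(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    print(f"accept rate {alpha:.0%}, K=4 -> {expected_tokens(alpha, 4):.2f} tokens/pass")
# accept rate 60%, K=4 -> 2.31 tokens/pass
# accept rate 70%, K=4 -> 2.77 tokens/pass
# accept rate 80%, K=4 -> 3.36 tokens/pass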

Picking a draft model

The draft model needs to be:

  • Same tokenizer as the target. Otherwise the two models' distributions aren't over the same vocabulary and can't be compared (a quick check is sketched after this list).
  • Much smaller than the target. A 1B-7B draft for a 70B target is the sweet spot; the draft should run at least 5-10× faster than the target.
  • Reasonably aligned in distribution. A draft model with very different fine-tuning from the target will have low acceptance rates (< 30%) and the speedup vanishes.
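
The tokenizer requirement is easy to sanity-check up front. A minimal sketch with Hugging Face transformers, using the Llama pairing from the list below (swap in your own candidates):

from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Identical vocabularies mean token IDs line up, so the two models'
# next-token distributions are directly comparable position by position.
assert target_tok.get_vocab() == draft_tok.get_vocab(), "tokenizer mismatch"
print("tokenizers match:", len(target_tok.get_vocab()), "tokens")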

Standard pairings in 2026:

  • Llama 3.3 70B (target) + Llama 3.2 1B (draft). ~3× speedup typical.
  • Qwen 2.5 72B + Qwen 2.5 0.5B. ~2.5× typical.
  • Claude / GPT-5 / Gemini 3. Providers run their own internal draft models; you don't pick.

Variants you'll hear about

  • EAGLE / EAGLE-2. The draft isn't a separate small LLM but a lightweight head trained on top of the frozen target's hidden states. Higher acceptance rate (often 70%+). The 2025 default for serious self-hosters.
  • Medusa. Multiple prediction heads attached to the target, each predicting a different number of positions ahead. Cheaper than a separate draft, slightly worse acceptance.
  • Lookahead decoding. Self-speculation: the model drafts its own continuations and verifies them in parallel, with no separate draft model needed.
  • Tree attention / multi-draft. Generate multiple draft sequences in a tree, accept the best path. More complex; ~10-20% additional speedup.

For most teams in 2026, EAGLE-2 is the practical sweet spot: best speedup, well-supported in vLLM and SGLang, easy to configure.

Setup in vLLM (real example)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager

That's it. Single flag pair turns it on. vLLM 2026 has EAGLE support too:

  --speculative-model yuhuili/EAGLE-Llama-3-70B \
  --num-speculative-tokens 5

Monitor the acceptance rate in the logs. If it's < 50%, the draft is mismatched — try a different draft model.
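
To measure the effect end to end, here's a rough client-side throughput check against vLLM's OpenAI-compatible endpoint. The base URL, port, and model name assume the launch command above and vLLM's defaults; adjust to your deployment. Run it once with the speculative flags and once without to see the difference:

import time
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start, n_chunks = time.time(), 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1          # one streamed chunk ≈ one token, good enough here
print(f"~{n_chunks / (time.time() - start):.0f} tok/s")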

Tradeoffs

Speculative decoding isn't free. Three real costs:

  • Memory. You're now hosting two models on the same GPU(s). A 70B + 1B drafter takes 5-10% extra VRAM. EAGLE adds less (just an extra head).
  • Throughput vs latency. Speculative decoding optimizes single-stream latency. For high-batch throughput serving, it can hurt — the GPU was already saturated with parallel work, and adding draft computation steals capacity. vLLM auto-disables it under high load in some configurations.
  • Setup complexity. Picking the right draft, getting the tokenizer to match, tuning the speculation length and monitoring acceptance rates. Budget an extra 1-2 days the first time.

When NOT to use speculative decoding

  • High concurrent traffic, throughput is the goal. Pure batching is more efficient.
  • Very low-latency requirement and you don't need 70B. Just use a smaller model end-to-end.
  • Tiny models (7B and below). The draft model overhead eats the gains.
  • You're using a hosted API. Providers manage this internally. You can't enable it on the OpenAI/Anthropic/Gemini API explicitly; they decide based on traffic patterns.

Real numbers

My measured speedups on Llama 3.3 70B with various drafts on a single H100:

  • No speculative decoding: 85 tok/sec single stream.
  • Llama 3.2 1B drafter, 5 tokens: 220 tok/sec. ~2.6× speedup. 65% accept rate.
  • EAGLE-Llama-3 head, 5 tokens: 280 tok/sec. ~3.3× speedup. 73% accept rate.
  • High concurrency (16 streams) with EAGLE: aggregate throughput drops 5% vs no spec. (Don't use it for batched serving.)

For user-facing chat where latency matters, this is the difference between feeling fast and feeling slow. With speculative, a 70B feels like a 30B. Without it, the latency floor is what it is.

What's coming

The 2026 frontier of inference optimization includes:

  • Prompt-level speculation. The user's previous turn becomes a partial draft for tool use scenarios.
  • Multi-token prediction (MTP) at training time. Models like DeepSeek V3 are trained to predict multiple tokens at each position natively, making speculative decoding even more effective.
  • Parallel sampling with verification. Generate K full candidate responses in parallel, pick the best — a different speedup axis.

Speculative decoding has become table stakes. If you're self-hosting and your latency matters, you should be using it.

Further reading

  • Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022).
  • Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023).
  • EAGLE / EAGLE-2 papers (Li et al, 2024).
  • vLLM and SGLang docs on speculative decoding.
  • Look up: Medusa, Lookahead decoding, EAGLE, multi-token prediction.

Last updated: 2026-04-29
