Most teams that say "we should self-host" haven't done the math. The Anthropic API at $3/M input + $15/M output is cheap until you hit consistent high-throughput traffic, and even then the math doesn't always favor self-hosting once you account for ops time and tail latency.
But when self-hosting does win — fine-tuned models, regulated industries, very high QPS — vLLM is the production-grade choice in 2026. This post covers the full setup, a tuning playbook, and benchmarks that match real workloads.
When self-hosting beats API
Run the napkin math first:
- Frontier model (Claude 4.7 Opus, GPT-5): API basically always wins until you're spending $30k+/month and have heavy reasoning load. Frontier models are not self-hostable in any practical sense.
- Mid-tier (Claude Haiku, GPT-5 mini): API wins until ~$10k/month at steady throughput.
- Open-weight (Llama 70B, Mixtral 8x22B, Qwen 2.5 72B): Self-hosting on a rented H100 is ~$2/hr. If you can keep the GPU busy >50% of the time, self-hosting beats API equivalents.
- Small open-weight (Llama 8B, Phi-4, Qwen 7B): Self-hosting always wins on cost if you have 1000+ req/day. Use quantized models on a single 24GB GPU.
- Custom fine-tuned model: Self-hosting is the only option (no API will serve your adapter).
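The napkin math above reduces to one break-even number: what fraction of the hour does the GPU need to be busy before self-hosting beats the API's price per token? A quick sketch; the GPU price, throughput, and API price below are illustrative placeholders, not quotes:

```python
def breakeven_utilization(gpu_cost_per_hr, tokens_per_sec, api_cost_per_m):
    """Minimum fraction of each hour the GPU must spend generating
    for self-hosting to match the API's $/M-token price."""
    m_tokens_per_hr = tokens_per_sec * 3600 / 1e6  # M tokens/hr at 100% busy
    return gpu_cost_per_hr / (m_tokens_per_hr * api_cost_per_m)

# Illustrative: $2/hr rented H100, 6,000 tok/s batched, $0.40/M API price
util = breakeven_utilization(2.00, 6000, 0.40)
print(f"self-hosting wins above ~{util:.0%} utilization")
```

Plug in the API model you'd actually be replacing; cheaper API tiers push the break-even utilization up, expensive ones pull it down.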
The other reasons self-hosting wins, regardless of cost:
- Privacy / compliance. Healthcare, finance, government — sometimes the data can't leave your network at all.
- Custom kernel work. You want speculative decoding, KV cache sharing across requests, custom attention mechanisms.
- Latency floor. API calls carry a 200-400ms baseline including network. Self-hosted in the same datacenter as your app: ~50ms.
Why vLLM specifically
The inference server space in 2026:
- vLLM. The community standard. Best throughput-per-GPU, widest model support, best dynamic batching. UC Berkeley project, very active.
- SGLang. Rising star. Strong on structured outputs and agent workflows. Slightly faster than vLLM on some workloads.
- TensorRT-LLM (NVIDIA). Best single-stream latency. Painful to set up.
- TGI (HuggingFace). Production-stable, slightly behind vLLM on raw throughput. Good Docker story.
- Ollama / llama.cpp. Great for laptops and small dev workflows. Not for production.
For most teams in 2026: vLLM. The performance gap with the alternatives is small, and the docs/community make it the path of least resistance.
Setup: from zero to serving
Minimum viable production setup, single H100 80GB serving Llama 3.3 70B Instruct:
```shell
# 1. Install on Linux (Ubuntu 22.04+)
pip install vllm

# 2. Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --port 8000
```
That's it. You now have an OpenAI-compatible API at http://localhost:8000/v1. Point any OpenAI SDK at it by setting base_url.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```
The flags that matter
The defaults are not optimal for production. The critical ones:
- `--gpu-memory-utilization`. Defaults to 0.9. Bump to 0.92-0.95 for more KV cache; lower it on shared GPUs. Higher = more concurrent requests, but more OOM risk.
- `--max-model-len`. The context window you'll allow. Lower = more KV cache per request = higher throughput. If your real prompts are 2k tokens, set this to 4096, not 128000.
- `--max-num-seqs`. How many requests can be batched together. The default of 256 is generous; tune based on memory.
- `--enable-prefix-caching`. ALWAYS enable. If you have a long system prompt that repeats, this caches it. 2-5× throughput gain on repeated workloads.
- `--tensor-parallel-size`. Number of GPUs the model is split across. 1 if the (possibly quantized) model fits on one card; 2 to split a fp16 70B (~140GB of weights) across 2× 80GB GPUs.
- `--quantization fp8` or `--quantization awq`. Use a quantized model for ~2× throughput at small quality cost. fp8 on H100, AWQ for Ada GPUs.
- `--enable-chunked-prefill`. Splits long prompts into chunks. Smoother latency for mixed short/long requests.
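To see why `--max-model-len` and `--gpu-memory-utilization` trade off directly against concurrency, estimate the KV cache cost per token. A rough sketch; the layer/head counts below are Llama-3-70B-class values, so check your model's config:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # K and V caches: one entry per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama-3-70B-class: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes)
per_token = kv_bytes_per_token(80, 8, 128, 2)  # 327,680 bytes = 320 KB/token
per_8k_seq_gb = per_token * 8192 / 1024**3     # ~2.5 GB per full 8k sequence
```

So ~10GB of free KV cache holds only about four concurrent max-length 8k sequences; halving `--max-model-len` roughly doubles that, which is exactly the throughput lever described above.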
Throughput benchmarks (real numbers)
On a single H100 80GB serving Llama 3.3 70B Instruct, fp16, with prefix caching, my measured numbers:
- Sequential (1 req at a time): ~85 tokens/sec generation. Good for low-latency single-stream.
- Concurrent (32 in flight): ~6,400 tokens/sec aggregate generation. This is dynamic batching at work.
- With fp8 quantization, 32 concurrent: ~10,000 tokens/sec. ~50% throughput gain, marginal quality drop on most tasks.
- Cost vs API: 6,400 tok/s on a $2.50/hr H100 works out to ~$0.11/M tokens. Compare to GPT-5 mini at $0.40/M output. Self-hosting wins ~3.5× on $/token if you can keep the GPU busy.
If your average GPU utilization is 30%, the cost advantage shrinks to roughly break-even with API.
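That comparison is worth keeping as a one-liner you can rerun with your own rates. A sketch, where the utilization argument spreads the hourly GPU cost over the tokens actually generated:

```python
def cost_per_m_tokens(gpu_cost_per_hr, tokens_per_sec, utilization=1.0):
    """Effective $/M output tokens: hourly GPU cost divided by
    M tokens generated per hour at the given utilization."""
    m_tokens_per_hr = tokens_per_sec * 3600 / 1e6 * utilization
    return gpu_cost_per_hr / m_tokens_per_hr

print(cost_per_m_tokens(2.50, 6400))        # ~0.11 at full utilization
print(cost_per_m_tokens(2.50, 6400, 0.30))  # ~0.36, roughly API break-even
```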
Production deployment patterns
A real deployment in 2026 looks like:
- vLLM in a Docker container. Use the official `vllm/vllm-openai` image. Pin the version.
- Behind a load balancer. Multiple vLLM instances if you need more than one GPU's throughput. Round-robin works fine; sticky sessions only if you're using KV cache sharing.
- Health checks. vLLM has `/health` and `/metrics` endpoints. Wire them into your monitoring.
- Autoscaling. Cold start for vLLM is 60-120 seconds (model loading). Scale based on request queue depth, not CPU.
- Fallback to API. Always have a fallback to API for when your GPUs are saturated or down. Don't be one outage from production failure.
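The API fallback is worth giving a concrete shape. A minimal sketch, assuming each backend is exposed as a callable; the OpenAI-SDK wiring is left as comments, since endpoints, keys, and model names are deployment-specific:

```python
def chat_with_fallback(backends, messages):
    """Try backends in order (self-hosted first); return the first success.

    backends: list of (name, call) pairs, where call(messages) returns a
    completion or raises on saturation/outage.
    """
    last_err = None
    for name, call in backends:
        try:
            return name, call(messages)
        except Exception as err:  # timeout, 503, connection refused...
            last_err = err
    raise RuntimeError("all LLM backends failed") from last_err

# Wiring sketch (hypothetical endpoints):
# local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="local")
# backends = [
#     ("vllm", lambda m: local.chat.completions.create(model=MODEL, messages=m)),
#     ("api",  lambda m: cloud.chat.completions.create(model="gpt-5-mini", messages=m)),
# ]
```

Add per-backend timeouts in production so a hung local server fails over quickly instead of stalling the request.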
Managed options if you don't want to run hardware: Modal, RunPod, Fireworks, Baseten. They charge a markup over raw GPU but handle ops.
What goes wrong in production
Things that ate my weekend so they don't eat yours:
- OOM during long prompts. A request with 30k context + 4k output can OOM mid-generation. Set `--max-model-len` to a value your VRAM can handle, even if the model supports more.
- Tokenizer mismatch. Don't pip install a different transformers version than vLLM expects. Pin exact versions.
- Latency spikes during prefill. Long prompts hold up the batch. `--enable-chunked-prefill` helps; setting `--num-scheduler-steps 8` helps further on heavy mixed traffic.
- CUDA OOM after running fine. Memory fragmentation. Restart the server when this happens; investigate `--enforce-eager` if frequent.
- Adapter loading. vLLM supports LoRA adapter swapping per request via `--enable-lora`. Each adapter takes ~200MB of GPU memory.
When NOT to self-host
- You have <1000 requests/day. API is cheaper, simpler, more reliable.
- You want frontier quality. No open-weight model in 2026 matches Claude 4.7 Opus or GPT-5 on hard reasoning tasks.
- You can't tolerate downtime. A self-hosted single-GPU setup will have outages. Multi-GPU + redundancy = real ops investment.
- You don't have GPU expertise on the team. vLLM is straightforward but the failure modes (CUDA mismatches, NCCL errors, kernel crashes) need someone comfortable in Linux and CUDA.
Further reading
- vLLM official docs and tuning guide.
- Efficient Memory Management for Large Language Model Serving with PagedAttention — the original vLLM paper.
- SGLang docs (the main alternative worth knowing).
- Look up: prefix caching, speculative decoding, paged attention, continuous batching.