LLM Deploy
vllm-project/vllm
High-throughput LLM inference server with PagedAttention, continuous batching, and OpenAI-compatible API.
GitHub stats
- Stars: 78,604
- Forks: 16,254
- Watchers: 535
- Open issues: 4,648
Meta
- License: Apache-2.0
- Primary language: Python
- Last commit: 2026-04-29
- Stats fetched at: 2026-04-29
vLLM is a production-grade inference and serving engine for open-weight LLMs, built around PagedAttention for efficient KV cache management plus continuous batching to maximize GPU utilization. It exposes an OpenAI-compatible HTTP server, so you can swap it in behind existing client SDKs. It supports Llama, Qwen, DeepSeek, MoE models, multi-LoRA, tensor/pipeline parallelism, and quantization (AWQ/GPTQ/FP8), and runs on NVIDIA GPUs, AMD ROCm, and TPUs. Install via `pip install vllm` or the official Docker image.
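A minimal sketch of the workflow described above, launching the OpenAI-compatible server and querying it over HTTP. The model name is illustrative (any supported checkpoint works), and port 8000 is vLLM's default:

```shell
# Install vLLM (pin the version in production; see verdict below)
pip install vllm

# Launch the OpenAI-compatible server (model name is an example)
vllm serve Qwen/Qwen2.5-7B-Instruct

# In another shell: hit the standard chat-completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```

Because the server speaks the OpenAI wire format, existing OpenAI client SDKs can target it unchanged by pointing their base URL at the vLLM host.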
Editor's verdict
The default choice for self-hosting open-weight LLMs at scale — when you need throughput, multi-tenant batching, and broad model coverage, vLLM is hard to beat. Pick TGI if you're deeply in the HF ecosystem, SGLang for complex structured/agentic workloads with prefix caching, or TensorRT-LLM if you've committed to NVIDIA and need every last token/sec. Skip vLLM for single-user laptop inference (use llama.cpp/Ollama) or sub-7B models on CPU. Release cadence is fast, so pin versions in production.