vllm-project/vllm · Python

High-throughput LLM inference server with PagedAttention, continuous batching, and OpenAI-compatible API.

GitHub stats

Stars: 78,604
Forks: 16,254
Watchers: 535
Open issues: 4,648

Meta

License: Apache-2.0
Primary language: Python
Last commit: 2026-04-29
Stats fetched: 2026-04-29

vLLM is a production-grade inference and serving engine for open-weight LLMs, built around PagedAttention for efficient KV-cache management and continuous batching to maximize GPU utilization. It exposes an OpenAI-compatible HTTP server, so you can swap it in behind existing client SDKs. It supports Llama, Qwen, DeepSeek, and MoE models, multi-LoRA serving, tensor/pipeline parallelism, and quantization (AWQ/GPTQ/FP8), and it runs on NVIDIA GPUs, AMD ROCm, and TPUs. Install via `pip install vllm` or the official Docker image.
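As a quick illustration of the OpenAI-compatible workflow, here is a minimal sketch: start the server with `vllm serve`, then point the standard `openai` Python client at it. The model name below is only an example, and exact flags and defaults vary by vLLM release, so check the docs for the version you pin.

```python
# Serve an open-weight model (example model id; any supported Hugging Face model works):
#   pip install vllm
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
#
# Then query it with the stock OpenAI client, pointed at the local server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no key required unless the server sets --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the OpenAI API, existing tooling built on that SDK typically needs only the `base_url` change to run against a self-hosted vLLM instance.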

Editor's verdict

The default choice for self-hosting open-weight LLMs at scale — when you need throughput, multi-tenant batching, and broad model coverage, vLLM is hard to beat. Pick TGI if you're deeply in the HF ecosystem, SGLang for complex structured/agentic workloads with prefix caching, or TensorRT-LLM if you've committed to NVIDIA and need every last token/sec. Skip vLLM for single-user laptop inference (use llama.cpp/Ollama) or sub-7B models on CPU. Release cadence is fast, so pin versions in production.

Last updated: 2026-04-29
