LLM Deploy
vllm-project/vllm
High-throughput LLM inference server with PagedAttention, continuous batching, and OpenAI-compatible API.
GitHub stats
- Stars: 78,604
- Forks: 16,254
- Watchers: 535
- Open issues: 4,648
Meta
- License: Apache-2.0
- Primary language: Python
- Last commit: 2026-04-29
- Stats fetched at: 2026-04-29
vLLM is a production-grade inference and serving engine for open-weight LLMs, built around PagedAttention for efficient KV cache management plus continuous batching to maximize GPU utilization. It exposes an OpenAI-compatible HTTP server, so you can swap it in behind existing client SDKs. It supports Llama, Qwen, DeepSeek, MoE models, multi-LoRA, tensor/pipeline parallelism, and quantization (AWQ/GPTQ/FP8), and runs on NVIDIA GPUs, AMD ROCm, and TPUs. Install via `pip install vllm` or the official Docker image.
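A minimal sketch of the workflow described above, launching the OpenAI-compatible server and querying it over HTTP. The model name is illustrative (any supported checkpoint works), and port 8000 is vLLM's default:

```shell
# Install vLLM (pin the version in production; see verdict below)
pip install vllm

# Launch the OpenAI-compatible server (model name is an example)
vllm serve Qwen/Qwen2.5-7B-Instruct

# In another shell: hit the standard chat-completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```

Because the server speaks the OpenAI wire format, existing OpenAI client SDKs can target it unchanged by pointing their base URL at the vLLM host.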
Editor's verdict
The default choice for self-hosting open-weight LLMs at scale — when you need throughput, multi-tenant batching, and broad model coverage, vLLM is hard to beat. Pick TGI if you're deeply in the HF ecosystem, SGLang for complex structured/agentic workloads with prefix caching, or TensorRT-LLM if you've committed to NVIDIA and need every last token/sec. Skip vLLM for single-user laptop inference (use llama.cpp/Ollama) or sub-7B models on CPU. Release cadence is fast, so pin versions in production.