LLM Deploy
ollama/ollama
One-command local runner for open-weight LLMs, with auto GPU offload and an OpenAI-compatible API.
GitHub stats
- Stars: 170,328
- Forks: 15,878
- Watchers: 959
- Open issues: 3,105
meta
- License: MIT
- Primary language: Go
- Last commit: 2026-04-29
- Stats fetched at: 2026-04-29
Ollama wraps llama.cpp (and now its own engine) into a Docker-style CLI: `ollama pull qwen3`, then `ollama run qwen3`, and you're chatting. It handles GGUF model downloads, GPU/CPU offload, and quantization variants, and exposes a local REST + OpenAI-compatible API on port 11434. Modelfiles let you bake system prompts, parameters, and adapters into a reusable tag. Works on macOS (Metal), Linux, and Windows.
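As a minimal sketch of what that local endpoint buys you, here's a Go program hitting the OpenAI-compatible chat route on port 11434. It assumes `ollama serve` is already running and that a model tagged `qwen3` has been pulled; the request and response structs mirror the standard OpenAI chat-completion shape rather than any Ollama-specific client library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes for the OpenAI-compatible chat route.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	// Assumes `ollama serve` is running locally and `qwen3` has been pulled.
	body, err := json.Marshal(chatRequest{
		Model: "qwen3",
		Messages: []chatMessage{
			{Role: "user", Content: "Explain GPU offload in one sentence."},
		},
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:11434/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Choices[0].Message.Content)
}
```

Because the route is OpenAI-compatible, the same program works against any other OpenAI-style backend by changing only the base URL and model tag, which is exactly why Ollama slots in so easily behind tools like Continue or Open WebUI.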
Editor's verdict
Pick Ollama when you want the fastest path from "I have a Mac/PC" to "I have a working local LLM endpoint" — it's the default for local dev, demos, and Cursor/Continue/Open WebUI backends. Don't use it for production multi-tenant serving: throughput and batching lose badly to vLLM or SGLang, and you pay a tax for the GGUF-only ecosystem. If you need raw tokens/sec on an H100, go vLLM; if you need portability and zero-config, stay here.