LLM Deploy
ollama/ollama
One-command local runner for open-weight LLMs, with auto GPU offload and an OpenAI-compatible API.
GitHub stats
- Stars: 170,328
- Forks: 15,878
- Watchers: 959
- Open issues: 3,105
meta
- License: MIT
- Primary language: Go
- Last commit: 2026-04-29
- Stats fetched at: 2026-04-29
Ollama wraps llama.cpp (and now its own engine) into a Docker-style CLI: `ollama pull qwen3`, then `ollama run qwen3`, and you're chatting. It handles GGUF model downloads, GPU/CPU offload, and quantization variants, and exposes a local REST + OpenAI-compatible API on port 11434. Modelfiles let you bake system prompts, parameters, and adapters into a reusable tag. Works on macOS (Metal), Linux, and Windows.
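As a minimal sketch of what that local endpoint buys you, here's a Go program hitting the OpenAI-compatible chat route on port 11434. It assumes `ollama serve` is already running and that a model tagged `qwen3` has been pulled; the request and response structs mirror the standard OpenAI chat-completion shape rather than any Ollama-specific client library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes for the OpenAI-compatible chat route.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	// Assumes `ollama serve` is running locally and `qwen3` has been pulled.
	body, err := json.Marshal(chatRequest{
		Model: "qwen3",
		Messages: []chatMessage{
			{Role: "user", Content: "Explain GPU offload in one sentence."},
		},
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:11434/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Choices[0].Message.Content)
}
```

Because the route is OpenAI-compatible, the same program works against any other OpenAI-style backend by changing only the base URL and model tag, which is exactly why Ollama slots in so easily behind tools like Continue or Open WebUI.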
Editor's verdict
Pick Ollama when you want the fastest path from "I have a Mac/PC" to "I have a working local LLM endpoint" — it's the default for local dev, demos, and Cursor/Continue/Open WebUI backends. Don't use it for production multi-tenant serving: throughput and batching lose badly to vLLM or SGLang, and you pay a tax for the GGUF-only ecosystem. If you need raw tokens/sec on an H100, go vLLM; if you need portability and zero-config, stay here.