
How to self-host an LLM stack on a single GPU box in 2026

vLLM, Ollama, LM Studio, LocalAI — pick the right tool depending on whether this is a hobby, a side project, or a production workload.

Self-hosting an LLM in 2026 is genuinely accessible. A single RTX 4090 runs a 4-bit 70B with some CPU offload; an 80GB H100 runs the same model fully in VRAM at FP8. The hardware is the easy part — the software stack you pick determines whether self-hosting is fun, productive, or a full-time job.

Hobby tier: Ollama

For evening tinkering and learning, Ollama is the right starting point. One command (ollama run llama3.1) downloads and runs any popular model. It handles GGUF quantization, GPU detection, model storage, and an OpenAI-compatible API automatically. The DX is genuinely good.
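
Because the API is OpenAI-compatible, existing client code mostly just works. A minimal sketch, assuming Ollama's default port 11434 and the official openai Python package (the api_key value is a required dummy):

    # Point the standard OpenAI client at the local Ollama server.
    # Assumes `ollama serve` is running and llama3.1 has been pulled.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    )
    print(resp.choices[0].message.content)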

Use Ollama when: you're learning, you're prototyping, you have a side project that doesn't need to scale beyond your own machine, you want to play with multiple models without commitment.

Weakness: not built for high throughput. Concurrent requests serialize. No batching. The OpenAI-compatible API is incomplete. Don't put it in front of real users.

Hobby tier with GUI: LM Studio

LM Studio is Ollama's GUI cousin — a Mac/Windows/Linux desktop app that does everything Ollama does plus model search, chat UI, and a server you can point your scripts at. Easier for non-engineers, slightly less hackable than Ollama for engineers.

Use LM Studio when: you want a chat UI on your local model, you want to download models without wrestling with Hugging Face, you want a friend or non-engineer collaborator to use a local model.

Side-project tier: text-generation-webui, KoboldCpp, llama.cpp directly

More control than Ollama, less polish. text-generation-webui (oobabooga) is the kitchen sink — supports more model formats, finer sampling controls, role-playing optimizations. KoboldCpp is the storytelling/RP-focused fork. llama.cpp itself is what most of these wrap.

Use these when: you want to tweak sampling parameters, you're doing roleplay/character chat as a hobby, you specifically need GGUF support, you want a project to learn from.
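
If you'd rather skip the wrappers entirely, the llama-cpp-python bindings expose the same sampling knobs these UIs surface. A minimal sketch; the model path is a placeholder for whatever GGUF file you've downloaded:

    # Load a local GGUF model with llama-cpp-python and sample from it.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-3.1-8b-instruct.Q5_K_M.gguf",  # placeholder path
        n_ctx=8192,       # context window
        n_gpu_layers=-1,  # offload all layers to GPU; 0 = CPU only
    )

    out = llm(
        "Once upon a time,",
        max_tokens=128,
        temperature=0.8,     # the sampling controls the webuis expose
        top_p=0.95,          # map onto these same parameters
        repeat_penalty=1.1,
    )
    print(out["choices"][0]["text"])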

Production tier: vLLM

vLLM is what serious deployments use. Continuous batching, PagedAttention, tensor parallelism across multiple GPUs, FP16 / INT8 / INT4 quantization, OpenAI-compatible API. Throughput is dramatically higher than with Ollama or text-gen-webui.

Use vLLM when: you're putting a model behind a real product, you have multiple concurrent users, you need predictable latency, you have at least one engineer who can read documentation carefully.

Weakness: not as plug-and-play. You'll wrestle with GPU memory allocation, model loading, and config the first time. The reward is a server that handles 100+ concurrent requests without breaking a sweat.
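
As a rough sketch of what that config wrestling looks like, here is vLLM's offline Python API with the memory and batching knobs you'll meet first (the model name and numbers are illustrative; the OpenAI-compatible server, started with vllm serve, takes the same options as flags):

    # vLLM offline inference: the same engine that backs the API server.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
        max_model_len=16384,          # cap context to bound KV-cache size
        max_num_seqs=64,              # max requests batched concurrently
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    # Continuous batching: prompts are scheduled together, not one by one.
    outputs = llm.generate(["Hello!", "What is PagedAttention?"], params)
    for o in outputs:
        print(o.outputs[0].text)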

Production tier alternatives: TGI, llama.cpp server, Modal

  • Hugging Face TGI (Text Generation Inference) — vLLM's main competitor, also good. Slightly different optimization choices. Pick whichever has better support for the specific models you use.
  • llama.cpp server — for CPU-only or modest GPU deployments. Slower than vLLM but works on hardware that vLLM rejects.
  • Modal / RunPod / Together — managed self-hosting. You don't run the GPU; you run code that runs on theirs. Great middle ground when you want self-hosting flexibility without owning hardware.
  • SGLang — newer competitor to vLLM, sometimes faster on specific workloads. Worth comparing if you're throughput-bound.

Hardware: what to actually buy

For a single-machine setup in 2026 (rough sizing math follows the list):

  • RTX 4090 (24GB) — runs 13B unquantized and every 7B/8B model freely; a 70B at Q4 only fits with partial CPU offload, which works but is slow. ~$1800 used.
  • RTX 5090 (32GB) — more headroom than the 4090: 30B-class models fit at Q5/Q6 entirely in VRAM, and a 70B needs less offload. ~$2500-3000 retail.
  • 2× RTX 4090 — 48GB combined fits a 70B at Q4 entirely in VRAM via tensor parallelism, at higher quality than a single card can manage. ~$3600 used + a PSU that can feed both.
  • A100 / H100 — overkill for solo, makes sense for small teams. Used A100 80GB ~$8-12k.
  • Mac Studio M3 Ultra — surprisingly capable thanks to unified memory. 192GB of shared memory fits a 70B unquantized (FP16 weights alone are ~140GB). Slow per token, but no other 2026 consumer desktop runs a model that size.
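
The sizing math behind those claims is simple, as a sketch (bits-per-weight figures are approximations; KV cache and runtime overhead come on top):

    # Approximate weight size: parameter count x bits per weight / 8.
    PARAMS = 70e9  # 70B model
    BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

    for name, bpw in BPW.items():
        gb = PARAMS * bpw / 8 / 1e9
        print(f"{name:7s} ~{gb:4.0f} GB")
    # FP16 ~140 GB, Q8_0 ~74 GB, Q6_K ~58 GB, Q5_K_M ~50 GB, Q4_K_M ~42 GB:
    # why a 70B Q4 overflows one 24 GB card but fits across two.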

For cloud GPU (no hardware purchase): Lambda Labs, RunPod, Together, or Modal are all solid. Per-hour pricing means you pay only when you use it, but a 24/7 server makes owned hardware cheaper after about 6 months.
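
The six-month figure is easy to sanity-check. A sketch with assumed prices (check current rates; both markets move):

    # Rent-vs-buy break-even for an always-on 4090-class server.
    hardware_cost = 1800         # used RTX 4090, USD (assumed)
    power_per_month = 40         # rough electricity estimate, USD
    cloud_rate_per_hour = 0.45   # assumed on-demand 4090-class rate

    cloud_per_month = cloud_rate_per_hour * 24 * 30   # ~$324 for 24/7
    months = hardware_cost / (cloud_per_month - power_per_month)
    print(f"break-even after ~{months:.1f} months")   # ~6.3 months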

Quantization choices

GGUF formats (Q4_K_M, Q5_K_M, Q6_K, Q8_0) trade quality for size. The rough rule:

  • Q8_0: nearly lossless, 50% size reduction
  • Q6_K: very small quality loss, 60% size reduction
  • Q5_K_M: noticeable but minor quality loss, 65% reduction
  • Q4_K_M: meaningful quality loss but acceptable, 70% reduction
  • Below Q4: significant degradation, only for desperately resource-constrained use

For production self-hosting, Q5_K_M or Q6_K is the sweet spot. AWQ and GPTQ are alternative quantization formats vLLM supports — different trade-offs but similar end results.
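
In practice, picking a quant means picking a file. A sketch using the huggingface_hub client; the repo and filename are placeholders for whichever GGUF build you're after:

    # Download one specific GGUF quant instead of a whole repo.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder repo
        filename="Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",    # pick your quant
    )
    print(path)  # local cache path, ready to hand to llama.cpp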

When NOT to self-host

Spending $300/month on the Anthropic API and considering self-hosting? Don't. The hardware is $2000+, electricity is real, and your time is real. Stay on the API.

Do you have ops capacity? If your team is one engineer who is also responsible for the product, don't add a self-hosted GPU server to the mix. The day it goes down at 2am is not the day to be learning vLLM internals.

Does the model work for your use case? Run experiments on the API first. If Claude or GPT works for the product, switching to self-hosted Llama means accepting some quality drop, often more than you expect. Don't move to self-hosted as your first cost optimization.

A practical setup recipe

For a small production deployment:

  • Hardware: 1× H100 (rented or owned), 64GB system RAM, NVMe SSD
  • OS: Ubuntu 22.04 LTS
  • Inference: vLLM with --enable-prefix-caching and --max-model-len 16384
  • Model: Llama 3.1 70B Instruct in FP8 quantization
  • Reverse proxy: Caddy or nginx for TLS termination
  • Monitoring: Prometheus + Grafana for token throughput and GPU utilization
  • Logging: structured JSON logs to whatever you already use
  • Authentication: simple API keys or OAuth — don't expose unauthenticated

This stack handles 50-100 concurrent users comfortably for typical chat-style workloads.
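
Before calling it done, load-test that claim against your own workload. A minimal sketch using the async OpenAI client; URL, key, and model name are placeholders for your deployment:

    # Fire N concurrent chat requests at the server and report latencies.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="https://llm.example.com/v1",  # placeholder
                         api_key="YOUR_KEY")                     # placeholder

    async def one_request(i: int) -> float:
        start = time.perf_counter()
        await client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",
            messages=[{"role": "user", "content": f"Ping {i}: one word."}],
            max_tokens=16,
        )
        return time.perf_counter() - start

    async def main(n: int = 50) -> None:
        latencies = sorted(await asyncio.gather(*(one_request(i) for i in range(n))))
        print(f"{n} concurrent: p50={latencies[n // 2]:.2f}s max={latencies[-1]:.2f}s")

    asyncio.run(main())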

Decision tree

  • Hobby, single user: Ollama or LM Studio
  • Side project, want to tinker: text-gen-webui or llama.cpp
  • Real product, real users, owned hardware: vLLM + owned GPU
  • Real product, no hardware budget: Modal, Together, or RunPod
  • Mac-only, large model: Ollama + Mac Studio M3 Ultra
  • CPU only: llama.cpp server

Next steps

  • Read about specific quantization formats: GGUF Q4 vs Q5 vs Q6
  • Look into LoRA inference for serving fine-tuned variants of base models
  • Read about vLLM-specific tuning: max_num_seqs, gpu_memory_utilization
  • Set up monitoring early — you'll want it before something breaks

Last updated: 2026-04-29
