
How to self-host an LLM stack on a single GPU box in 2026

vLLM, Ollama, LM Studio, LocalAI — pick the right tool depending on whether this is a hobby, a side project, or a production workload.

Self-hosting an LLM in 2026 is genuinely accessible. A single RTX 4090 runs a 4-bit 70B with some CPU offload; an 80GB H100 runs the same model fully in VRAM at FP8. The hardware is the easy part — the software stack you pick determines whether self-hosting is fun, productive, or a full-time job.

Hobby tier: Ollama

For evening tinkering and learning, Ollama is the right starting point. One command (ollama run llama3.1) downloads and runs any popular model. It handles GGUF quantization, GPU detection, model storage, and an OpenAI-compatible API automatically. The DX is genuinely good.
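
Because the API is OpenAI-compatible, existing client code mostly just works. A minimal sketch, assuming Ollama's default port 11434 and the official openai Python package (the api_key value is a required dummy):

    # Point the standard OpenAI client at the local Ollama server.
    # Assumes `ollama serve` is running and llama3.1 has been pulled.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    )
    print(resp.choices[0].message.content)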

Use Ollama when: you're learning, you're prototyping, you have a side project that doesn't need to scale beyond your own machine, you want to play with multiple models without commitment.

Weakness: not built for high throughput. Concurrent requests serialize. No batching. The OpenAI-compatible API is incomplete. Don't put it in front of real users.

Hobby tier with GUI: LM Studio

LM Studio is Ollama's GUI cousin — a Mac/Windows/Linux desktop app that does everything Ollama does plus model search, chat UI, and a server you can point your scripts at. Easier for non-engineers, slightly less hackable than Ollama for engineers.

Use LM Studio when: you want a chat UI on your local model, you want to download models without wrestling with Hugging Face, you want a friend or non-engineer collaborator to use a local model.

Side-project tier: text-generation-webui, KoboldCpp, llama.cpp directly

More control than Ollama, less polish. text-generation-webui (oobabooga) is the kitchen sink — supports more model formats, finer sampling controls, role-playing optimizations. KoboldCpp is the storytelling/RP-focused fork. llama.cpp itself is what most of these wrap.

Use these when: you want to tweak sampling parameters, you're doing roleplay/character chat as a hobby, you specifically need GGUF support, you want a project to learn from.
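
If you'd rather skip the wrappers entirely, the llama-cpp-python bindings expose the same sampling knobs these UIs surface. A minimal sketch; the model path is a placeholder for whatever GGUF file you've downloaded:

    # Load a local GGUF model with llama-cpp-python and sample from it.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-3.1-8b-instruct.Q5_K_M.gguf",  # placeholder path
        n_ctx=8192,       # context window
        n_gpu_layers=-1,  # offload all layers to GPU; 0 = CPU only
    )

    out = llm(
        "Once upon a time,",
        max_tokens=128,
        temperature=0.8,     # the sampling controls the webuis expose
        top_p=0.95,          # map onto these same parameters
        repeat_penalty=1.1,
    )
    print(out["choices"][0]["text"])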

Production tier: vLLM

vLLM is what serious deployments use. Continuous batching, PagedAttention, tensor parallelism across multiple GPUs, FP16 / INT8 / INT4 quantization, OpenAI-compatible API. Throughput is dramatically higher than with Ollama or text-gen-webui.

Use vLLM when: you're putting a model behind a real product, you have multiple concurrent users, you need predictable latency, you have at least one engineer who can read documentation carefully.

Weakness: not as plug-and-play. You'll wrestle with GPU memory allocation, model loading, and config the first time. The reward is a server that handles 100+ concurrent requests without breaking a sweat.
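
As a rough sketch of what that config wrestling looks like, here is vLLM's offline Python API with the memory and batching knobs you'll meet first (the model name and numbers are illustrative; the OpenAI-compatible server, started with vllm serve, takes the same options as flags):

    # vLLM offline inference: the same engine that backs the API server.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
        max_model_len=16384,          # cap context to bound KV-cache size
        max_num_seqs=64,              # max requests batched concurrently
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    # Continuous batching: prompts are scheduled together, not one by one.
    outputs = llm.generate(["Hello!", "What is PagedAttention?"], params)
    for o in outputs:
        print(o.outputs[0].text)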

Production tier alternatives: TGI, llama.cpp server, Modal

  • Hugging Face TGI (Text Generation Inference) — vLLM's main competitor, also good. Slightly different optimization choices. Pick whichever has better support for the specific models you use.
  • llama.cpp server — for CPU-only or modest GPU deployments. Slower than vLLM but works on hardware that vLLM rejects.
  • Modal / RunPod / Together — managed self-hosting. You don't run the GPU; you run code that runs on theirs. Great middle ground when you want self-hosting flexibility without owning hardware.
  • SGLang — newer competitor to vLLM, sometimes faster on specific workloads. Worth comparing if you're throughput-bound.

Hardware: what to actually buy

For a single-machine setup in 2026 (rough sizing math follows the list):

  • RTX 4090 (24GB) — runs 13B unquantized and every 7B/8B model freely; a 70B at Q4 only fits with partial CPU offload, which works but is slow. ~$1800 used.
  • RTX 5090 (32GB) — more headroom than the 4090: 30B-class models fit at Q5/Q6 entirely in VRAM, and a 70B needs less offload. ~$2500-3000 retail.
  • 2× RTX 4090 — 48GB combined fits a 70B at Q4 entirely in VRAM via tensor parallelism, at higher quality than a single card can manage. ~$3600 used + a PSU that can feed both.
  • A100 / H100 — overkill for solo, makes sense for small teams. Used A100 80GB ~$8-12k.
  • Mac Studio M3 Ultra — surprisingly capable thanks to unified memory. 192GB of shared memory fits a 70B unquantized (FP16 weights alone are ~140GB). Slow per token, but no other 2026 consumer desktop runs a model that size.
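
The sizing math behind those claims is simple, as a sketch (bits-per-weight figures are approximations; KV cache and runtime overhead come on top):

    # Approximate weight size: parameter count x bits per weight / 8.
    PARAMS = 70e9  # 70B model
    BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

    for name, bpw in BPW.items():
        gb = PARAMS * bpw / 8 / 1e9
        print(f"{name:7s} ~{gb:4.0f} GB")
    # FP16 ~140 GB, Q8_0 ~74 GB, Q6_K ~58 GB, Q5_K_M ~50 GB, Q4_K_M ~42 GB:
    # why a 70B Q4 overflows one 24 GB card but fits across two.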

For cloud GPU (no hardware purchase): Lambda Labs, RunPod, Together, or Modal are all solid. Per-hour pricing means you pay only when you use it, but a 24/7 server makes owned hardware cheaper after about 6 months.
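
The six-month figure is easy to sanity-check. A sketch with assumed prices (check current rates; both markets move):

    # Rent-vs-buy break-even for an always-on 4090-class server.
    hardware_cost = 1800         # used RTX 4090, USD (assumed)
    power_per_month = 40         # rough electricity estimate, USD
    cloud_rate_per_hour = 0.45   # assumed on-demand 4090-class rate

    cloud_per_month = cloud_rate_per_hour * 24 * 30   # ~$324 for 24/7
    months = hardware_cost / (cloud_per_month - power_per_month)
    print(f"break-even after ~{months:.1f} months")   # ~6.3 months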

Quantization choices

GGUF formats (Q4_K_M, Q5_K_M, Q6_K, Q8_0) trade quality for size. The rough rule:

  • Q8_0: nearly lossless, 50% size reduction
  • Q6_K: very small quality loss, 60% size reduction
  • Q5_K_M: noticeable but minor quality loss, 65% reduction
  • Q4_K_M: meaningful quality loss but acceptable, 70% reduction
  • Below Q4: significant degradation, only for desperately resource-constrained use

For production self-hosting, Q5_K_M or Q6_K is the sweet spot. AWQ and GPTQ are alternative quantization formats vLLM supports — different trade-offs but similar end results.
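
In practice, picking a quant means picking a file. A sketch using the huggingface_hub client; the repo and filename are placeholders for whichever GGUF build you're after:

    # Download one specific GGUF quant instead of a whole repo.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder repo
        filename="Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",    # pick your quant
    )
    print(path)  # local cache path, ready to hand to llama.cpp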

When NOT to self-host

Spending $300/month on the Anthropic API and considering self-hosting? Don't. The hardware is $2000+, electricity is real, and your time is real. Stay on the API.

Do you have ops capacity? If your team is one engineer who is also responsible for the product, don't add a self-hosted GPU server to the mix. The day it goes down at 2am is not the day to be learning vLLM internals.

Does the model work for your use case? Run experiments on the API first. If Claude or GPT works for the product, switching to self-hosted Llama means accepting some quality drop, often more than you expect. Don't move to self-hosted as your first cost optimization.

A practical setup recipe

For a small production deployment:

  • Hardware: 1× H100 (rented or owned), 64GB system RAM, NVMe SSD
  • OS: Ubuntu 22.04 LTS
  • Inference: vLLM with --enable-prefix-caching and --max-model-len 16384
  • Model: Llama 3.1 70B Instruct in FP8 quantization
  • Reverse proxy: Caddy or nginx for TLS termination
  • Monitoring: Prometheus + Grafana for token throughput and GPU utilization
  • Logging: structured JSON logs to whatever you already use
  • Authentication: simple API keys or OAuth — don't expose unauthenticated

This stack handles 50-100 concurrent users comfortably for typical chat-style workloads.
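
Before calling it done, load-test that claim against your own workload. A minimal sketch using the async OpenAI client; URL, key, and model name are placeholders for your deployment:

    # Fire N concurrent chat requests at the server and report latencies.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="https://llm.example.com/v1",  # placeholder
                         api_key="YOUR_KEY")                     # placeholder

    async def one_request(i: int) -> float:
        start = time.perf_counter()
        await client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",
            messages=[{"role": "user", "content": f"Ping {i}: one word."}],
            max_tokens=16,
        )
        return time.perf_counter() - start

    async def main(n: int = 50) -> None:
        latencies = sorted(await asyncio.gather(*(one_request(i) for i in range(n))))
        print(f"{n} concurrent: p50={latencies[n // 2]:.2f}s max={latencies[-1]:.2f}s")

    asyncio.run(main())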

Decision tree

  • Hobby, single user: Ollama or LM Studio
  • Side project, want to tinker: text-gen-webui or llama.cpp
  • Real product, real users, owned hardware: vLLM + owned GPU
  • Real product, no hardware budget: Modal, Together, or RunPod
  • Mac-only, large model: Ollama + Mac Studio M3 Ultra
  • CPU only: llama.cpp server

Next steps

  • Read about specific quantization formats: GGUF Q4 vs Q5 vs Q6
  • Look into LoRA inference for serving fine-tuned variants of base models
  • Read about vLLM-specific tuning: max_num_seqs, gpu_memory_utilization
  • Set up monitoring early — you'll want it before something breaks

Last updated: 2026-04-29
