If you've ever poked around with running an open-source LLM locally, you've seen filenames like llama-3-70b-Q4_K_M.gguf and wondered what the Q4 part means. It can be the difference between needing $100k of data-center GPUs and running the same model on $2,000 of used consumer cards. Quantization is the technique that makes local LLM use practical for non-millionaires. Here's what it actually does and where it breaks.
What's stored in a model
A modern LLM is hundreds of millions to hundreds of billions of numbers, called parameters or weights. Each one is, by default, a 16-bit or 32-bit floating point number. A 70B parameter model in 16-bit takes 140GB of memory just to load (2 bytes × 70B). That's a problem because:
- A consumer GPU like the RTX 4090 has 24GB of VRAM
- A pro GPU like the H100 has 80GB
- You'd need 4-8 H100s to run an unquantized 70B model, which is $100k+ of hardware
The insight: those 16 or 32 bits per number are mostly precision that doesn't matter for LLM inference. You can replace each number with a much smaller approximation, and the model still mostly works.
What quantization actually does
Quantization is the process of reducing the bit-width of model weights. The most common levels:
- FP16 / BF16 (16-bit) — the standard 'unquantized' format. Two bytes per parameter.
- INT8 (8-bit) — half the size, ~95-99% of original quality on most tasks.
- INT4 / Q4 (4-bit) — quarter the size, 90-97% of quality with modern techniques.
- Q3 / Q2 — a fifth to an eighth of the size, noticeable quality drop, only useful for very large models or very tight memory constraints.
For a 70B model:
- BF16: 140GB
- INT8: 70GB
- Q4_K_M: ~40GB
- Q3_K_M: ~32GB
40GB fits in two RTX 3090s ($1,500-2,000 used), or one RTX 6000 Ada ($7,000), instead of needing $100k of H100s for the BF16 version.
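The sizes in that table are just parameters × effective bits per weight. A quick sanity check in Python — the bits-per-weight figures for the K-quants are my approximations (they fold in per-block scale overhead, and are chosen here to match the sizes quoted above; real GGUF files vary slightly):

```python
# Back-of-envelope model memory, weights only.
PARAMS = 70e9  # 70B parameters

# Effective bits per weight. BF16/INT8 are exact; the K-quant
# figures are approximate and include per-block scale overhead.
effective_bits = {
    "BF16": 16.0,
    "INT8": 8.0,
    "Q4_K_M": 4.6,  # approximate
    "Q3_K_M": 3.7,  # approximate
}

for name, bits in effective_bits.items():
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name:8s} ~{gigabytes:.0f} GB")
```

Note this counts weights only; you need extra VRAM on top for the KV cache, which grows with context length.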
How it works (without the math)
Naive quantization is: 'find the min and max values in this set of numbers, then replace each with the nearest of 16 (for 4-bit) evenly-spaced values.' This is fast and lossy.
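Here's the naive scheme as a minimal NumPy sketch (illustrative only; real formats pack the 4-bit codes two to a byte, which is skipped here):

```python
import numpy as np

def naive_quantize_4bit(w):
    """Map every weight to the nearest of 16 evenly spaced values
    between the tensor's min and max."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15  # 16 levels -> 15 steps between them
    codes = np.round((w - lo) / scale).astype(np.uint8)  # integer codes 0..15
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return lo + codes.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
codes, lo, scale = naive_quantize_4bit(w)
print("mean abs error:", np.abs(w - dequantize(codes, lo, scale)).mean())
```

One outlier weight stretches the min-max range and wastes most of the 16 levels on values that never occur, which is exactly what the cleverer methods below fix.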
Modern quantization is much cleverer. Techniques like GPTQ, AWQ, and GGUF Q-types (Q4_K_M, Q4_K_S, Q5_K_M etc.) do things like:
- Quantize different parts of the model with different precisions, keeping the most sensitive layers higher-bit
- Use 'mixed precision' within a tensor — split the weights into small groups and quantize each group separately
- Apply scaling factors per group so the dynamic range is preserved better (sketched in code after this list)
- Calibrate quantization on a small sample of real data to minimize error
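The per-group scaling idea is the easiest of these to show in code. A sketch extending the naive version above (the group size of 32 is a typical block size, but real k-quants use fancier layouts with superblocks):

```python
import numpy as np

def groupwise_quantize_4bit(w, group_size=32):
    """4-bit quantization with a separate min and scale per group,
    so one outlier only distorts its own group of 32 weights."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - lo) / 15
    scale = np.maximum(scale, 1e-12)  # guard against all-equal groups
    codes = np.round((groups - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return (lo + codes.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
codes, lo, scale = groupwise_quantize_4bit(w)
err = np.abs(w - dequantize(codes, lo, scale)).mean()
print("group-wise mean abs error:", err)  # lower than the whole-tensor version
```

The cost is storing one (min, scale) pair per group, which is why a 'Q4' file really averages ~4.5-5 bits per weight rather than exactly 4.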
The result is that Q4_K_M quantization typically loses only 1-3 percentage points on benchmarks vs. the original BF16 model. For chat use, you often can't tell the difference. For code or reasoning at the edge of model capability, you sometimes can.
What the suffixes mean
When you see Q4_K_M and Q5_K_S and Q4_0, here's the rough decoder ring for the GGUF format that llama.cpp uses:
- Q4 / Q5 / Q6 / Q8 — the bit-width: 4, 5, 6, 8 bits
- K — modern 'k-quants' that use mixed-precision blocks (much higher quality than older quants at the same size)
- S / M / L — small / medium / large variants of k-quants. Larger variants cost slightly more memory and preserve slightly more quality.
- 0 / 1 — older legacy quant types (Q4_0, Q4_1). Avoid for new use; K-quants are strictly better.
Rules of thumb: Q4_K_M is the best general-purpose 4-bit. Q5_K_M is a slight quality bump for ~25% more size. Q8_0 is essentially indistinguishable from BF16 for most tasks. Q3 and below are last-resort for memory-constrained setups.
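If it helps to see the naming scheme mechanically, here's a toy decoder for these suffixes (a reading aid for the list above, not any official llama.cpp API):

```python
def describe_quant(name):
    """Decode a GGUF quant label like 'Q4_K_M' into plain English."""
    parts = name.split("_")            # "Q4_K_M" -> ["Q4", "K", "M"]
    bits = parts[0][1:]                # digits after the leading Q
    kind = "k-quant" if "K" in parts else "legacy quant"
    size = {"S": "small", "M": "medium", "L": "large"}.get(parts[-1], "")
    return f"{bits}-bit {kind} {size}".strip()

for label in ["Q4_K_M", "Q5_K_S", "Q8_0", "Q4_0"]:
    print(label, "->", describe_quant(label))
```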
When quantization works well vs. when it breaks
Works well:
- Models 30B+ tend to handle quantization gracefully — there's enough redundancy in the weights
- Standard chat, summarization, translation, coding — quality loss is small
- Quantizing for inference (the typical case) is much more forgiving than quantizing for training
Breaks down:
- Tiny models (under 7B) often suffer noticeably from aggressive quantization. Q4 of a 1.5B model can be much worse than Q8 of the same model.
- Very long context generation — quantization errors can compound over thousands of generated tokens
- Tasks at the edge of the model's capability (hard math, novel reasoning) — losing 2-3% accuracy can mean failing where the unquantized model succeeded
- MoE models — see the MoE article for why this is harder
- Vision and audio models — image tokens are sometimes more quantization-sensitive than text
What about training in low precision?
Different topic with overlapping terminology. 'Mixed precision training' (BF16 + FP32) has been standard for years. 'FP8 training' is newer: DeepSeek-V3 was trained largely in FP8 to dramatically cut training cost, while Llama 3.1 405B was trained in BF16 and only served in FP8. INT8 / INT4 training mostly doesn't work — quality suffers too much. The 'INT4 quantization' you see is almost always post-training quantization, not training in INT4 from scratch.
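To make the distinction concrete, here's what mixed-precision training looks like in PyTorch (a toy layer and loss, not a real training loop): the forward math runs in BF16 under autocast while the master weights and optimizer state stay FP32. Post-training quantization, by contrast, happens after all of this is finished.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # weights stored in FP32
optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 4096, device="cuda")

# Forward-pass math runs in BF16; gradients and optimizer updates
# still flow through the FP32 master weights.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
optimizer.step()
```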
When NOT to quantize
For production API serving where every percentage point of quality matters and you have the GPUs, run BF16. Frontier labs do this internally even when they offer quantized versions externally.
For research where you're measuring small effects, quantization can confound results. If you're comparing models, compare at the same precision level.
For extremely small models (under 3B) targeting edge devices — sometimes a smaller-but-unquantized model is better than a larger-but-Q4 model. Test both.
For mission-critical reasoning (medical, legal, financial decisions) — even a 1-2% quality drop matters. Don't quantize.
Tools that handle this for you
- llama.cpp — the dominant inference engine for quantized models. Reads GGUF format. Runs on CPU, GPU, or Apple Silicon.
- Ollama — wrapper around llama.cpp that downloads and runs models with a friendly CLI. Defaults to Q4_K_M.
- LM Studio — desktop app, lets you compare quantization levels side by side.
- vLLM — production-grade inference server, supports AWQ and GPTQ for high-throughput serving.
- Hugging Face transformers + bitsandbytes — for load-time INT8/INT4 quantization in Python.
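For that last item, here's roughly what load-time 4-bit quantization looks like with transformers + bitsandbytes (the model ID is a placeholder; swap in whatever fits your VRAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize to 4-bit at load time; the checkpoint on disk stays FP16/BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the common choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model ID
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```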
Most users only need to know: pick Q4_K_M unless you have a reason not to.
Further reading
- How to pick a self-host stack — practical setup for running quantized models locally
- MoE explained — the other big architecture lever for inference cost
- Speculative decoding — yet another way to make inference faster