Technique

Quantization

Compressing a model by storing its weights in lower precision (8-bit, 4-bit, even 2-bit) instead of 16- or 32-bit floats, dramatically cutting memory use and often speeding up inference.

Quantization stores model weights using fewer bits per number. A 70B-parameter model in 16-bit (BF16) takes 140 GB; quantized to 4-bit it takes 35 GB and fits on a single high-end consumer GPU. Inference speed often improves too, because memory bandwidth, not compute, is usually the bottleneck.

It matters because hosting a frontier model in full precision requires serious hardware. Quantization is what lets you run Llama 3 70B on a Mac Studio, fit Qwen 2.5 32B on a 24 GB RTX 4090, or run a small model on a phone. For self-hosters and edge deployments, it's the difference between possible and impossible.

The trade-off is a quality drop that grows as precision shrinks. INT8 is essentially free, with no measurable quality loss on most models. 4-bit (GPTQ, AWQ, GGUF Q4) keeps most of the quality at roughly 4x compression over 16-bit. 2-bit and 1.58-bit are research-grade: they work for some models but break others.

A concrete example: Ollama and LM Studio default to Q4_K_M GGUF quantization, which stores most weights in 4 bits but keeps a few sensitive layers at 6 bits. Benchmark scores usually drop 1-3% versus the full-precision version, but you can fit a much bigger model on the same GPU.

Related: GGUF, AWQ, GPTQ, INT8, distillation.
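To make the mechanics concrete, here is a minimal sketch in Python of the simplest form of the idea: symmetric per-tensor INT8 quantization with NumPy. The function names and the toy tensor are illustrative, not from any library; real schemes like GPTQ, AWQ, and GGUF's K-quants add per-group scales, calibration data, and mixed precision, but the core round-to-a-grid step is the same.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: one scale maps the largest
    # magnitude weight to +/-127, everything else rounds to the grid.
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate floats; the rounding error here is the
    # quality cost that shows up as small benchmark drops.
    return q.astype(np.float32) * scale

# A toy weight matrix standing in for one tensor of a real model.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

print(f"float32: {w.nbytes / 1e6:.0f} MB, int8: {q.nbytes / 1e6:.0f} MB")  # 4x smaller
print(f"max round-trip error: {np.abs(w - dequantize_int8(q, scale)).max():.5f}")
```

The printed sizes show the 4x memory cut (67 MB to 17 MB for this tensor); the max round-trip error is the per-weight approximation that the quality numbers above absorb.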

Last updated: 2026-04-29
