
Technique

QLoRA

A fine-tuning technique that combines 4-bit quantization with LoRA, letting you fine-tune large models on a single consumer GPU.

QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning method that loads a pretrained LLM in 4-bit precision and trains only small LoRA adapter weights on top. It was introduced by Tim Dettmers and collaborators in a 2023 paper and quickly became the default approach for fine-tuning large open-source models on limited hardware.

It matters because full fine-tuning of a 65B-parameter model traditionally needed hundreds of gigabytes of GPU memory, out of reach for most developers. QLoRA squeezes the frozen base model into 4-bit using a custom data type called NF4 (NormalFloat 4-bit), then trains a tiny set of LoRA matrices in higher precision. The original paper showed you could fine-tune a 65B model on a single 48GB GPU and match the quality of 16-bit full fine-tuning.

Concretely: imagine you want to teach Llama or Mistral your company's writing style. Full fine-tuning would need an A100 cluster. With QLoRA via libraries like Hugging Face PEFT and bitsandbytes, you can do it overnight on a single RTX 4090 or a rented cloud GPU, and the resulting adapter file is often under 100MB.

Trade-offs: inference is slightly slower than running an unquantized model, and very aggressive quantization can hurt quality on some tasks. But for most domain adaptation and instruction-tuning workflows, QLoRA is the practical default.

Related: LoRA, PEFT, quantization, bitsandbytes, fine-tuning, NF4.
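To make the NF4 idea concrete, here is a minimal pure-Python sketch of NF4-style blockwise quantization. The 16 code values are the NF4 levels published with the QLoRA paper (quantiles of a standard normal distribution, normalized to [-1, 1]); the block size and helper names are illustrative, and this is a teaching sketch, not the optimized bitsandbytes kernel.

```python
# Sketch of NF4-style blockwise 4-bit quantization as used by QLoRA for
# the frozen base weights. Each block stores one float "absmax" scale plus
# a 4-bit index per weight into the fixed NF4 codebook below.

NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_nf4(weights, block_size=64):
    """Quantize a flat list of floats to (4-bit indices, per-block scales)."""
    indices, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # guard all-zero blocks
        scales.append(absmax)
        for w in block:
            normalized = w / absmax  # now in [-1, 1]
            # index of the nearest NF4 level; fits in 4 bits
            indices.append(min(range(16),
                               key=lambda i: abs(NF4_LEVELS[i] - normalized)))
    return indices, scales

def dequantize_nf4(indices, scales, block_size=64):
    """Reconstruct approximate floats from indices and per-block scales."""
    return [NF4_LEVELS[idx] * scales[pos // block_size]
            for pos, idx in enumerate(indices)]
```

A round trip through these two functions shows the key property: the per-weight error is bounded by half the gap between adjacent codebook levels, scaled by the block's absmax, and exact zeros survive quantization because 0.0 is one of the 16 levels.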
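The PEFT-plus-bitsandbytes workflow described above typically looks like the following configuration sketch. The model name, rank, and target modules are illustrative assumptions rather than values from this page, and running it requires a GPU and a model download, so treat it as a starting point.

```python
# Hedged sketch of a typical QLoRA setup with Hugging Face Transformers,
# PEFT, and bitsandbytes. Hyperparameters here are common defaults, not
# recommendations from this page.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # the NormalFloat 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",         # assumption: any causal LM works here
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters train
```

After training with any standard trainer, saving the model writes out only the adapter weights, which is why the resulting artifact is typically well under 100MB.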

Last updated: 2026-04-29
