Technique

LoRA (Low-Rank Adaptation)

A fine-tuning technique that adapts large models by training small low-rank matrices instead of updating all the original weights.

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models and diffusion models. Instead of updating all the billions of parameters in a base model, LoRA freezes the original weights and inserts small trainable matrices (the "low-rank" decomposition) into selected layers. Only these much smaller matrices get trained.

This matters because full fine-tuning of a model like Llama-70B requires hundreds of GB of GPU memory and produces a huge new checkpoint. A LoRA adapter for the same model might be only a few MB to a few hundred MB, trains in hours on a single GPU, and can be swapped in and out at inference time. It's the technique behind most community-trained Stable Diffusion styles and most custom-tuned open-source LLMs on Hugging Face.

The intuition: when you adapt a pretrained model to a new task, the *change* in weights tends to be low-dimensional, so you don't really need a full-rank update. Instead of learning a big weight delta ΔW, LoRA learns ΔW = B·A, where B and A are skinny matrices (ranks 8, 16, and 32 are typical). For a 4096×4096 weight matrix at rank 8, that's ~65K trainable parameters instead of ~16M.

You'll often see LoRA combined with quantization (QLoRA) to fine-tune very large models on consumer hardware. Multiple LoRAs can also be merged or stacked, which is popular in image generation for combining a character LoRA with a style LoRA.

Related: fine-tuning, PEFT (parameter-efficient fine-tuning), QLoRA, adapters, Hugging Face PEFT library.
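The ΔW = B·A decomposition and the parameter savings can be sketched in a few lines of NumPy. This is an illustrative toy, not any library's API; the dimensions (4096×4096, rank 8) and the scaling factor alpha are just example values, and the zero initialization of B follows the common convention that the adapter starts as a no-op:

```python
import numpy as np

# Illustrative LoRA sketch: example dimensions, not tied to any real model.
d, r = 4096, 8
alpha = 16.0  # scaling hyperparameter; the update is scaled by alpha / r
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)        # frozen base weight
A = (0.01 * rng.standard_normal((r, d))).astype(np.float32)  # trainable, r x d
B = np.zeros((d, r), dtype=np.float32)                    # trainable, d x r, init 0

x = rng.standard_normal(d).astype(np.float32)

# Forward pass: frozen base output plus the scaled low-rank update B·A·x.
y = W @ x + (alpha / r) * (B @ (A @ x))

# Parameter count: a full-rank delta vs. the two skinny factors.
full_params = d * d        # 4096 * 4096 = 16,777,216
lora_params = d * r + r * d  # 2 * 4096 * 8 = 65,536
print(full_params, lora_params)
```

Because B starts at zero, the adapted model is initially identical to the base model, and training only ever touches A and B; that is what makes the adapter small enough to ship and swap independently of the frozen weights.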

Last updated: 2026-04-29
