Two years ago, fine-tuning a 70B-parameter model at home was science fiction. In 2026, with QLoRA, FlashAttention-3, and consumer GPUs shipping 32GB of VRAM (96GB at the prosumer end), it's a doable weekend project, if you have the patience to learn the moving parts.
This post is the practical end-to-end. Hardware needed, dataset format, the actual hyperparameter values that work, and the failure modes that will eat your weekend if you're not warned.
The target: take Llama 3.3 70B Instruct and adapt it to your domain (legal, medical, customer support, whatever) using LoRA. You'll end up with a small adapter file (~250MB in the worked example below) that you can attach to the base model at inference time.
Hardware: what you actually need
The honest minimum for 70B QLoRA in 2026:
- GPU: 1× NVIDIA RTX 5090 (32GB) or 1× RTX 6000 Pro Blackwell (96GB). The 5090 handles QLoRA only, and only barely; expect CPU offloading and short sequences. The 6000 Pro is the comfortable option and runs full LoRA with longer sequences.
- RAM: 64GB system RAM minimum. 128GB makes things smoother during data loading.
- Disk: 250GB free for the base model checkpoint, your dataset, and intermediate checkpoints. NVMe strongly recommended.
- OS: Linux (Ubuntu 24.04 ideal). WSL2 works but with 10-20% perf hit. Windows native — don't bother.
If you don't have hardware: rent an A100 80GB on RunPod for ~$1.50/hr or an H100 for ~$2.50/hr. A 70B fine-tune typically takes 4-12 hours on H100. Total cost $10-30.
Dataset: format and size
The single biggest mistake first-time fine-tuners make: not enough data, or data of the wrong format.
Format. The 2026 standard is JSONL with a messages array, OpenAI-style:
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
One example per line. Tools like Axolotl and Unsloth read this directly.
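For concreteness, here's a minimal sketch that writes and sanity-checks a dataset in this format. The file name and the example content are illustrative:

    import json

    # One training example in the OpenAI-style messages format.
    # The content here is purely illustrative.
    example = {
        "messages": [
            {"role": "system", "content": "You are a contracts paralegal assistant."},
            {"role": "user", "content": "Summarize the indemnification clause below: ..."},
            {"role": "assistant", "content": "The clause obligates the vendor to ..."},
        ]
    }

    # Append one JSON object per line (JSONL).
    with open("train.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

    # Sanity-check: every line parses and ends with an assistant turn.
    with open("train.jsonl", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            roles = [m["role"] for m in json.loads(line)["messages"]]
            assert roles[-1] == "assistant", f"line {i}: must end with an assistant turn"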
Size. Rules of thumb:
- 100 examples: enough to teach the model a style (output format, tone). Not enough to teach new knowledge.
- 1,000 examples: enough for narrow task adaptation (classification, extraction, structured outputs).
- 10,000 examples: real domain expertise; the model behaves noticeably differently on real test cases.
- 100,000+ examples: diminishing returns vs effort to curate.
For a first project, 1,000-3,000 high-quality examples beat 50,000 mediocre ones every time. Quality dominates quantity: a human-curated set of 1,000 good answers teaches the model far more than 50,000 LLM-generated examples that all look subtly the same.
The 2026 toolchain
The four pieces you need:
- Unsloth (recommended for solo developers). 2× faster than vanilla transformers, lower VRAM, well-documented for LoRA. The recipe of choice in 2026.
- Axolotl (recommended for serious work). More flexible config, supports more architectures, better for production teams.
- PEFT (HuggingFace). The library that implements LoRA / QLoRA underneath both above.
- bitsandbytes. 4-bit quantization for QLoRA.
For this guide we'll use Unsloth.
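A minimal loading sketch, assuming Unsloth's pre-quantized 4-bit mirror of the checkpoint (the repo id below is an assumption; the official Meta checkpoint also works, it just quantizes on the fly):

    from unsloth import FastLanguageModel

    # Load the 70B base in 4-bit (QLoRA): NF4 weights, adapters trained in 16-bit.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.3-70B-Instruct-bnb-4bit",  # assumed repo id
        max_seq_length=2048,
        load_in_4bit=True,
    )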
The hyperparameters that matter
For LoRA on 70B, the values that work in practice (a full training sketch follows this list):
- LoRA rank (r): 16 for narrow tasks, 32 or 64 for broader adaptation. Higher = more capacity, more VRAM, slower train. Start with 16.
- LoRA alpha: 2× rank. So alpha=32 for r=16. Modulates the scaling of LoRA updates.
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all linear layers). 2024-era guides targeted only attention; 2025+ results showed that also targeting the MLP projections gives a real bump.
- Learning rate: 2e-4 for LoRA, 1e-4 for QLoRA. If loss explodes, halve it. If loss stays flat, double it once.
- Batch size: 1 per GPU with gradient_accumulation_steps=8-16. Effective batch size 8-16 is the sweet spot.
- Epochs: 1-3. More epochs on small datasets = overfitting. Watch eval loss; stop when it stops dropping.
- Max seq length: 2048 to start. Longer = more VRAM. Increase if your data needs it.
- Warmup steps: 5-10% of total steps, to avoid loss spikes early on.
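Putting those values together, a training sketch using Unsloth plus TRL's SFTTrainer, continuing from the loading snippet above. Exact keyword arguments drift between trl versions, so treat this as the shape of the recipe rather than copy-paste gospel:

    from datasets import load_dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer
    from unsloth import FastLanguageModel

    # Attach LoRA adapters with the values discussed above.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                      # start narrow; 32-64 for broader adaptation
        lora_alpha=32,             # 2x rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    # Render the messages format into plain text via the model's chat template.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")
    dataset = dataset.map(
        lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,  # effective batch size 16
            learning_rate=1e-4,              # QLoRA; 2e-4 for plain LoRA
            num_train_epochs=3,
            warmup_ratio=0.05,               # 5% warmup
            lr_scheduler_type="cosine",
            optim="adamw_8bit",
            logging_steps=10,
            output_dir="outputs",
            bf16=True,
        ),
    )
    trainer.train()
    model.save_pretrained("my-adapter")  # adapter weights only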
What a working run looks like
A 2,000-example fine-tune of Llama 3.3 70B with QLoRA on a single H100:
- VRAM peak: ~62GB.
- Time per epoch: ~80 minutes.
- Total wall time (3 epochs + checkpointing): ~5 hours.
- Final adapter size: 250MB.
- Train loss start: 1.4. Train loss end: 0.6. Eval loss end: 0.7.
Loss numbers like these are rough guides at best; what matters is whether the model behaves differently on your real test prompts.
How to know it actually worked
Don't trust loss curves. Build a held-out test set of 30-50 inputs you didn't include in training, and compare the base model's outputs with the fine-tuned model's side by side (a minimal comparison loop follows the checklist below). If the fine-tune isn't visibly better on your real domain, the dataset was the problem, not the training.
Things to look for:
- Does it follow your output format consistently?
- Does it use your domain's vocabulary correctly?
- Did it forget how to do other things (catastrophic forgetting)? If it can't do basic math anymore, your dataset was too narrow.
- Run a generic benchmark (HellaSwag, MMLU on a 200-question sample) before and after to detect regressions.
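Here's one way to run the side-by-side, sketched with transformers and PEFT: load the base once in 4-bit, attach the adapter, and use PEFT's disable_adapter() context manager to get base-model outputs without loading the model twice. The adapter path and prompts file are placeholders:

    import json
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    BASE = "meta-llama/Llama-3.3-70B-Instruct"
    ADAPTER = "my-adapter"  # your trained adapter directory

    tok = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(
        BASE,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, ADAPTER)

    def generate(prompt: str) -> str:
        ids = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=512, do_sample=False)
        return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

    # One held-out prompt per line; dump base/tuned pairs for manual review.
    for prompt in open("heldout_prompts.txt", encoding="utf-8"):
        prompt = prompt.strip()
        with model.disable_adapter():       # base model behavior
            base_out = generate(prompt)
        tuned_out = generate(prompt)        # adapter active
        print(json.dumps({"prompt": prompt, "base": base_out, "tuned": tuned_out}))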
Catastrophic forgetting and how to avoid it
LoRA partially mitigates this, since the base weights themselves are never modified, but the adapter's learned updates can still cause regressions on tasks unlike your fine-tuning set, whether the adapter is merged or loaded at runtime.
Mitigations:
- Mix in general data. 10-20% of your training data should be general instruction-following data (e.g. samples from Tulu or OpenAssistant) to keep broad capabilities; see the mixing sketch after this list.
- Lower rank. Higher LoRA rank = more aggressive adaptation = more forgetting risk. r=16 is safer than r=64 for narrow tasks.
- Don't over-train. Watch eval loss. When it stops dropping, stop.
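A data-mixing sketch with HuggingFace datasets. allenai/tulu-3-sft-mixture is one public general-purpose SFT set in the same messages format; any comparable set works, and the 15% ratio is simply the midpoint of the range above:

    from datasets import concatenate_datasets, load_dataset

    domain = load_dataset("json", data_files="train.jsonl", split="train")
    general = load_dataset("allenai/tulu-3-sft-mixture", split="train")

    # Target ~15% general data in the final mix.
    n_general = int(len(domain) * 0.15 / 0.85)
    general = general.shuffle(seed=42).select(range(n_general))

    # Keep only the shared "messages" column so the schemas line up.
    mixed = concatenate_datasets(
        [domain.select_columns(["messages"]), general.select_columns(["messages"])]
    ).shuffle(seed=42)
    mixed.to_json("train_mixed.jsonl")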
When NOT to fine-tune
Most "I should fine-tune this" instincts are wrong. Try these first:
- Better prompting. Most narrow-task wins come from few-shot examples in the system prompt. Often free.
- RAG. If the issue is "the model doesn't know my data," RAG is the answer 95% of the time. Fine-tuning doesn't reliably teach facts; it teaches behavior.
- A bigger frontier model. If Claude 4.7 Opus or GPT-5 already does what you need, the cost-to-benefit usually beats running your own fine-tuned 70B.
Fine-tune when: (a) you need consistent output format that prompting can't enforce, (b) you have privacy/compliance reasons to keep weights local, (c) prompting hits a quality ceiling on your specific task and you have the data to push past it.
Production deployment
Once your adapter is trained, you have two paths:
- Merge the adapter into the base weights. Single combined model, simplest to deploy; you lose the ability to swap adapters. A merge sketch follows this list.
- Keep adapter separate, load at runtime. Frameworks like vLLM and TGI support this. You can host one base model and swap adapters per request.
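A merge sketch for the first path, using PEFT's merge_and_unload(). Merge against an unquantized copy of the base, which for 70B means roughly 140GB of CPU RAM or sharded GPU memory; paths are placeholders:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    BASE = "meta-llama/Llama-3.3-70B-Instruct"

    # Load the base in bf16 (CPU is fine, just slow), apply and fold in the adapter.
    base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, "my-adapter").merge_and_unload()

    # Save a standalone checkpoint that vLLM/TGI can serve directly.
    merged.save_pretrained("llama-3.3-70b-mydomain")
    AutoTokenizer.from_pretrained(BASE).save_pretrained("llama-3.3-70b-mydomain")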
For serving, vLLM with the merged model is the production-grade choice in 2026. Self-hosting vLLM has its own dedicated post.
Further reading
- Unsloth GitHub examples directory.
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023).
- DoRA: Weight-Decomposed Low-Rank Adaptation — the 2024 successor to LoRA, slightly better quality.
- Look up: catastrophic forgetting, parameter-efficient fine-tuning, LoRA vs DoRA, axolotl config examples.