Every team building on LLMs eventually asks the same question: should we fine-tune? Usually they ask after their prompts have ballooned to 3000 tokens, the model is still flaky, and someone read a blog post claiming fine-tuning solves everything.
Usually the answer is no. Or rather: not yet. Prompt engineering, few-shot examples, and RAG — in that order — solve 90% of "the model isn't doing what I want" problems. Fine-tuning is the right answer for the remaining 10%, and getting that 10% right requires knowing when you're actually in it.
This is the decision tree.
What each technique actually does
Quick recap so we're aligned:
- Prompt engineering. Improving the wording, structure, and examples in your prompt to elicit better responses. No model change, no data, no training.
- Few-shot prompting. A subset of prompt engineering: include 2-10 example input/output pairs in the prompt to teach format and behavior.
- RAG. Retrieve relevant documents, inject into the prompt as context, model answers using them. Adds knowledge, not behavior.
- Fine-tuning. Take a base model and continue training it on your data. Changes the model's actual weights, persists across calls, doesn't require sending examples in every prompt.
Each of these solves a different problem.
The decision tree
Given a problem, work this list in order. Don't skip steps.
Step 1: Better prompt + few-shot examples
Most "the model is wrong" problems are prompt problems. Try:
- Be more specific. "Summarize this" → "Summarize this in 3 bullet points, each starting with an action verb."
- Add few-shot examples. Show the model 3-5 input/output pairs that match what you want.
- Use a stronger model. Sometimes Claude Sonnet beats Haiku on the same prompt; pay 4× to skip the engineering.
- Constrain output. Use structured outputs / tool use / JSON mode to force the shape.
- Re-read your system prompt. If it's longer than 1000 tokens, it's probably contradicting itself somewhere.
This solves: format compliance, tone, simple classification, basic extraction, most "please answer in this style" problems.
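Concretely, here's a minimal sketch of step 1, assuming the Anthropic Python SDK (the model name and the few-shot pair are placeholders, not recommendations):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Few-shot pairs go in as prior conversation turns, so the model sees
# concrete input/output examples before the real request.
FEW_SHOT = [
    {"role": "user",
     "content": "Summarize: The deploy failed because the config referenced a deleted secret."},
    {"role": "assistant",
     "content": "- Identify the secret the config references\n"
                "- Restore or replace the deleted secret\n"
                "- Re-run the deploy"},
]

def summarize(text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whatever tier you're on
        max_tokens=300,
        system="Summarize in 3 bullet points, each starting with an action verb.",
        messages=FEW_SHOT + [{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.content[0].text
```

Note the system prompt is one specific sentence, not a page of instructions; most of the teaching happens in the example pair.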
Step 2: RAG
If the problem is "the model doesn't know my data" — internal docs, recent events, customer-specific context — RAG is the answer 95% of the time. Fine-tuning won't fix this; the model needs the actual information at inference time.
This solves: "answer questions about our company," "reference our documentation," "summarize this specific document."
Step 3: Better tool use / agent design
If the model is "missing capability" — can't search the web, can't run code, can't update a database — give it tools. Don't fine-tune it to fake the capability.
This solves: "I want it to actually do things," "I want it to fetch fresh data," "I want it to interact with our systems."
Step 4: Fine-tuning
Only now consider fine-tuning. By this point you've exhausted the cheaper options. Fine-tuning is right when:
- Format / behavior is consistent but prompts can't reliably enforce it. You need every output to look exactly like X, and even with 10 few-shot examples and structured outputs, 5% of outputs still drift.
- Your prompt is too long to be cost-effective. A 5000-token prompt sent on every request is expensive. Fine-tuning bakes the behavior into weights so the prompt drops to 200 tokens.
- Latency matters and tokens are the bottleneck. Same as above: fewer tokens in = faster response.
- Privacy / compliance requires running locally. Fine-tuned open-weight model deployed in-house.
- You have specific domain language the base model fumbles. Legal jargon, medical abbreviations, trading terminology where the base model defaults to layperson explanations.
Fine-tuning does NOT teach new facts reliably. If the model needs to know your company's API endpoints, fine-tuning will sort of work but RAG works better. Fine-tuning teaches behavior, not knowledge.
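When you do get here, the training set is typically chat-formatted prompt/completion pairs, one JSON object per line. A sketch of producing that file (the category label and field names are illustrative; check your provider's exact spec):

```python
import json

# One JSON object per line; each pairs the short production prompt with
# the exact output you want baked into the weights. Field names follow
# the common chat convention -- check your provider's spec.
examples = [
    {"messages": [
        {"role": "user", "content": "Ticket: Refund not received after 10 days."},
        {"role": "assistant", "content": "billing/refund_delay"},
    ]},
    # ... 500+ more, covering every category and edge case
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```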
Real numbers
A team I worked with ran this experiment in 2025. Task: classify customer support tickets into 18 categories.
- Baseline (Sonnet, no examples): 71% accuracy.
- + 5 few-shot examples: 84% accuracy.
- + stronger system prompt with category descriptions: 88% accuracy.
- Switch to Opus: 92% accuracy. ($60 vs $5 per 1k classifications.)
- Fine-tuned Haiku on 800 labeled examples: 91% accuracy. ($0.40 per 1k classifications.)
The fine-tuned Haiku matched Opus on accuracy at 1/150th the cost. That's when fine-tuning paid off — at scale, with a measurable gap, on a stable task.
If they'd had 100 classifications/day, the fine-tuning effort wouldn't have been worth it. They had 50,000/day. The math is brutal.
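To make "brutal" concrete, a back-of-envelope break-even calculation using the numbers above plus a hypothetical one-off fine-tuning cost:

```python
# Cost per 1k classifications, from the experiment above.
OPUS_PER_1K = 60.00
FT_HAIKU_PER_1K = 0.40
ONE_OFF_FT_COST = 2_000.00  # hypothetical: labeling + training + eval effort

def breakeven_days(daily_volume: int) -> float:
    daily_saving = daily_volume / 1000 * (OPUS_PER_1K - FT_HAIKU_PER_1K)
    return ONE_OFF_FT_COST / daily_saving

print(breakeven_days(50_000))  # ~0.7 days: pays for itself almost immediately
print(breakeven_days(100))     # ~336 days: probably not worth it
```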
When fine-tuning fails (or wastes time)
Known anti-patterns:
- Fine-tuning on under 200 examples. You'll teach style, not capability. Save the time.
- Fine-tuning to teach facts. "My product launched in March 2026" — fine-tuning won't reliably encode this. Use RAG.
- Fine-tuning on synthetic data without verification. LLM-generated training data has bugs. The model learns the bugs.
- Fine-tuning instead of debugging the prompt. If prompt iteration would fix it, do that first. Fine-tuning a flaky problem just makes it a flaky problem you can't easily change.
- Fine-tuning a frontier closed-source model when open-weight + RAG would do. Locks you into the provider's pricing forever.
What about LoRA on open-weight models?
LoRA changes the math significantly. Compared to fine-tuning a frontier model like GPT-4 through a provider's API, LoRA on Llama 3.3 70B is roughly:
- 100× cheaper to train.
- 1000× cheaper to deploy.
- Adapter is 200MB, easy to swap or version.
- Same fundamental tradeoffs (still teaches behavior, not facts).
With LoRA, the threshold for "is fine-tuning worth it?" drops a lot. A team with 5,000 classifications/day might break even on a LoRA fine-tune where they wouldn't on a frontier-API fine-tune.
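For scale, here's roughly what that setup looks like with Hugging Face's `peft` library; the rank and target modules are common defaults, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Loading a 70B base assumes multi-GPU hardware; the point here is how
# small the LoRA config itself is.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

config = LoraConfig(
    r=16,                                 # adapter rank: small r = small adapter file
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```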
The hidden cost of fine-tuning
Three costs people forget:
- Maintenance. The base model updates regularly, but your fine-tuned variant doesn't get those upgrades for free. When Llama 3.4 ships and is 8% better on your tasks, you have to re-fine-tune.
- Eval debt. Fine-tuning needs a held-out test set. If you didn't have evals before, you have to build them now. Three days of work most teams skip and regret.
- Catastrophic forgetting. Narrow fine-tuning makes the model worse at general tasks. If your product expands beyond the training distribution, you'll see regressions you didn't anticipate.
A pragmatic order of operations
For any "the model isn't doing what I want" problem:
- Read your prompt out loud. Is it clear?
- Add 3-5 few-shot examples.
- Try the next stronger model (Haiku → Sonnet → Opus). Note the cost delta.
- Add structured outputs / tool use to constrain shape.
- If it's a knowledge problem: add RAG.
- If it's a capability problem: add tools.
- Now benchmark against your eval set. What's the gap?
- If the gap is small or you can pay for the bigger model: stop. You're done.
- If the gap is large, the task is stable, and you have 500+ labeled examples: fine-tune.
- If you fine-tune, run the same eval set. Compare. Be willing to roll back (a sketch of that comparison follows this list).
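A sketch of that final comparison, assuming hypothetical `classify_prompted` and `classify_finetuned` wrappers around your two variants:

```python
def accuracy(classify, eval_set) -> float:
    """classify: callable text -> label. eval_set: list of (text, gold_label)."""
    correct = sum(classify(text) == gold for text, gold in eval_set)
    return correct / len(eval_set)

# classify_prompted / classify_finetuned wrap your two model calls (hypothetical).
# Keep the baseline unless the fine-tune clearly wins on the SAME eval set:
#   if accuracy(classify_finetuned, eval_set) <= accuracy(classify_prompted, eval_set):
#       roll back to the prompted baseline
```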
When NOT to fine-tune at all
- Pre-MVP. The fastest iteration loop is prompt + base model. Don't lock yourself into weights you'll change next month.
- Tiny dataset. Under 200 examples is just style transfer. Use few-shot.
- Rapidly changing requirements. If your product's needs shift weekly, you can't re-fine-tune that fast. Stick with prompts.
- You haven't measured whether prompt engineering is enough. No eval set means you're flying blind; fine-tuning makes the bug harder to find.
Further reading
- The fine-tuning-llama-locally post in this Learn library.
- Fine-tuning vs RAG — Microsoft Research piece, still relevant.
- PEFT survey (Hu et al., 2024) — overview of LoRA, DoRA, and adapter methods.
- Look up: catastrophic forgetting, instruction tuning, parameter-efficient fine-tuning, LoRA vs RAG.