Every team building on LLMs eventually asks the same question: should we fine-tune? Usually they ask after their prompts have ballooned to 3000 tokens, the model is still flaky, and someone read a blog post claiming fine-tuning solves everything.
Usually the answer is no. Or rather: not yet. Prompt engineering, few-shot examples, and RAG — in that order — solve 90% of "the model isn't doing what I want" problems. Fine-tuning is the right answer for the remaining 10%, and getting that 10% right requires knowing when you're actually in it.
This is the decision tree.
What each technique actually does
Quick recap so we're aligned:
- Prompt engineering. Improving the wording, structure, and examples in your prompt to elicit better responses. No model change, no data, no training.
- Few-shot prompting. A subset of prompt engineering: include 2-10 example input/output pairs in the prompt to teach format and behavior.
- RAG. Retrieve relevant documents, inject into the prompt as context, model answers using them. Adds knowledge, not behavior.
- Fine-tuning. Take a base model and continue training it on your data. Changes the model's actual weights, persists across calls, doesn't require sending examples in every prompt.
Each of these solves a different problem.
The decision tree
Given a problem, work this list in order. Don't skip steps.
Step 1: Better prompt + few-shot examples
Most "the model is wrong" problems are prompt problems. Try:
- Be more specific. "Summarize this" → "Summarize this in 3 bullet points, each starting with an action verb."
- Add few-shot examples. Show the model 3-5 input/output pairs that match what you want.
- Use a stronger model. Sometimes Claude Sonnet beats Haiku on the same prompt; pay 4× to skip the engineering.
- Constrain output. Use structured outputs / tool use / JSON mode to force the shape.
- Re-read your system prompt. If it's longer than 1000 tokens, it's probably contradicting itself somewhere.
This solves: format compliance, tone, simple classification, basic extraction, most "please answer in this style" problems.
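Concretely, here's a minimal sketch of step 1, assuming the Anthropic Python SDK (the model name and the few-shot pair are placeholders, not recommendations):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Few-shot pairs go in as prior conversation turns, so the model sees
# concrete input/output examples before the real request.
FEW_SHOT = [
    {"role": "user",
     "content": "Summarize: The deploy failed because the config referenced a deleted secret."},
    {"role": "assistant",
     "content": "- Identify the secret the config references\n"
                "- Restore or replace the deleted secret\n"
                "- Re-run the deploy"},
]

def summarize(text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whatever tier you're on
        max_tokens=300,
        system="Summarize in 3 bullet points, each starting with an action verb.",
        messages=FEW_SHOT + [{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.content[0].text
```

Note the system prompt is one specific sentence, not a page of instructions; most of the teaching happens in the example pair.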
Step 2: RAG
If the problem is "the model doesn't know my data" — internal docs, recent events, customer-specific context — RAG is the answer 95% of the time. Fine-tuning won't fix this; the model needs the actual information at inference time.
This solves: "answer questions about our company," "reference our documentation," "summarize this specific document."
Step 3: Better tool use / agent design
If the model is "missing capability" — can't search the web, can't run code, can't update a database — give it tools. Don't fine-tune it to fake the capability.
This solves: "I want it to actually do things," "I want it to fetch fresh data," "I want it to interact with our systems."
Step 4: Fine-tuning
Only now consider fine-tuning. By this point you've exhausted the cheaper options. Fine-tuning is right when:
- Format / behavior is consistent but prompts can't reliably enforce it. You need every output to look exactly like X, and even with 10 few-shot examples and structured outputs, 5% of outputs still drift.
- Your prompt is too long to be cost-effective. A 5000-token prompt sent on every request is expensive. Fine-tuning bakes the behavior into weights so the prompt drops to 200 tokens.
- Latency matters and tokens are the bottleneck. Same as above: fewer tokens in = faster response.
- Privacy / compliance requires running locally. Fine-tuned open-weight model deployed in-house.
- You have specific domain language the base model fumbles. Legal jargon, medical abbreviations, trading terminology where the base model defaults to layperson explanations.
Fine-tuning does NOT teach new facts reliably. If the model needs to know your company's API endpoints, fine-tuning will sort of work but RAG works better. Fine-tuning teaches behavior, not knowledge.
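When you do get here, the training set is typically chat-formatted prompt/completion pairs, one JSON object per line. A sketch of producing that file (the category label and field names are illustrative; check your provider's exact spec):

```python
import json

# One JSON object per line; each pairs the short production prompt with
# the exact output you want baked into the weights. Field names follow
# the common chat convention -- check your provider's spec.
examples = [
    {"messages": [
        {"role": "user", "content": "Ticket: Refund not received after 10 days."},
        {"role": "assistant", "content": "billing/refund_delay"},
    ]},
    # ... 500+ more, covering every category and edge case
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```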
Real numbers
A team I worked with ran this experiment in 2025. Task: classify customer support tickets into 18 categories.
- Baseline (Sonnet, no examples): 71% accuracy.
- + 5 few-shot examples: 84% accuracy.
- + stronger system prompt with category descriptions: 88% accuracy.
- Switch to Opus: 92% accuracy. ($60 vs $5 per 1k classifications.)
- Fine-tuned Haiku on 800 labeled examples: 91% accuracy. ($0.40 per 1k classifications.)
The fine-tuned Haiku matched Opus on accuracy at 1/150th the cost. That's when fine-tuning paid off — at scale, with a measurable gap, on a stable task.
If they'd had 100 classifications/day, the fine-tuning effort wouldn't have been worth it. They had 50,000/day. The math is brutal.
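To make "brutal" concrete, a back-of-envelope break-even calculation using the numbers above plus a hypothetical one-off fine-tuning cost:

```python
# Cost per 1k classifications, from the experiment above.
OPUS_PER_1K = 60.00
FT_HAIKU_PER_1K = 0.40
ONE_OFF_FT_COST = 2_000.00  # hypothetical: labeling + training + eval effort

def breakeven_days(daily_volume: int) -> float:
    daily_saving = daily_volume / 1000 * (OPUS_PER_1K - FT_HAIKU_PER_1K)
    return ONE_OFF_FT_COST / daily_saving

print(breakeven_days(50_000))  # ~0.7 days: pays for itself almost immediately
print(breakeven_days(100))     # ~336 days: probably not worth it
```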
When fine-tuning fails (or wastes time)
Known anti-patterns:
- Fine-tuning on under 200 examples. You'll teach style, not capability. Save the time.
- Fine-tuning to teach facts. "My product launched in March 2026" — fine-tuning won't reliably encode this. Use RAG.
- Fine-tuning on synthetic data without verification. LLM-generated training data has bugs. The model learns the bugs.
- Fine-tuning instead of debugging the prompt. If prompt iteration would fix it, do that first. Fine-tuning a flaky problem just makes it a flaky problem you can't easily change.
- Fine-tuning a frontier closed-source model when open-weight + RAG would do. Locks you into the provider's pricing forever.
What about LoRA on open-weight models?
LoRA changes the math significantly. Compared to fine-tuning a frontier model like GPT-4 through a provider's API, LoRA on Llama 3.3 70B is roughly:
- 100× cheaper to train.
- 1000× cheaper to deploy.
- Adapter is 200MB, easy to swap or version.
- Same fundamental tradeoffs (still teaches behavior, not facts).
With LoRA, the threshold for "is fine-tuning worth it?" drops a lot. A team with 5,000 classifications/day might break even on a LoRA fine-tune where they wouldn't on a frontier-API fine-tune.
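For scale, here's roughly what that setup looks like with Hugging Face's `peft` library; the rank and target modules are common defaults, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Loading a 70B base assumes multi-GPU hardware; the point here is how
# small the LoRA config itself is.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

config = LoraConfig(
    r=16,                                 # adapter rank: small r = small adapter file
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```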
The hidden cost of fine-tuning
Three costs people forget:
- Maintenance. The base model updates regularly, but your fine-tuned variant doesn't get those upgrades for free. When Llama 3.4 ships and is 8% better on your tasks, you have to re-fine-tune.
- Eval debt. Fine-tuning needs a held-out test set. If you didn't have evals before, you have to build them now. Three days of work most teams skip and regret.
- Catastrophic forgetting. Narrow fine-tuning makes the model worse at general tasks. If your product expands beyond the training distribution, you'll see regressions you didn't anticipate.
A pragmatic order of operations
For any "the model isn't doing what I want" problem:
- Read your prompt out loud. Is it clear?
- Add 3-5 few-shot examples.
- Try the next stronger model (Haiku → Sonnet → Opus). Note the cost delta.
- Add structured outputs / tool use to constrain shape.
- If it's a knowledge problem: add RAG.
- If it's a capability problem: add tools.
- Now benchmark against your eval set. What's the gap?
- If the gap is small or you can pay for the bigger model: stop. You're done.
- If the gap is large, the task is stable, and you have 500+ labeled examples: fine-tune.
- If you fine-tune, run the same eval set. Compare. Be willing to roll back (a sketch of that comparison follows this list).
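A sketch of that final comparison, assuming hypothetical `classify_prompted` and `classify_finetuned` wrappers around your two variants:

```python
def accuracy(classify, eval_set) -> float:
    """classify: callable text -> label. eval_set: list of (text, gold_label)."""
    correct = sum(classify(text) == gold for text, gold in eval_set)
    return correct / len(eval_set)

# classify_prompted / classify_finetuned wrap your two model calls (hypothetical).
# Keep the baseline unless the fine-tune clearly wins on the SAME eval set:
#   if accuracy(classify_finetuned, eval_set) <= accuracy(classify_prompted, eval_set):
#       roll back to the prompted baseline
```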
When NOT to fine-tune at all
- Pre-MVP. The fastest iteration loop is prompt + base model. Don't lock yourself into weights you'll change next month.
- Tiny dataset. Under 200 examples is just style transfer. Use few-shot.
- Rapidly changing requirements. If your product's needs shift weekly, you can't re-fine-tune that fast. Stick with prompts.
- You haven't measured whether prompt engineering is enough. No eval set means you're flying blind; fine-tuning makes the bug harder to find.
Further reading
- The fine-tuning-llama-locally post in this Learn library.
- Fine-tuning vs RAG — Microsoft Research piece, still relevant.
- PEFT survey (Hu et al., 2024) — overview of LoRA, DoRA, and adapter methods.
- Look up: catastrophic forgetting, instruction tuning, parameter-efficient fine-tuning, LoRA vs RAG.