
When fine-tuning beats prompt engineering (and when it doesn't)

Most teams jump to fine-tuning too early. The decision tree, the actual numbers, and the order to try things in.

Every team building on LLMs eventually asks the same question: should we fine-tune? Usually they ask after their prompts have ballooned to 3000 tokens, the model is still flaky, and someone read a blog post claiming fine-tuning solves everything.

Usually the answer is no. Or rather: not yet. Prompt engineering, RAG, and few-shot examples — in that order — solve 90% of "the model isn't doing what I want" problems. Fine-tuning is the right answer for the remaining 10%, and getting that 10% right requires knowing when you're actually in it.

This is the decision tree.

What each technique actually does

Quick recap so we're aligned:

  • Prompt engineering. Improving the wording, structure, and examples in your prompt to elicit better responses. No model change, no data, no training.
  • Few-shot prompting. A subset of prompt engineering: include 2-10 example input/output pairs in the prompt to teach format and behavior.
  • RAG. Retrieve relevant documents, inject into the prompt as context, model answers using them. Adds knowledge, not behavior.
  • Fine-tuning. Take a base model and continue training it on your data. Changes the model's actual weights, persists across calls, doesn't require sending examples in every prompt.

Each of these solves a different problem.

The decision tree

Given a problem, work this list in order. Don't skip steps.

Step 1: Better prompt + few-shot examples

Most "the model is wrong" problems are prompt problems. Try:

  • Be more specific. "Summarize this" → "Summarize this in 3 bullet points, each starting with an action verb."
  • Add few-shot examples. Show the model 3-5 input/output pairs that match what you want.
  • Use a stronger model. Sometimes Claude Sonnet beats Haiku at the same prompt; pay 4× to skip the engineering.
  • Constrain output. Use structured outputs / tool use / JSON mode to force the shape.
  • Re-read your system prompt. If it's longer than 1000 tokens, it's probably contradicting itself somewhere.

This solves: format compliance, tone, simple classification, basic extraction, most "please answer in this style" problems.
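The few-shot step above is mechanical enough to sketch. Here's a minimal builder for a chat-style message list that interleaves example pairs before the real query (the message shape mirrors common chat APIs; the function name and example strings are illustrative):

```python
def build_few_shot_messages(system_prompt, examples, query):
    """Build a chat-style message list with few-shot example pairs.

    `examples` is a list of (input, output) tuples presented to the model
    as prior user/assistant turns before the real query.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

msgs = build_few_shot_messages(
    "Summarize in 3 bullet points, each starting with an action verb.",
    [("Meeting notes: the team reviewed the Q3 roadmap and cut two features.",
      "- Review Q3 roadmap\n- Cut two features\n- Confirm revised scope")],
    "Summarize this incident report: the API was down for 40 minutes.",
)
```

Each example pair costs tokens on every call, which is exactly the cost that fine-tuning later removes.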

Step 2: RAG

If the problem is "the model doesn't know my data" — internal docs, recent events, customer-specific context — RAG is the answer 95% of the time. Fine-tuning won't fix this; the model needs the actual information at inference time.

This solves: "answer questions about our company," "reference our documentation," "summarize this specific document."
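The whole RAG pattern fits in a few lines. This sketch uses naive word-overlap as a stand-in for a real embedding-based retriever — the point is the shape: retrieve, then inject into the prompt (all names and documents here are illustrative):

```python
def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query (a toy stand-in for an
    embedding-based retriever) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, docs):
    """Inject the retrieved documents into the prompt as context."""
    context = "\n---\n".join(retrieve(query, docs))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "refund policy returns accepted within 30 days",
    "shipping times are 2 business days",
    "we hire remotely",
]
prompt = build_rag_prompt("what is the refund policy", docs)
```

Note what this buys you that fine-tuning can't: update `docs` and the model's answers update on the next call, with no retraining.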

Step 3: Better tool use / agent design

If the model is "missing capability" — can't search the web, can't run code, can't update a database — give it tools. Don't fine-tune it to fake the capability.

This solves: "I want it to actually do things," "I want it to fetch fresh data," "I want it to interact with our systems."
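Giving the model a tool means two things: a schema the model sees, and a dispatcher your code runs when the model requests a call. A minimal sketch, with an invented `lookup_order` tool (the schema shape follows the JSON-Schema style most chat APIs accept, but field names vary by provider):

```python
import json

# Tool schema the model sees. The name, description, and fields here
# are illustrative, not from any real product.
TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch an order's current status from the order database.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def dispatch(tool_name, tool_input):
    """Route a model-requested tool call to real code. The model only
    ever sees the JSON result, never the implementation."""
    if tool_name == "lookup_order":
        # Stand-in for a real database query.
        return json.dumps({"order_id": tool_input["order_id"],
                           "status": "shipped"})
    raise ValueError(f"unknown tool: {tool_name}")

result = json.loads(dispatch("lookup_order", {"order_id": "A123"}))
```

Fine-tuning the model to *recite* order statuses would just bake in stale data; the tool fetches the real thing at inference time.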

Step 4: Fine-tuning

Only now consider fine-tuning. By this point you've exhausted the cheaper options. Fine-tuning is right when:

  • Format / behavior is consistent and prompts can't reliably enforce it. You need every output to look exactly like X, and even with 10 few-shot examples and structured outputs, 5% drift through.
  • Your prompt is too long to be cost-effective. A 5000-token prompt sent on every request is expensive. Fine-tuning bakes the behavior into weights so the prompt drops to 200 tokens.
  • Latency matters and tokens are the bottleneck. Same as above: fewer tokens in = faster response.
  • Privacy / compliance requires running locally. Fine-tuned open-weight model deployed in-house.
  • You have specific domain language the base model fumbles. Legal jargon, medical abbreviations, trading terminology where the base model defaults to layperson explanations.

Fine-tuning does NOT teach new facts reliably. If the model needs to know your company's API endpoints, fine-tuning will sort of work but RAG works better. Fine-tuning teaches behavior, not knowledge.
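If you do reach this step, the training data for most fine-tuning APIs is just chat transcripts serialized as JSONL, one example per line. A sketch of the shape, using the ticket-classification task from the next section (field names vary slightly by provider; the example content is invented):

```python
import json

# One training example = one complete chat transcript ending in the
# exact output you want the model to learn to produce.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the ticket into one of 18 categories."},
        {"role": "user", "content": "My invoice shows a duplicate charge."},
        {"role": "assistant", "content": "billing/duplicate-charge"},
    ]},
]

def to_jsonl(rows):
    """Serialize examples as JSONL: one JSON record per line."""
    return "\n".join(json.dumps(r) for r in rows)

jsonl = to_jsonl(examples)
```

Notice the assistant turn is the *behavior* (the category label), not a fact dump — consistent with the point above that fine-tuning teaches behavior, not knowledge.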

Real numbers

A team I worked with ran this experiment in 2025. Task: classify customer support tickets into 18 categories.

  • Baseline (Sonnet, no examples): 71% accuracy.
  • + 5 few-shot examples: 84% accuracy.
  • + stronger system prompt with category descriptions: 88% accuracy.
  • Switch to Opus: 92% accuracy. ($60 vs $5 per 1k classifications.)
  • Fine-tuned Haiku on 800 labeled examples: 91% accuracy. ($0.40 per 1k classifications.)

The fine-tuned Haiku matched Opus on accuracy at 1/150th the cost. That's when fine-tuning paid off — at scale, with a measurable gap, on a stable task.

If they'd had 100 classifications/day, the fine-tuning effort wouldn't have been worth it. They had 50,000/day. The math is brutal.
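The math is simple enough to write down. Using the per-1k prices from the experiment above and an *assumed* $10k of one-off engineering and training cost (that figure is my placeholder, not from the experiment):

```python
def days_to_break_even(volume_per_day, cost_big_per_1k, cost_ft_per_1k,
                       ft_effort_cost):
    """Days until per-request savings repay the one-off fine-tuning cost."""
    daily_savings = (cost_big_per_1k - cost_ft_per_1k) * volume_per_day / 1000
    return ft_effort_cost / daily_savings

# Opus at $60/1k vs fine-tuned Haiku at $0.40/1k (prices from the
# experiment above); $10k effort cost is an assumption.
high_volume = days_to_break_even(50_000, 60.0, 0.40, 10_000)  # ~3.4 days
low_volume = days_to_break_even(100, 60.0, 0.40, 10_000)      # ~4.6 years
```

At 50,000/day the fine-tune pays for itself in under a week; at 100/day it never realistically does. Same task, same accuracy gap, opposite decision.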

When fine-tuning fails (or wastes time)

Known anti-patterns:

  • Fine-tuning on under 200 examples. You'll teach style, not capability. Save the time.
  • Fine-tuning to teach facts. "My product launched in March 2026" — fine-tuning won't reliably encode this. Use RAG.
  • Fine-tuning on synthetic data without verification. LLM-generated training data has bugs. The model learns the bugs.
  • Fine-tuning instead of debugging the prompt. If prompt iteration would fix it, do that first. Fine-tuning a flaky problem just makes it a flaky problem you can't easily change.
  • Fine-tuning a frontier closed-source model when open-weight + RAG would do. Locks you into the provider's pricing forever.

What about LoRA on open-weight models?

LoRA changes the math significantly. Compared to full-parameter fine-tuning of a frontier-scale model, LoRA on Llama 3.3 70B is roughly:

  • 100× cheaper to train.
  • 1000× cheaper to deploy.
  • Adapter is 200MB, easy to swap or version.
  • Same fundamental tradeoffs (still teaches behavior, not facts).

With LoRA, the threshold for "is fine-tuning worth it?" drops a lot. A team with 5,000 classifications/day might break even on a LoRA fine-tune where they wouldn't on a frontier-API fine-tune.
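The "adapter is ~200MB" claim falls out of a simple parameter count: LoRA replaces each full weight update with two low-rank factors. A back-of-envelope sketch (the layer count and hidden size loosely match a 70B-class model; treating all target projections as square `d_model × d_model` is a simplification that GQA models violate, so take the result as order-of-magnitude):

```python
def lora_adapter_params(n_layers, d_model, n_target_matrices, rank):
    """Rough LoRA parameter count. Each targeted d_model x d_model
    weight gets two factors: A (rank x d_model) and B (d_model x rank),
    i.e. 2 * rank * d_model parameters per matrix."""
    per_matrix = 2 * rank * d_model
    return n_layers * n_target_matrices * per_matrix

# 80 layers, d_model 8192, rank 16 on 4 attention projections per layer.
params = lora_adapter_params(n_layers=80, d_model=8192,
                             n_target_matrices=4, rank=16)
size_mb = params * 2 / 1e6  # bf16 = 2 bytes per parameter, ~168 MB
```

Roughly 84M trainable parameters against 70B frozen ones — about 0.1% of the model — which is why the train cost and the artifact you have to version are both so small.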

The hidden cost of fine-tuning

Three costs people forget:

  1. Maintenance. Base models update regularly, and your fine-tuned variant doesn't get those upgrades for free. When Llama 3.4 ships and is 8% better on your tasks, you have to re-fine-tune.
  2. Eval debt. Fine-tuning needs a held-out test set. If you didn't have evals before, you have to build them now. Three days of work most teams skip and regret.
  3. Catastrophic forgetting. Narrow fine-tuning makes the model worse at general tasks. If your product expands beyond the training distribution, you'll see regressions you didn't anticipate.

A pragmatic order of operations

For any "the model isn't doing what I want" problem:

  1. Read your prompt out loud. Is it clear?
  2. Add 3-5 few-shot examples.
  3. Try the next stronger model (Haiku → Sonnet → Opus). Note the cost delta.
  4. Add structured outputs / tool use to constrain shape.
  5. If it's a knowledge problem: add RAG.
  6. If it's a capability problem: add tools.
  7. Now benchmark against your eval set. What's the gap?
  8. If the gap is small or you can pay for the bigger model: stop. You're done.
  9. If the gap is large, the task is stable, and you have 500+ labeled examples: fine-tune.
  10. If you fine-tune, run the same eval set. Compare. Be willing to roll back.
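Steps 7 and 10 both need the same thing: one eval harness you run against every candidate. The harness itself is trivial — the discipline is using the identical held-out set for each comparison (the toy labels and the `predict` stand-in below are illustrative):

```python
def accuracy(predict, eval_set):
    """Fraction of held-out examples where the prediction matches gold.

    `predict` is any callable: a prompted base model, a fine-tuned
    model, or a stub. `eval_set` is a list of (input, gold) pairs.
    """
    correct = sum(1 for x, gold in eval_set if predict(x) == gold)
    return correct / len(eval_set)

# Toy held-out set; in practice this is hundreds of labeled examples
# that were never shown to the fine-tune.
eval_set = [
    ("duplicate charge on my invoice", "billing"),
    ("app crashes on login", "bug"),
]
baseline = lambda x: "billing"   # stand-in for the prompted model
acc = accuracy(baseline, eval_set)
```

Swap `baseline` for each candidate system, compare the numbers, and the fine-tune-or-not decision stops being a matter of opinion.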

When NOT to fine-tune at all

  • Pre-MVP. The fastest iteration loop is prompt + base model. Don't lock yourself into weights you'll change next month.
  • Tiny dataset. Under 200 examples is just style transfer. Use few-shot.
  • Rapidly changing requirements. If your product's needs shift weekly, you can't re-fine-tune that fast. Stick with prompts.
  • You haven't measured whether prompt engineering is enough. No eval set means you're flying blind; fine-tuning makes the bug harder to find.

Further reading

  • The fine-tuning-llama-locally post in this Learn library.
  • Fine-tuning vs RAG — Microsoft Research piece, still relevant.
  • PEFT survey (Hu et al, 2024) — overview of LoRA, DoRA, and adapter methods.
  • Look up: catastrophic forgetting, instruction tuning, parameter-efficient fine-tuning, LoRA vs RAG.

Last updated: 2026-04-29
