

What is fine-tuning, and when do you actually need it?

Fine-tuning trains a model on your data. It sounds like the obvious answer for any custom AI feature — but in 2026, it's almost never the right first move.

Fine-tuning means taking a pre-trained model and continuing to train it on your own data so it learns your task or your style. It's the technique builders ask about most when they want "a custom AI," and the one they most often misapply. In 2026, prompt engineering and RAG cover roughly 80% of "custom" needs; fine-tuning is the right tool for a smaller, more specific set of problems.

How fine-tuning actually works

A pre-trained LLM has fixed weights — billions of numbers learned during pre-training. Fine-tuning adjusts those numbers based on your dataset. Concretely:

  1. You prepare a dataset of input-output pairs. For an instruction-style fine-tune, that's hundreds to tens of thousands of examples like {"prompt": "...", "completion": "..."}.
  2. You run a training job (on OpenAI's fine-tuning API, on Together AI, Modal, or your own GPUs) that updates the model's weights to make your examples more likely outputs. A minimal sketch follows this list.
  3. You get back a new model checkpoint that you call instead of the base model.
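
Here is roughly what those three steps look like against OpenAI's fine-tuning API, as a minimal sketch. It assumes the openai Python SDK and an API key in the environment; the model name and file path are placeholders, and note that OpenAI expects chat-format {"messages": [...]} lines rather than raw prompt/completion pairs:

    # Sketch: the three steps above, via OpenAI's fine-tuning API.
    # Model name and file path are placeholder assumptions.
    from openai import OpenAI

    client = OpenAI()

    # Step 1: upload a JSONL dataset, one chat-format example per line:
    # {"messages": [{"role": "user", "content": "..."},
    #               {"role": "assistant", "content": "..."}]}
    training_file = client.files.create(
        file=open("train.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Step 2: start the training job on a fine-tunable base model.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",  # placeholder; pick a currently supported model
    )

    # Step 3: when the job succeeds, it reports a new model name
    # (something like "ft:gpt-4o-mini:your-org::abc123") to call instead of the base.
    print(client.fine_tuning.jobs.retrieve(job.id).status)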

For open-weight models, the modern approach is LoRA (Low-Rank Adaptation): you train tiny "adapter" matrices alongside frozen base weights, instead of updating the whole model. LoRA fine-tunes are dramatically cheaper (a single GPU, hours, not weeks) and the quality is competitive for most tasks.
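
To make "tiny adapter matrices" concrete, here is a minimal LoRA setup with Hugging Face's peft and transformers libraries. The model name, rank, and target modules are illustrative assumptions, not a tuned recipe:

    # Minimal LoRA setup with Hugging Face peft + transformers.
    # Model name and hyperparameters are illustrative assumptions.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    config = LoraConfig(
        r=16,                                 # rank of the adapter matrices
        lora_alpha=32,                        # scaling factor for adapter updates
        target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()
    # Prints something like "trainable params: ~0.1% of all params":
    # only the small adapters train, while the 8B base weights stay frozen.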

Three things fine-tuning is great at

Style and tone. If you want every output to read like your brand voice, your legal team's writing, or a specific technical register, fine-tuning teaches the model the vibe in a way prompts can only approximate. 200-500 examples of "good" output usually transfer the style.

Structured output reliability. If you need outputs in a strict format (JSON with exact field names, specific markdown templates) and prompt-only approaches keep producing slight variations, fine-tuning can make format errors rare. Try the modern alternatives first, though: structured outputs / JSON mode often solve this without any training.
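
A sketch of that training-free route, using OpenAI's JSON-schema structured outputs; the schema and field names are invented for the example:

    # Sketch: enforcing a strict JSON shape without fine-tuning, via OpenAI's
    # structured outputs. The schema and field names are made-up examples.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ticket_summary",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "summary": {"type": "string"},
                        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                    },
                    "required": ["summary", "priority"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(response.choices[0].message.content)  # should conform to the schema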

Cost reduction at scale. A fine-tuned smaller model (Llama 8B, Mistral 7B, GPT-4.1-mini) can match a much larger frontier model on a narrow task — at 10-50× lower per-token cost. This pays off only for high-volume production workloads (millions of queries/month).
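
The break-even math is worth doing explicitly. Every number in this sketch is a hypothetical placeholder; substitute your provider's actual per-token prices and your own traffic:

    # Hypothetical back-of-the-envelope cost comparison. Every number here is
    # a placeholder assumption, not a real price list.
    queries_per_month = 5_000_000
    tokens_per_query = 1_000            # prompt + completion, averaged

    frontier_price = 5.00 / 1_000_000   # $/token for a frontier model (assumed)
    small_ft_price = 0.20 / 1_000_000   # $/token for a fine-tuned small model (assumed)

    frontier_cost = queries_per_month * tokens_per_query * frontier_price
    small_cost = queries_per_month * tokens_per_query * small_ft_price

    print(f"frontier: ${frontier_cost:,.0f}/mo, fine-tuned small: ${small_cost:,.0f}/mo")
    # -> frontier: $25,000/mo, fine-tuned small: $1,000/mo (a 25x gap at these rates)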

Three things fine-tuning is bad at

Adding new factual knowledge. This is the #1 misconception. Fine-tuning teaches the model patterns, not facts. If you fine-tune on your company's HR policies, the model will sometimes recall them, sometimes hallucinate similar-sounding ones, and generally get worse at things outside your training set. For "the model should know X," use RAG, not fine-tuning.

Tasks where you don't have data. You need on the order of 100-500 high-quality examples before fine-tuning reliably outperforms a good prompt. If you can't write 100 examples of correct output by hand, you don't have a fine-tuning problem yet.

Anything you'll iterate on quickly. Fine-tuning takes hours-to-days per training run, plus eval, plus deployment. Prompt engineering is seconds. For early-stage products where the spec changes weekly, fine-tuning slows you down.

When to actually fine-tune

A decision tree (also sketched as code after the list):

  1. Is your problem solvable by a better prompt? Try that for a week. Most are.
  2. Is your problem about referencing private/recent data? Use RAG.
  3. Do you need consistent style or format the prompt can't enforce? Now consider fine-tuning.
  4. Is your volume high enough that paying for a frontier model on every call is breaking your budget? Fine-tuning a smaller model can cut cost dramatically.
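
The same tree as code, for skimmers. The one-million-queries threshold is an assumption, not a hard rule:

    # The decision tree above as a sketch. The volume threshold is a
    # judgment call, not a hard rule.
    def choose_approach(tried_better_prompt: bool,
                        needs_private_or_recent_data: bool,
                        needs_strict_style_or_format: bool,
                        monthly_queries: int) -> str:
        if not tried_better_prompt:
            return "prompt engineering"         # step 1: most problems end here
        if needs_private_or_recent_data:
            return "RAG"                        # step 2: knowledge lives in retrieval
        if needs_strict_style_or_format:
            return "fine-tuning"                # step 3: style/format is trainable
        if monthly_queries > 1_000_000:
            return "fine-tune a smaller model"  # step 4: cut per-token cost
        return "stick with a prompted frontier model"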

In practice, the cleanest 2026 use cases for fine-tuning are:

  • Customer support replies in a specific brand voice
  • Translating into a domain-specific style (legal, medical)
  • Code generation in a private codebase's idioms
  • Classification tasks at high volume (e.g., support ticket routing)
  • Distilling a frontier model's behavior into a cheaper model you can self-host

What you'll actually pay

A rough 2026 reality check:

  • OpenAI fine-tuning API (GPT-4.1-mini, GPT-4o-mini): roughly $5-25 to train on 1,000-10,000 examples. Inference then runs at 1.5-3× the base model's per-token price.
  • Anthropic doesn't offer general public fine-tuning; it's reserved for enterprise customers.
  • LoRA on open-weight models via Together AI / Modal / Replicate: $5-50 for a 7B-13B model, depending on dataset size.
  • Self-hosted full fine-tune of a 70B model: hundreds of dollars on rented A100/H100 hours, plus you handle deployment.

For most teams, the training bill is the small part. The dominant cost is data prep: collecting, cleaning, and labeling 500-5,000 examples. That's where the work lives.
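
Because the data is the expensive part, a cheap sanity check on the training file before any run is worth it. A minimal sketch, assuming the chat-format JSONL shown earlier:

    # Minimal sanity check for a chat-format JSONL training file before
    # launching a job. Assumes the {"messages": [...]} format used above.
    import json

    ok = bad = 0
    with open("train.jsonl") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
                messages = example["messages"]
                assert messages[-1]["role"] == "assistant"          # must end with a target output
                assert all(m["content"].strip() for m in messages)  # no empty turns
                ok += 1
            except (json.JSONDecodeError, KeyError, AssertionError, IndexError):
                bad += 1
                print(f"line {lineno}: malformed example")

    print(f"{ok} usable examples, {bad} rejected")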

When NOT to fine-tune

  • You haven't yet maxed out prompt engineering and RAG
  • You don't have an eval set to measure whether fine-tuning actually helped (see the sketch after this list)
  • Your training data is small (<100 examples) or noisy
  • The task changes every two weeks
  • You don't have an MLOps person to maintain the trained model long-term
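
On the eval-set point: "actually helped" has to be a number. A minimal before/after comparison on held-out examples might look like this; exact-match is a stand-in metric, and call_model is a placeholder for your own inference wrapper:

    # Sketch: score base vs. fine-tuned model on the same held-out set.
    # `call_model`, the file format, and the model names are placeholders.
    import json

    def call_model(model: str, prompt: str) -> str:
        raise NotImplementedError  # swap in your real inference call

    def exact_match(output: str, expected: str) -> bool:
        return output.strip() == expected.strip()  # stand-in metric; use your own

    def score(model: str, eval_path: str = "eval.jsonl") -> float:
        hits = total = 0
        with open(eval_path) as f:
            for line in f:
                ex = json.loads(line)
                hits += exact_match(call_model(model, ex["prompt"]), ex["expected"])
                total += 1
        return hits / total

    # Fine-tuning "helped" only if this gap is real on a held-out set.
    print(score("base-model"), score("ft:base-model:custom"))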


Last updated: 2026-04-29
