Fine-tuning means taking a pre-trained model and continuing to train it on your own data so it learns your task or your style. It's the technique builders ask about most when they think "I want a custom AI," and the one most often misapplied. In 2026, prompt engineering and RAG solve 80% of "custom" needs; fine-tuning is the right tool for a smaller, specific set of problems.
How fine-tuning actually works
A pre-trained LLM has fixed weights — billions of numbers learned during pre-training. Fine-tuning adjusts those numbers based on your dataset. Concretely:
- You prepare a dataset of input-output pairs. For an instruction-style fine-tune, that's hundreds to tens of thousands of examples like {"prompt": "...", "completion": "..."}.
- You run a training job (on OpenAI's fine-tuning API, on Together AI, Modal, or your own GPUs) that updates the model's weights to make your examples more likely outputs.
- You get back a new model checkpoint that you call instead of the base model.
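Mechanically, step one is just serializing your pairs as JSONL, one example per line. A minimal sketch (the `to_jsonl` helper and the example pairs are illustrative, not any provider's required schema):

```python
import json

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as JSONL lines, one example each."""
    return [json.dumps({"prompt": p, "completion": c}) for p, c in pairs]

lines = to_jsonl([
    ("Summarize: The meeting moved to 3pm.", "Meeting moved to 3pm."),
    ("Summarize: Invoice #42 is 10 days overdue.", "Invoice #42 overdue."),
])
# Each line is a standalone JSON object, ready to write to a train.jsonl
# file and upload to whichever fine-tuning service you use.
```

Note that hosted providers each define their own schema (OpenAI's chat fine-tuning expects a `messages` list rather than prompt/completion keys, for example), so check the provider docs before uploading.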
For open-weight models, the modern approach is LoRA (Low-Rank Adaptation): you train tiny "adapter" matrices alongside frozen base weights, instead of updating the whole model. LoRA fine-tunes are dramatically cheaper (a single GPU, hours, not weeks) and the quality is competitive for most tasks.
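The LoRA idea fits in a few lines: instead of updating a full d×d weight matrix W, you train two small matrices A (r×d) and B (d×r) with rank r much smaller than d, and add their scaled product to the frozen output. A numpy sketch (all the sizes and the alpha scaling here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size, LoRA rank (r << d)
alpha = 16                           # LoRA scaling hyperparameter
W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    # Base output plus the low-rank adapter update: W x + (alpha/r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Because B starts at zero, the adapter is a no-op before training:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: 2*d*r = 8,192 for the adapter vs d*d = 262,144
# for the full matrix at this size, which is why LoRA fits on one GPU.
```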
Three things fine-tuning is great at
Style and tone. If you want every output to read like your brand voice, your legal team's writing, or a specific technical register, fine-tuning teaches the model the vibe in a way prompts can only approximate. 200-500 examples of "good" output usually transfer the style.
Structured output reliability. If you need outputs in a strict format (JSON with exact field names, specific markdown templates) and prompt-only approaches keep producing slight variations, fine-tuning makes the format nearly error-free. Modern alternatives exist, though: structured outputs and JSON mode often solve this without fine-tuning.
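Before reaching for fine-tuning here, measure the failure rate directly. A sketch of a strict-format validator (the three-field schema and the sample outputs are made up for illustration):

```python
import json

REQUIRED_FIELDS = {"intent", "priority", "summary"}  # hypothetical schema

def is_valid(output: str) -> bool:
    """True only if the output is JSON with exactly the required fields."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == REQUIRED_FIELDS

outputs = [
    '{"intent": "refund", "priority": "high", "summary": "Wants refund"}',
    '{"Intent": "refund", "priority": "high", "summary": "Wants refund"}',  # wrong casing
    'Sure! Here is the JSON: {"intent": "refund"}',                         # chatty preamble
]
valid_rate = sum(map(is_valid, outputs)) / len(outputs)
# Only the first output passes, so valid_rate == 1/3 here.
```

If a validator like this already passes nearly every output once structured outputs or JSON mode is enabled, fine-tuning buys you little.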
Cost reduction at scale. A fine-tuned smaller model (Llama 8B, Mistral 7B, GPT-4.1-mini) can match a much larger frontier model on a narrow task — at 10-50× lower per-token cost. This pays off only for high-volume production workloads (millions of queries/month).
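The break-even math is simple enough to sanity-check on a napkin. A sketch with placeholder prices (the per-million-token rates below are assumptions, not any provider's actual pricing):

```python
def monthly_cost(queries, tokens_per_query, usd_per_million_tokens):
    """Monthly token spend for a workload at a given per-token price."""
    return queries * tokens_per_query * usd_per_million_tokens / 1_000_000

QUERIES, TOKENS = 5_000_000, 1_000               # 5M queries/month, ~1k tokens each
frontier = monthly_cost(QUERIES, TOKENS, 10.00)  # assumed frontier-model rate
tuned = monthly_cost(QUERIES, TOKENS, 0.40)      # assumed fine-tuned-8B rate

# A 25x price gap: $50,000/month vs $2,000/month at this volume.
```

At low volume the same gap is noise: 10,000 queries/month at these rates is $100 vs $4, which doesn't come close to justifying the data-prep effort.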
Three things fine-tuning is bad at
Adding new factual knowledge. This is the #1 misconception. Fine-tuning teaches the model patterns, not facts. If you fine-tune on your company's HR policies, the model will sometimes recall them, sometimes hallucinate similar-sounding ones, and generally get worse at things outside your training set. For "the model should know X," use RAG, not fine-tuning.
Tasks where you don't have data. You need at least 100-500 high-quality examples for fine-tuning to outperform a good prompt. If you can't write 100 examples of correct output by hand, you don't have a fine-tuning problem yet.
Anything you'll iterate on quickly. Fine-tuning takes hours-to-days per training run, plus eval, plus deployment. Prompt engineering is seconds. For early-stage products where the spec changes weekly, fine-tuning slows you down.
When to actually fine-tune
A decision tree:
- Is your problem solvable by a better prompt? Try that for a week. Most are.
- Is your problem about referencing private/recent data? Use RAG.
- Do you need consistent style or format the prompt can't enforce? Now consider fine-tuning.
- Is your volume high enough that paying for a frontier model on every call is breaking your budget? Fine-tuning a smaller model can cut cost dramatically.
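The decision tree above is mechanical enough to write down as code. A toy encoding (the function name and flags are ours, not a standard API, and real decisions are fuzzier than booleans):

```python
def choose_approach(prompt_solves_it: bool,
                    needs_private_or_recent_data: bool,
                    needs_strict_style_or_format: bool,
                    high_volume_cost_pressure: bool) -> str:
    """Walk the decision tree in order; earlier answers short-circuit later ones."""
    if prompt_solves_it:
        return "prompt engineering"
    if needs_private_or_recent_data:
        return "RAG"
    if needs_strict_style_or_format or high_volume_cost_pressure:
        return "fine-tuning"
    return "prompt engineering"  # default: keep iterating on the prompt

# e.g. a brand-voice generator that prompts can't pin down:
choice = choose_approach(False, False, True, False)  # -> "fine-tuning"
```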
In practice, the cleanest 2026 use cases for fine-tuning are:
- Customer support replies in a specific brand voice
- Translating into a domain-specific style (legal, medical)
- Code generation in a private codebase's idioms
- Classification tasks at high volume (e.g., support ticket routing)
- Distilling a frontier model's behavior into a cheaper model you can self-host
What you'll actually pay
A rough 2026 reality check:
- OpenAI fine-tuning API (GPT-4.1-mini, GPT-4o-mini): ~$5-25 to train on 1,000-10,000 examples. Inference 1.5-3× the base model price.
- Anthropic doesn't offer general public fine-tuning; it's reserved for enterprise customers.
- LoRA on open-weight models via Together AI / Modal / Replicate: $5-50 for a 7B-13B model, depending on dataset size.
- Self-hosted full fine-tune of a 70B model: hundreds of dollars on rented A100/H100 hours, plus you handle deployment.
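Those training figures fall out of GPU-hour arithmetic. A sketch with assumed rental rates and run times (none of these numbers are quotes):

```python
def training_cost_usd(num_gpus, hours, usd_per_gpu_hour):
    """Rented-GPU cost of one training run."""
    return num_gpus * hours * usd_per_gpu_hour

# A 7B LoRA run: one A100 for ~2 hours at an assumed ~$2.50/hr
lora_7b = training_cost_usd(1, 2, 2.50)    # $5.00
# A full fine-tune of a 70B model: 8 H100s for ~12 hours at ~$3.00/hr
full_70b = training_cost_usd(8, 12, 3.00)  # $288.00
```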
For most teams, the fine-tuning cost itself is small. The real cost is data prep: collecting, cleaning, and labeling 500-5,000 examples. That's where the work lives.
When NOT to fine-tune
- You haven't yet maxed out prompt engineering and RAG
- You don't have an eval set to measure if fine-tuning actually helped
- Your training data is small (<100 examples) or noisy
- The task changes every two weeks
- You don't have an MLOps person to maintain the trained model long-term
Further reading
- What is RAG (Retrieval-Augmented Generation)
- LoRA vs fine-tuning vs RAG: which solves which problem
- Fine-tune a Llama 3 70B locally with LoRA
- When fine-tuning beats prompt engineering (and when it doesn't)
- How to evaluate LLM output quality at scale