When GPT-3 launched in 2020, you could ask it 'how do I bake bread' and it would respond with five more questions you should ask, because that's what its training data looked like — internet text where one question often led to another. The transformation from 'autocomplete on the internet' to 'helpful assistant that answers' is what alignment is. RLHF and DPO are the two dominant techniques labs use to do that.
This article explains both at a level useful for someone who isn't training models but is trying to understand why some models feel sycophantic, why some refuse benign questions, and why model behavior shifts between versions. The math is hairy but the concepts are not.
Where these techniques fit in the pipeline
A modern frontier LLM goes through three rough stages:
- Pre-training — predict next token on trillions of internet tokens. Result: model that knows facts and language structure but doesn't act like an assistant.
- Supervised fine-tuning (SFT) — show it ~10k-100k examples of high-quality 'question → ideal answer' pairs written by humans. Result: model now answers like an assistant but is still inconsistent and easily veers into bad behavior.
- Preference optimization — RLHF or DPO. Show it pairs of answers and tell it which one humans prefer. Result: model whose answers consistently match human preferences (helpful, harmless, honest, well-formatted, etc).
RLHF and DPO are both ways of doing step 3.
RLHF (Reinforcement Learning from Human Feedback)
RLHF was the original technique. OpenAI used it for InstructGPT in 2022, then for ChatGPT, then everyone copied. It works in two sub-steps:
Step A: train a reward model. Show humans pairs of model outputs ('answer A' vs 'answer B' for the same prompt) and ask which is better. Collect tens of thousands of these comparisons. Train a separate neural network — the reward model — to predict which answer humans would prefer.
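As a rough sketch (the function and variable names here are illustrative, not any lab's actual code), the reward model is usually trained with a pairwise loss that pushes the score of the human-preferred answer above the rejected one:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise (Bradley-Terry style) loss for the reward model.
    chosen_ids / rejected_ids: token ids of prompt + preferred answer and
    prompt + rejected answer; reward_model returns one scalar score per sequence."""
    score_chosen = reward_model(chosen_ids)
    score_rejected = reward_model(rejected_ids)
    # -log sigmoid(margin): small only when the preferred answer scores clearly higher
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```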
Step B: use reinforcement learning to update the LLM so that when it generates answers, the reward model gives them high scores. Specifically, an algorithm called PPO (Proximal Policy Optimization) tweaks the LLM's weights to maximize reward while not drifting too far from the original SFT model.
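The quantity PPO actually maximizes is typically the reward-model score minus a penalty for drifting away from the frozen SFT reference. A minimal sketch, assuming per-token log-probabilities have already been computed (names and the beta value are illustrative):

```python
def rlhf_objective(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """Per-answer reward signal for PPO: reward-model score minus a KL-style
    penalty that grows as the trained model's token probabilities drift away
    from the frozen SFT reference model.
    policy_logprobs / ref_logprobs: per-token log-probs of the sampled answer
    under the model being trained and under the frozen reference (tensors)."""
    drift_penalty = (policy_logprobs - ref_logprobs).sum()  # approximate KL term
    return rm_score - beta * drift_penalty
```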
The whole process is delicate. The reward model can be gamed: the LLM finds adversarial outputs that score highly but are actually bad (reward hacking). Training is unstable: push too hard on reward and the model collapses into repetitive nonsense that happens to score well. It requires several large models in memory at once (the LLM being trained, a frozen reference copy, the reward model, and typically a separate value network for PPO). Compute is expensive.
When it works well, RLHF produces models that feel polished and helpful. When it goes wrong, you get sycophancy ('what a great question!'), refusal of benign requests ('I can't help with that' for things that are obviously fine), or inconsistent behavior across topics.
DPO (Direct Preference Optimization)
DPO was introduced in a 2023 paper by Rafailov et al. It got popular fast in 2024 and is now the default at many labs. The key insight: you don't actually need a separate reward model. You can use the preference data directly to update the LLM via a loss function derived from the same math that underlies RLHF.
In practice this means: collect the same pair preference data as RLHF, but instead of training reward model + RL, you do one supervised-learning-style update on the LLM that increases the probability of preferred answers and decreases the probability of rejected ones, regularized so the model doesn't drift too far.
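Concretely, the DPO loss from the Rafailov et al. paper looks like this (a minimal sketch; the variable names are mine, and each log-prob is summed over one full answer):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Raise the probability of the preferred answer and lower the rejected one,
    measured relative to a frozen reference model so the policy can't drift
    arbitrarily far. beta controls how strong that regularization is."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # one supervised-style loss: no reward model, no PPO loop
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```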
Why people switched:
- Simpler. One training step, no reward model, no PPO. Far less code to maintain.
- More stable. The pathological reward-hacking failures of RLHF mostly don't happen with DPO.
- Cheaper. Roughly 2-3× less compute for similar quality results.
- Easier to iterate. You can run more variants quickly.
Llama 3 used DPO. Mistral uses DPO. Most open-weights models use DPO or its variants. Anthropic and OpenAI use a mix of techniques internally and don't fully disclose; both probably use a DPO-like direct optimization plus other proprietary methods.
What they actually feel like in the resulting model
This is where it gets interesting from a user's perspective. The choice of preference data and technique shapes what the model is good at:
- Heavy RLHF on safety preferences → model that's overly cautious, refuses borderline questions, adds disclaimers.
- Heavy DPO on helpfulness preferences → model that almost always tries to answer, sometimes wrongly because it's been trained to never say 'I don't know'.
- Preference data from a narrow group of annotators → model with that group's blind spots and stylistic tics. This is part of why many models had a similar 'voice' in 2024-2025: they were largely tuned on data from overlapping contractor pools of annotators.
- Constitutional AI / RLAIF (RL from AI feedback) — Anthropic's variant, where an AI critic guided by a written set of principles supplies some of the preference labels instead of humans. Cheaper, but the AI critic introduces its own biases.
Newer variants you'll see mentioned
- IPO (Identity Preference Optimization) — DPO variant that adds a regularization target so the model stops pushing the preference margin once it is already large enough, which curbs DPO's tendency to overfit the preference data (sketched in the code after this list).
- KTO (Kahneman-Tversky Optimization) — uses just 'good / bad' labels instead of A vs B comparisons, which is sometimes cheaper to collect.
- ORPO (Odds Ratio Preference Optimization) — combines SFT and preference learning into a single step.
- GRPO (Group Relative Policy Optimization) — used by DeepSeek for DeepSeek-V3 and the R1 reasoning models; it samples a group of outputs per prompt and scores each relative to the group average, removing the need for a separate value model.
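To make 'family of techniques' concrete: several of these variants keep exactly the same inputs as DPO and change only the final loss expression. For example, a sketch of the IPO objective as it is commonly implemented (e.g. the ipo loss option in trl), reusing the same log-ratio margin as the DPO sketch above:

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """IPO: instead of pushing the preference margin ever higher (as DPO does),
    regress it toward a fixed target of 1 / (2 * beta), so the gradient fades
    once the preferred answer is already sufficiently favored."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```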
For non-researchers, the takeaway is: 'preference optimization' is a family of techniques, and labs are constantly trying new ones. Different families produce subtly different model personalities and capabilities.
What this means if you're not training models
Model behavior is a design choice, not a fact. When Claude refuses something GPT-5 answers, that's a preference data choice the labs made. When a new model version 'feels different', the SFT and preference data probably changed.
Open-source means tunable. You can DPO-tune an open-weights model like Llama 3 or Qwen on your own preference data with consumer hardware. This is how 'uncensored' or domain-specialized models get made. Tools like axolotl and trl make it accessible.
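A minimal sketch of what that looks like with trl's DPOTrainer. Argument names shift between trl versions, so treat this as an outline rather than copy-paste; the model name and file path are placeholders, and the dataset is assumed to have 'prompt', 'chosen', and 'rejected' columns (the format trl expects):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"   # any open-weights chat model works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Your own preference data: one record per comparison,
# with "prompt", "chosen", "rejected" fields.
train_dataset = load_dataset("json", data_files="my_preferences.jsonl", split="train")

config = DPOConfig(
    output_dir="qwen-dpo",
    beta=0.1,                        # strength of the stay-close-to-reference term
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # trl builds the frozen reference copy itself
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # older trl versions call this `tokenizer=`
)
trainer.train()
```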
Preference data is the real moat. Compute and architectures are increasingly commoditized. The differentiator between Claude / GPT / Gemini in 2026 is largely 'who has better preference data and a better feedback loop with users'.
When NOT to care about this
If you're using LLMs through an API and not training your own, you don't need to know the inner workings of RLHF vs DPO to ship product. What you do need: the awareness that the model's tone, refusal pattern, and quirks come from these processes, not from some immutable 'AI character'. If a model is too sycophantic for your use case, switch models — that's fixable.
Further reading
- AI alignment explainer — broader question of what 'aligned' means and why labs argue about it
- Fine-tuning vs prompt engineering — which problems are best solved by tuning the model vs adjusting prompts
- LoRA vs fine-tuning vs RAG — how DPO-style tuning fits into the broader landscape of customizing model behavior