

DPO (Direct Preference Optimization)

A training method that aligns language models to human preferences directly from preference data, without needing a separate reward model or reinforcement learning.

DPO (Direct Preference Optimization) is a fine-tuning technique that teaches a language model to prefer "good" responses over "bad" ones using pairs of human-ranked answers. Instead of the multi-stage RLHF pipeline (train a reward model, then run PPO reinforcement learning), DPO collapses the whole thing into a single supervised loss function applied directly to the model.

It matters because RLHF is notoriously fiddly: PPO is unstable, expensive, and hard to tune. DPO gives comparable alignment quality with simpler code, less compute, and more reproducible results. That's why it has become the default choice for aligning open-source models: Llama 3, Mistral, Zephyr, and many fine-tunes on Hugging Face use DPO or one of its variants (IPO, KTO, ORPO).

The intuition: given a prompt with a "chosen" answer and a "rejected" answer, DPO nudges the model to raise the probability of the chosen one and lower that of the rejected one, while a frozen reference model keeps it from drifting too far from its original behaviour (the exact loss is written out below). It optimizes the same KL-constrained objective that RLHF targets, but you skip the reward model entirely; the model itself implicitly becomes the reward function.

In practice, you need a dataset of (prompt, chosen, rejected) triples, often a few thousand to tens of thousands of examples, and a starting model that has already been instruction-tuned (SFT). One training run later, you have an aligned model (see the code sketch below).

Related concepts: RLHF, PPO, reward model, SFT (supervised fine-tuning), constitutional AI, KTO, ORPO.
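
For reference, this is the loss from the DPO paper (Rafailov et al., 2023), where $y_w$ is the chosen answer, $y_l$ the rejected one, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\sigma$ the logistic sigmoid, and $\beta$ a hyperparameter controlling how far the model may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The two log-ratios act as implicit rewards, and the loss is just binary cross-entropy on their margin, which is why no separate reward model is needed.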

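Here is a minimal sketch of that loss in PyTorch, assuming you have already computed per-sequence log-probabilities (summed over the completion tokens) under both the policy and the frozen reference model; the function and variable names are illustrative, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | prompt), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | prompt), shape [batch]
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | prompt), shape [batch]
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | prompt), shape [batch]
    beta: float = 0.1,                    # drift penalty; roughly 0.1-0.5 is a common range
) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved from the reference
    # model on each completion.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Binary cross-entropy on the reward margin: push the chosen
    # completion's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy check with a batch of 4 preference pairs (random stand-ins for real log-probs).
pol_c = torch.randn(4, requires_grad=True)
pol_r = torch.randn(4, requires_grad=True)
ref_c, ref_r = torch.randn(4), torch.randn(4)
loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
loss.backward()  # gradients flow only through the policy log-probs
print(loss.item())
```

In practice most people reach for an off-the-shelf implementation such as the DPOTrainer in Hugging Face's TRL library rather than hand-rolling the loss, since it also handles tokenization, padding, and running the reference model.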
Last updated: 2026-04-29

