RLHF (Reinforcement Learning from Human Feedback)

A training technique that uses human preference judgments to teach language models which responses are helpful, honest, and safe.

RLHF stands for Reinforcement Learning from Human Feedback. It's a training technique where humans rank or compare model outputs, and those preferences are used to fine-tune the model so it generates responses people actually prefer. It's the step that turned raw language models like GPT-3 into chat assistants like ChatGPT.

It matters because pretraining alone produces a model that just predicts the next token from internet text — one that doesn't know it should be helpful, refuse harmful requests, or follow instructions. RLHF is how labs align models with human intent and safety norms, and most modern chat models (ChatGPT, Claude, Gemini) went through some form of it.

The process typically has three stages:

1. Supervised fine-tuning (SFT) on example dialogues.
2. Training a separate "reward model" on human-labeled comparisons of two responses ("which answer is better?"). A toy sketch of this pairwise loss appears below.
3. Using that reward model as the signal in a reinforcement learning loop (often PPO) to nudge the main model toward higher-rated outputs.

Think of it like teaching a chef by having diners rate two dishes side by side, then having the chef cook more in the style of the winners.

RLHF has known weaknesses: it can encourage sycophancy (telling users what they want to hear), it is vulnerable to reward hacking, and collecting quality human labels is expensive. Newer variants try to address these problems: DPO (Direct Preference Optimization) skips the explicit reward model (see the second sketch below), and Anthropic's Constitutional AI / RLAIF replaces some human feedback with AI feedback guided by written principles.

Related concepts: fine-tuning, reward model, PPO, DPO, Constitutional AI, alignment.
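To make stage 2 concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) loss used to train a reward model, assuming PyTorch. The tensor names and toy scores are illustrative placeholders, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the human-preferred
    # response higher than the rejected one in each comparison pair.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scalar scores a reward model might assign to a batch of
# two comparison pairs (one score per response).
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.4, 0.9])
print(pairwise_reward_loss(r_chosen, r_rejected))  # scalar loss
```

In stage 3, the scalar reward this model produces (usually combined with a KL penalty that keeps the policy close to the SFT model) becomes the signal the PPO loop tries to maximize.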

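For comparison, here is a sketch of the DPO loss mentioned above, again assuming PyTorch. The arguments are placeholders for summed per-token log-probabilities of each response under the policy being trained and under a frozen reference model (typically the SFT model); beta = 0.1 is just a common illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO treats beta * log(pi / pi_ref) as an implicit reward, so the
    # policy is optimized on preference pairs directly, with no separate
    # reward model and no RL loop.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()
```

Note the structural similarity to the reward-model loss: DPO folds the reward model into the policy itself, which is why it can skip the explicit reward-modeling stage.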
Last updated: 2026-04-29
