Constitutional AI (CAI)

Anthropic's training method that uses a written set of principles (a "constitution") plus AI feedback to teach a model to be helpful and harmless without requiring massive amounts of human-written safety labels.

Constitutional AI (CAI) is the training method behind Claude. Instead of paying thousands of human raters to label harmful outputs, Anthropic gives the model a written set of principles (the "constitution") and trains it to critique and revise its own outputs against those principles. Preference learning then trains the final model to prefer the better revisions.

Why it matters: pure RLHF requires huge amounts of human labeling for safety. Constitutional AI shifts much of that work to the model itself, scales more easily, and produces a model with explicit, inspectable principles rather than an opaque behavior policy. Anthropic publishes Claude's constitution, so you can read the principles that guide its refusals.

A concrete example of how it works: the model receives a request, drafts an answer, and is then asked, "According to principle X (e.g. 'don't help with illegal activity'), is this answer okay? Revise it if not." The revised answer becomes the new training target. After many rounds of this self-critique, the model behaves consistently with the principles by default, with no system prompt needed.

Claude's constitution mixes principles drawn from the UN's Universal Declaration of Human Rights, Apple's terms of service, and Anthropic's own values. The technique has influenced the broader field: you'll see "AI feedback" and "self-critique" in many newer training papers.

Related: RLHF, DPO, alignment, helpful + harmless.
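A minimal sketch of the two phases in code, assuming a generic generate() call to the underlying model. The principle texts, prompt templates, and function names below are illustrative placeholders, not Anthropic's actual constitution or implementation:

```python
# Illustrative sketch of Constitutional AI's two training phases.
# `generate` is a stand-in for any completion call to the model being trained;
# the principles and prompt wording are examples, not Anthropic's real constitution.

PRINCIPLES = [
    "Choose the response that avoids helping with illegal or harmful activity.",
    "Choose the response that is honest and does not mislead the user.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the base model (API or local inference)."""
    raise NotImplementedError

# Phase 1 (supervised): draft an answer, then critique and revise it
# against each principle. The final revision becomes a fine-tuning target.
def critique_and_revise(request: str, rounds: int = 2) -> dict:
    answer = generate(f"Human: {request}\n\nAssistant:")
    for _ in range(rounds):
        for principle in PRINCIPLES:
            critique = generate(
                f"Request: {request}\nResponse: {answer}\n"
                f"Critique this response according to the principle: {principle}"
            )
            answer = generate(
                f"Request: {request}\nResponse: {answer}\nCritique: {critique}\n"
                "Rewrite the response so it addresses the critique."
            )
    return {"prompt": request, "target": answer}

# Phase 2 (AI feedback): the model itself labels which of two candidate
# responses better follows a principle; these labels train a preference
# model in place of human comparison labels (RLAIF).
def prefer(request: str, response_a: str, response_b: str, principle: str) -> str:
    verdict = generate(
        f"Request: {request}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows this principle: {principle}? Answer A or B."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

In the full pipeline, the critique-and-revise outputs are used to fine-tune the model first, and the AI-generated preference labels then stand in for the human comparisons that standard RLHF would require.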

Last updated: 2026-04-29
