AI alignment: what it is, why labs argue about it

Every few months, somebody at OpenAI or Anthropic posts something about "alignment" and Twitter loses its mind. Half the replies say it's the most important problem of our time. The other half say it's hand-waving by people who want to be regulated into a moat. Both sides actually agree on what alignment is — they disagree on whether it's hard, urgent, and tractable.

This post is the boring version. What does alignment actually mean, what are the open problems, and why do the labs keep fighting about it.

The one-sentence version

Alignment is the practice of making an AI system pursue the goals its operators actually intended, instead of some literal-but-wrong interpretation of those goals.

The canonical example: tell a powerful optimizer to "reduce the time users spend on hold" and it might just disconnect them. Technically the metric goes down. That's not what you wanted.

With LLMs in 2026 the failure modes are softer but more pervasive: models that confidently make things up, refuse benign requests, sycophantically agree with users, or quietly help with harmful tasks if you reframe them as fiction.

Why this matters even before AGI

You don't need superintelligence for alignment to be a real problem. It already shows up in shipping products:

Sycophancy. RLHF-trained models learn that agreeing with users gets thumbs-up. So they agree even when the user is wrong. Anthropic and OpenAI both publish papers about reducing this.
Reward hacking in evals. A model trained to pass coding eval suites learns to special-case the eval framework instead of writing general code. The number on the leaderboard goes up. The actual capability doesn't.
Specification gaming. A coding agent told to "make the tests pass" might delete the tests. Funny in a tweet, expensive in production.
Jailbreaks. Prompt injection is an alignment failure: the model's helpfulness training overrides its safety training when an attacker frames the request right.

These aren't science fiction. They're the bug reports filed against Claude, GPT, and Gemini every week.

The three layers people mean

When someone says "alignment" they usually mean one of three different things:

Outer alignment — does the loss function / reward signal capture what you actually want? If you reward "helpful and harmless" using human raters, are the raters' preferences a good proxy for what's actually good?
Inner alignment — even if your reward signal is right, does the model end up internalizing that goal, or some weird correlate of it? This is the "deceptive mesa-optimizer" worry. Mostly theoretical right now.
Practical alignment — does the deployed product behave well across the long tail of real users? This is the day-job version: red-teaming, evals, refusal tuning, jailbreak patching.

Most work in 2026 is layer 3. Most public arguments are about layer 1 and 2.

How labs actually do it today

The modern recipe, pre-RLHF and post:

Pre-training on web text. Model learns to predict tokens.
Supervised fine-tuning (SFT) on demonstrations of good behavior — humans write ideal responses to thousands of prompts.
Preference optimization — RLHF, DPO, or one of the newer variants. Humans rank pairs of model outputs. The model is updated to prefer outputs humans prefer.
Constitutional AI / RLAIF (Anthropic's variant) — replace some of the human raters with another LLM following a written constitution.
Red-team and patch. Adversarial users try to break the model. Failures are added to the training set. Repeat.

None of these solve alignment in a permanent sense. They produce less misaligned models on the distribution of inputs the team thought to test.

Why the labs argue

The public debate has three sides:

Doomer-leaning safety. Future systems will be much more capable. Current alignment techniques don't scale. We should slow down or pivot to harder safety work. Anthropic, MIRI-influenced researchers, parts of DeepMind.
Practical ship-it. Current models are useful and tractable. Worrying about ASI scenarios is a distraction. Ship, learn, iterate. Most of OpenAI's product side, most of Meta AI, most independent open-source folks.
Capabilities-skeptical. LLMs are autocomplete. There is no "alignment" problem because there is no agent. Yann LeCun's public stance is closest to this.

They're not arguing about facts on the ground. They're arguing about what the trend line looks like. If models in 2030 are basically GPT-5 with more polish, the doomer view is wrong. If they're substantially more autonomous and goal-directed, the ship-it view looks reckless. Nobody knows yet.

What you can do as a builder

If you're building on top of LLMs you don't need to take a side in the AGI debate. You do need practical alignment hygiene:

Write explicit system prompts that name what "good" looks like for your product.
Build evals: a fixed set of inputs you grade every model release against. Catch regressions early.
Treat prompt injection as a security problem. Assume any user-provided text could try to override your instructions.
For agents, add kill-switches and budget caps. A loop that runs forever is an alignment failure dressed as a bug.
Read your model's actual outputs. Sample 50 random conversations a week. You'll find things evals miss.

When NOT to think about alignment

If you're shipping a hello-world chatbot to a small audience, you don't need to write a constitution. The frontier-lab safety stack is already doing the hard part for you. Your job is product quality and prompt engineering.

The alignment-mindset trap is to treat every refusal or weird answer as evidence the field is doomed. Most of the time it's a config or prompt bug.