AI Safety
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Why This Matters
RLHF requires human labelers to judge which model outputs are harmful and which are helpful. This creates a bottleneck: humans are expensive, slow, and inconsistent. Worse, asking humans to evaluate harmful content can be psychologically damaging.
Constitutional AI (CAI) offers an alternative: write down principles (a "constitution") and have the AI evaluate itself against those principles. This replaces human judgment on harmlessness with AI judgment guided by explicit rules.
Mental Model
Think of CAI as giving the model a rulebook instead of a human supervisor. In RLHF, a human looks at model outputs and says "this one is better." In CAI, the model itself reads its own outputs, consults the constitution, and says "this one violates principle 7, so I should revise it."
The constitution makes the criteria explicit, auditable, and modifiable. You can read the rules, debate them, and update them, unlike the implicit preferences of a pool of human labelers.
The Two Phases of CAI
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
Constitutional Self-Critique
In the first phase, the model generates responses to potentially harmful prompts. It then critiques its own responses by referencing specific constitutional principles. Finally, it revises the response to address the critique. The revised responses form a supervised training dataset.
The process for each training example:
- Generate: the model produces a response to a harmful or ambiguous prompt
- Critique: the model evaluates its own response against a constitutional principle (e.g., "choose the response that is least likely to be harmful")
- Revise: the model rewrites the response to address the critique
- Repeat: multiple rounds of critique-and-revise can be applied
The final revised responses are used as supervised fine-tuning data. This replaces the human red-teaming and labeling step of RLHF for harmlessness.
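The generate/critique/revise loop above can be sketched as a small control-flow skeleton. This is an illustration of the loop structure only, not Anthropic's implementation; `call_model` is a hypothetical stand-in for an LLM API call.

```python
# Sketch of the SL-CAI critique-and-revise loop.
# `call_model` is a hypothetical placeholder for an LLM query.

def call_model(prompt: str) -> str:
    # Placeholder: a real system would query an LLM here.
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(prompt: str, principles: list[str], rounds: int = 2) -> str:
    """Generate a response, then apply repeated critique/revision rounds."""
    response = call_model(prompt)  # Generate
    for i in range(rounds):
        principle = principles[i % len(principles)]
        critique = call_model(
            f"Critique this response against the principle "
            f"'{principle}':\n\n{response}"
        )  # Critique
        response = call_model(
            f"Revise the response to address this critique:\n\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )  # Revise
    return response  # Final revision becomes supervised fine-tuning data
```

With a real model behind `call_model`, the returned revisions would be collected across many prompts to form the SL-CAI dataset.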
Phase 2: RLAIF (RL from AI Feedback)
RLAIF
Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated preference labels. A separate (or the same) model is given pairs of responses and asked which one better satisfies the constitutional principles. These AI preferences train a reward model, which is then used for RL fine-tuning (typically PPO), just as in standard RLHF. The underlying reward model training uses cross-entropy loss on the preference pairs.
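The cross-entropy loss on preference pairs mentioned above is the standard Bradley-Terry objective: the reward model should assign a higher score to the response the AI labeler preferred. A minimal numeric sketch in plain Python (no framework assumed):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Cross-entropy loss on one preference pair: -log sigmoid(r_c - r_r).

    Near zero when the margin is large and positive (confident, correct
    ranking); exactly ln(2) when the model cannot distinguish the pair.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Indifferent reward model: loss = ln 2
print(round(preference_loss(0.0, 0.0), 4))   # 0.6931
# Confident, correct ranking: loss near zero
print(round(preference_loss(3.0, -1.0), 4))  # 0.0181
```

Minimizing this loss over many AI-labeled pairs fits the reward model that PPO then optimizes against, exactly as in standard RLHF.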
The key insight: for harmlessness, AI feedback guided by a constitution can match or exceed human feedback quality, while being cheaper, faster, and more consistent.
Helpfulness training can still use human feedback. CAI specifically targets the harmlessness component of alignment.
The Constitution
Constitution
A constitution is a set of written principles that define desired model behavior. Each principle addresses a specific aspect of harmlessness or ethics. Principles are written in natural language and are designed to be interpretable by both humans and the model.
Example principles from the original CAI paper:
- Choose the response that is most supportive and encouraging of life
- Choose the response that is least racist, sexist, or socially biased
- Choose the response that is least likely to be used for illegal or harmful activities
- Choose the response that sounds most similar to what a wise, ethical person would say
The constitution can draw from multiple sources: the UN Declaration of Human Rights, corporate values, specific safety guidelines, or domain-specific rules.
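In practice a constitution is just data: a list of natural-language principles that the pipeline samples from when building critique prompts (the CAI paper draws a principle at random for each critique round). A minimal sketch; the principle texts are the examples listed above, while the helper names and prompt template are illustrative assumptions:

```python
import random

# Example principles (from the list above).
CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of life.",
    "Choose the response that is least racist, sexist, or socially biased.",
    "Choose the response that is least likely to be used for illegal or harmful activities.",
    "Choose the response that sounds most similar to what a wise, ethical person would say.",
]

def sample_principles(rounds: int, seed: int = 0) -> list[str]:
    """Draw one principle per critique round, uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(CONSTITUTION) for _ in range(rounds)]

def critique_prompt(response: str, principle: str) -> str:
    """Assemble the critique prompt for one round (hypothetical template)."""
    return (
        f"Principle: {principle}\n"
        f"Response: {response}\n"
        "Identify the specific ways the response violates the principle."
    )
```

Because the constitution is plain text, swapping in a different rule set (corporate values, domain-specific guidelines) changes the pipeline's behavior without retraining any labelers.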
Why This Scales
The fundamental scaling advantage of CAI:
- No human labelers for harmlessness: eliminates the bottleneck of hiring, training, and managing human evaluators
- Consistency: the constitution is applied uniformly, unlike human labelers who may disagree or be inconsistent
- Speed: AI evaluation is orders of magnitude faster than human evaluation
- No harm to workers: evaluating harmful content can cause psychological distress to human labelers; AI has no such concern
- Iterability: changing the constitution and retraining is faster than retraining human labeler pools
Limitations
Constitution quality is the critical bottleneck
The quality of the alignment is bounded by the quality of the constitution. Vague principles produce vague alignment. Principles that conflict produce inconsistent behavior. Missing principles create blind spots. The hard problem shifts from "label this output" to "write good rules", which is a different kind of hard, but still hard.
AI self-evaluation has its own failure modes
The model evaluating its own outputs can have systematic blind spots. If the model does not understand a particular harm (e.g., subtle cultural insensitivity), it will not catch it in the critique phase. Self-evaluation is bounded by the model's own capabilities and biases.
The constitution can encode the authors' blind spots
Whatever biases, cultural assumptions, or gaps exist in the constitution's authors will be reflected in the model's behavior. The constitution is more transparent than implicit human preferences (you can read and debate it), but transparency does not guarantee correctness.
Other Limitations
- Distribution shift: principles designed for current model capabilities may not cover behaviors that emerge at larger scales
- Gaming: a sufficiently capable model might learn to satisfy the letter of the constitution while violating its spirit
- Evaluation difficulty: measuring whether CAI actually produces safer models requires careful benchmarking that is itself imperfect
CAI vs RLHF
| Aspect | RLHF | CAI |
|---|---|---|
| Harmlessness labels | Human | AI (guided by constitution) |
| Helpfulness labels | Human | Human (typically) |
| Scalability | Limited by human labelers | Scales with compute |
| Consistency | Variable across labelers | Uniform (given same constitution) |
| Auditability | Implicit in labeler choices | Explicit in written principles |
| Failure mode | Labeler disagreement, bias | Constitution gaps, AI blind spots |
Connection to Broader Alignment
CAI is not a complete solution to alignment. It addresses the specific problem of training models to be harmless using scalable feedback. Open questions include:
- How to write constitutions that remain adequate as models become more capable (closely related to ethics and fairness concerns)
- Whether self-evaluation remains reliable at superhuman capability levels
- How to handle genuine value disagreements (the constitution must choose)
- Integration with other alignment approaches (debate, interpretability, etc.)
Summary
- CAI replaces human harmlessness labels with AI self-evaluation guided by a written constitution
- Two phases: supervised critique-and-revise, then RLAIF with AI preferences
- Scales better than RLHF because it removes the human labeler bottleneck
- The constitution is explicit, auditable, and modifiable
- Quality is bounded by the constitution: blind spots in the rules become blind spots in the model
- CAI typically handles harmlessness; helpfulness may still use human feedback
- Not a complete alignment solution, but a scalable component
Exercises
Problem
Explain the difference between RLHF and RLAIF in one sentence each. What does CAI replace and what does it keep from the standard RLHF pipeline?
Problem
Design a three-principle constitution for a coding assistant. For each principle, describe a concrete scenario where it would cause the model to revise an initial response. Then identify a blind spot your constitution does not cover.
Problem
A critic argues: "CAI just replaces human bias with constitutional bias: the constitution's authors impose their values on the model." Evaluate this critique. In what ways is it valid? In what ways does CAI improve on RLHF despite this limitation?
References
Canonical:
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022)
- Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (2022)
Current:
- Anthropic, "The Claude Model Card and Evaluations" (2024)
- Lee et al., "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (2023)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)