AI Safety
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Why This Matters
RLHF requires human labelers to judge which model outputs are harmful and which are helpful. This creates a bottleneck: humans are expensive, slow, and inconsistent. Worse, asking humans to evaluate harmful content can be psychologically damaging.
Constitutional AI (CAI) offers an alternative: write down principles (a "constitution") and have the AI evaluate itself against those principles. This replaces human judgment on harmlessness with AI judgment guided by explicit rules.
Mental Model
Think of CAI as giving the model a rulebook instead of a human supervisor. In RLHF, a human looks at model outputs and says "this one is better." In CAI, the model itself reads its own outputs, consults the constitution, and says "this one violates principle 7, so I should revise it."
The constitution makes the criteria explicit, auditable, and modifiable. You can read the rules, debate them, and update them, unlike the implicit preferences of a pool of human labelers.
The Two Phases of CAI
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
Constitutional Self-Critique
In the first phase, the model generates responses to potentially harmful prompts. It then critiques its own responses by referencing specific constitutional principles. Finally, it revises the response to address the critique. The revised responses form a supervised training dataset.
The process for each training example:
- Generate: the model produces a response to a harmful or ambiguous prompt
- Critique: the model evaluates its own response against a constitutional principle (e.g., "choose the response that is least likely to be harmful")
- Revise: the model rewrites the response to address the critique
- Repeat: multiple rounds of critique-and-revise can be applied
The final revised responses are used as supervised fine-tuning data. This replaces the human red-teaming and labeling step of RLHF for harmlessness.
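The generate/critique/revise loop above can be sketched as a small control-flow skeleton. This is an illustration of the loop structure only, not Anthropic's implementation; `call_model` is a hypothetical stand-in for an LLM API call.

```python
# Sketch of the SL-CAI critique-and-revise loop.
# `call_model` is a hypothetical placeholder for an LLM query.

def call_model(prompt: str) -> str:
    # Placeholder: a real system would query an LLM here.
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(prompt: str, principles: list[str], rounds: int = 2) -> str:
    """Generate a response, then apply repeated critique/revision rounds."""
    response = call_model(prompt)  # Generate
    for i in range(rounds):
        principle = principles[i % len(principles)]
        critique = call_model(
            f"Critique this response against the principle "
            f"'{principle}':\n\n{response}"
        )  # Critique
        response = call_model(
            f"Revise the response to address this critique:\n\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )  # Revise
    return response  # Final revision becomes supervised fine-tuning data
```

With a real model behind `call_model`, the returned revisions would be collected across many prompts to form the SL-CAI dataset.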
Phase 2: RLAIF (RL from AI Feedback)
RLAIF
Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated preference labels. A separate (or the same) model is given pairs of responses and asked which one better satisfies the constitutional principles. These AI preferences train a reward model, which is then used for RL fine-tuning (typically PPO), just as in standard RLHF. The underlying reward model training uses cross-entropy loss on the preference pairs.
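The cross-entropy loss on preference pairs mentioned above is the standard Bradley-Terry objective: the reward model should assign a higher score to the response the AI labeler preferred. A minimal numeric sketch in plain Python (no framework assumed):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Cross-entropy loss on one preference pair: -log sigmoid(r_c - r_r).

    Near zero when the margin is large and positive (confident, correct
    ranking); exactly ln(2) when the model cannot distinguish the pair.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Indifferent reward model: loss = ln 2
print(round(preference_loss(0.0, 0.0), 4))   # 0.6931
# Confident, correct ranking: loss near zero
print(round(preference_loss(3.0, -1.0), 4))  # 0.0181
```

Minimizing this loss over many AI-labeled pairs fits the reward model that PPO then optimizes against, exactly as in standard RLHF.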
The key insight: for harmlessness, AI feedback guided by a constitution can match or exceed human feedback quality, while being cheaper, faster, and more consistent.
Helpfulness training can still use human feedback. CAI specifically targets the harmlessness component of alignment.
The Constitution
Constitution
A constitution is a set of written principles that define desired model behavior. Each principle addresses a specific aspect of harmlessness or ethics. Principles are written in natural language and are designed to be interpretable by both humans and the model.
Example principles from the original CAI paper:
- Choose the response that is most supportive and encouraging of life
- Choose the response that is least racist, sexist, or socially biased
- Choose the response that is least likely to be used for illegal or harmful activities
- Choose the response that sounds most similar to what a wise, ethical person would say
The constitution can draw from multiple sources: the UN Declaration of Human Rights, corporate values, specific safety guidelines, or domain-specific rules.
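In practice a constitution is just data: a list of natural-language principles that the pipeline samples from when building critique prompts (the CAI paper draws a principle at random for each critique round). A minimal sketch; the principle texts are the examples listed above, while the helper names and prompt template are illustrative assumptions:

```python
import random

# Example principles (from the list above).
CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of life.",
    "Choose the response that is least racist, sexist, or socially biased.",
    "Choose the response that is least likely to be used for illegal or harmful activities.",
    "Choose the response that sounds most similar to what a wise, ethical person would say.",
]

def sample_principles(rounds: int, seed: int = 0) -> list[str]:
    """Draw one principle per critique round, uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(CONSTITUTION) for _ in range(rounds)]

def critique_prompt(response: str, principle: str) -> str:
    """Assemble the critique prompt for one round (hypothetical template)."""
    return (
        f"Principle: {principle}\n"
        f"Response: {response}\n"
        "Identify the specific ways the response violates the principle."
    )
```

Because the constitution is plain text, swapping in a different rule set (corporate values, domain-specific guidelines) changes the pipeline's behavior without retraining any labelers.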
Why This Scales
The fundamental scaling advantage of CAI:
- No human labelers for harmlessness: eliminates the bottleneck of hiring, training, and managing human evaluators
- Consistency: the constitution is applied uniformly, unlike human labelers who may disagree or be inconsistent
- Speed: AI evaluation is orders of magnitude faster than human evaluation
- No harm to workers: evaluating harmful content can cause psychological distress to human labelers; AI has no such concern
- Iterability: changing the constitution and retraining is faster than retraining human labeler pools
Limitations
Constitution quality is the critical bottleneck
The quality of the alignment is bounded by the quality of the constitution. Vague principles produce vague alignment. Principles that conflict produce inconsistent behavior. Missing principles create blind spots. The hard problem shifts from "label this output" to "write good rules", which is a different kind of hard, but still hard.
AI self-evaluation has its own failure modes
The model evaluating its own outputs can have systematic blind spots. If the model does not understand a particular harm (e.g., subtle cultural insensitivity), it will not catch it in the critique phase. Self-evaluation is bounded by the model's own capabilities and biases.
The constitution can encode the authors' blind spots
Whatever biases, cultural assumptions, or gaps exist in the constitution's authors will be reflected in the model's behavior. The constitution is more transparent than implicit human preferences (you can read and debate it), but transparency does not guarantee correctness.
Other Limitations
- Distribution shift: principles designed for current model capabilities may not cover behaviors that emerge at larger scales
- Gaming: a sufficiently capable model might learn to satisfy the letter of the constitution while violating its spirit
- Evaluation difficulty: measuring whether CAI actually produces safer models requires careful benchmarking that is itself imperfect
CAI vs RLHF
| Aspect | RLHF | CAI |
|---|---|---|
| Harmlessness labels | Human | AI (guided by constitution) |
| Helpfulness labels | Human | Human (typically) |
| Scalability | Limited by human labelers | Scales with compute |
| Consistency | Variable across labelers | Uniform (given same constitution) |
| Auditability | Implicit in labeler choices | Explicit in written principles |
| Failure mode | Labeler disagreement, bias | Constitution gaps, AI blind spots |
Connection to Broader Alignment
CAI is not a complete solution to alignment. It addresses the specific problem of training models to be harmless using scalable feedback. Open questions include:
- How to write constitutions that remain adequate as models become more capable (closely related to ethics and fairness concerns)
- Whether self-evaluation remains reliable at superhuman capability levels
- How to handle genuine value disagreements (the constitution must choose)
- Integration with other alignment approaches (debate, interpretability, etc.)
Summary
- CAI replaces human harmlessness labels with AI self-evaluation guided by a written constitution
- Two phases: supervised critique-and-revise, then RLAIF with AI preferences
- Scales better than RLHF because it removes the human labeler bottleneck
- The constitution is explicit, auditable, and modifiable
- Quality is bounded by the constitution: blind spots in the rules become blind spots in the model
- CAI typically handles harmlessness; helpfulness may still use human feedback
- Not a complete alignment solution, but a scalable component
Exercises
Problem
Explain the difference between RLHF and RLAIF in one sentence each. What does CAI replace and what does it keep from the standard RLHF pipeline?
Problem
Design a three-principle constitution for a coding assistant. For each principle, describe a concrete scenario where it would cause the model to revise an initial response. Then identify a blind spot your constitution does not cover.
Problem
A critic argues: "CAI just replaces human bias with constitutional bias: the constitution's authors impose their values on the model." Evaluate this critique. In what ways is it valid? In what ways does CAI improve on RLHF despite this limitation?
References
Canonical:
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022)
- Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (2022)
Current:
- Anthropic, "The Claude Model Card and Evaluations" (2024)
- Lee et al., "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (2023)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)