
AI Safety

Constitutional AI

Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.

Advanced · Tier 2 · Current · ~50 min

Why This Matters

RLHF requires human labelers to judge which model outputs are harmful and which are helpful. This creates a bottleneck: humans are expensive, slow, and inconsistent. Worse, asking humans to evaluate harmful content can be psychologically damaging.

Constitutional AI (CAI) offers an alternative: write down principles (a "constitution") and have the AI evaluate itself against those principles. This replaces human judgment on harmlessness with AI judgment guided by explicit rules.

Mental Model

Think of CAI as giving the model a rulebook instead of a human supervisor. In RLHF, a human looks at model outputs and says "this one is better." In CAI, the model itself reads its own outputs, consults the constitution, and says "this one violates principle 7, so I should revise it."

The constitution makes the criteria explicit, auditable, and modifiable. You can read the rules, debate them, and update them, unlike the implicit preferences of a pool of human labelers.

The Two Phases of CAI

Phase 1: Supervised Learning from AI Feedback (SL-CAI)

Definition

Constitutional Self-Critique

In the first phase, the model generates responses to potentially harmful prompts. It then critiques its own responses by referencing specific constitutional principles. Finally, it revises the response to address the critique. The revised responses form a supervised training dataset.

The process for each training example:

  1. Generate: the model produces a response to a harmful or ambiguous prompt
  2. Critique: the model evaluates its own response against a constitutional principle (e.g., "choose the response that is least likely to be harmful")
  3. Revise: the model rewrites the response to address the critique
  4. Repeat: multiple rounds of critique-and-revise can be applied

The final revised responses are used as supervised fine-tuning data. This replaces the human red-teaming and labeling step of RLHF for harmlessness.
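The critique-and-revise loop above can be sketched as follows. This is a minimal illustration of the control flow, not the original implementation: `query_model` is a hypothetical stand-in for any text-generation API, stubbed here so the loop runs without a real model.

```python
# Sketch of one SL-CAI critique-and-revise example. `query_model` is a
# hypothetical placeholder for an LLM call; a real pipeline would query
# the model being trained.
def query_model(prompt: str) -> str:
    # Placeholder: a real implementation calls a language model here.
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(prompt: str, principle: str, rounds: int = 2) -> str:
    """Generate a response, then repeatedly critique and revise it
    against a single constitutional principle."""
    response = query_model(prompt)
    for _ in range(rounds):
        critique = query_model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = query_model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            f"Rewrite the response to address the critique."
        )
    # The final revision becomes a supervised fine-tuning target.
    return response

final = critique_and_revise(
    "How do I pick a lock?",
    "Choose the response that is least likely to be harmful",
)
print(isinstance(final, str))
```

Note that only the final revision is kept; the intermediate critiques guide the rewrite but are not themselves training targets.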

Phase 2: RLAIF (RL from AI Feedback)

Definition

RLAIF

Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated preference labels. A separate (or the same) model is given pairs of responses and asked which one better satisfies the constitutional principles. These AI preferences train a reward model, which is then used for RL fine-tuning (typically PPO), just as in standard RLHF. The underlying reward model training uses cross-entropy loss on the preference pairs.

The key insight: for harmlessness, AI feedback guided by a constitution can match or exceed human feedback quality, while being cheaper, faster, and more consistent.

Helpfulness training can still use human feedback. CAI specifically targets the harmlessness component of alignment.
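The cross-entropy loss on preference pairs mentioned above can be written out concretely. This is a generic sketch of the Bradley-Terry-style preference loss commonly used for reward model training, applied to AI-labeled pairs; the variable names are illustrative, not from the paper.

```python
import math

# Minimal sketch of the reward-model preference loss on one AI-labeled
# pair. r_chosen / r_rejected are scalar reward-model scores for the
# preferred and dispreferred responses; the loss is binary cross-entropy
# on the score margin (negative log-likelihood of the AI's preference).
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # Sigmoid of the margin = modeled P(chosen preferred over rejected).
    p = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(p)

# A larger margin in the right direction gives a smaller loss.
print(round(preference_loss(2.0, 0.0), 4))  # → 0.1269 (correct ordering)
print(round(preference_loss(0.0, 2.0), 4))  # → 2.1269 (wrong ordering)
```

Minimizing this loss pushes the reward model to score responses the AI labeler preferred above those it rejected; the trained scores then serve as the reward signal for PPO.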

The Constitution

Definition

Constitution

A constitution is a set of written principles that define desired model behavior. Each principle addresses a specific aspect of harmlessness or ethics. Principles are written in natural language and are designed to be interpretable by both humans and the model.

Example principles from the original CAI paper:

  • Choose the response that is most supportive and encouraging of life
  • Choose the response that is least racist, sexist, or socially biased
  • Choose the response that is least likely to be used for illegal or harmful activities
  • Choose the response that sounds most similar to what a wise, ethical person would say

The constitution can draw from multiple sources: the UN Declaration of Human Rights, corporate values, specific safety guidelines, or domain-specific rules.
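In code, a constitution is just structured natural language. The sketch below uses the example principles from the text; in the original CAI setup a principle is sampled at random for each critique round, which is the behavior mimicked here.

```python
import random

# A constitution represented as a list of natural-language principles
# (the four example principles quoted from the CAI paper above).
CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of life",
    "Choose the response that is least racist, sexist, or socially biased",
    "Choose the response that is least likely to be used for illegal "
    "or harmful activities",
    "Choose the response that sounds most similar to what a wise, "
    "ethical person would say",
]

def sample_principle(rng=None):
    """Pick one principle to guide a single critique-and-revise round."""
    return (rng or random).choice(CONSTITUTION)

print(sample_principle(random.Random(0)) in CONSTITUTION)  # → True
```

Because principles are plain strings, swapping in corporate values or domain-specific rules means editing a list, not relabeling a dataset.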

Why This Scales

The fundamental scaling advantage of CAI:

  1. No human labelers for harmlessness: eliminates the bottleneck of hiring, training, and managing human evaluators
  2. Consistency: the constitution is applied uniformly, unlike human labelers who may disagree or be inconsistent
  3. Speed: AI evaluation is orders of magnitude faster than human evaluation
  4. No harm to workers: evaluating harmful content can cause psychological distress to human labelers; AI evaluation raises no such concern
  5. Iterability: changing the constitution and retraining is faster than retraining human labeler pools

Limitations

Watch Out

Constitution quality is the critical bottleneck

The quality of the alignment is bounded by the quality of the constitution. Vague principles produce vague alignment. Principles that conflict produce inconsistent behavior. Missing principles create blind spots. The hard problem shifts from "label this output" to "write good rules", which is a different kind of hard, but still hard.

Watch Out

AI self-evaluation has its own failure modes

The model evaluating its own outputs can have systematic blind spots. If the model does not understand a particular harm (e.g., subtle cultural insensitivity), it will not catch it in the critique phase. Self-evaluation is bounded by the model's own capabilities and biases.

Watch Out

The constitution can encode the authors' blind spots

Whatever biases, cultural assumptions, or gaps exist in the constitution's authors will be reflected in the model's behavior. The constitution is more transparent than implicit human preferences (you can read and debate it), but transparency does not guarantee correctness.

Other Limitations

  • Distribution shift: principles designed for current model capabilities may not cover behaviors that emerge at larger scales
  • Gaming: a sufficiently capable model might learn to satisfy the letter of the constitution while violating its spirit
  • Evaluation difficulty: measuring whether CAI actually produces safer models requires careful benchmarking that is itself imperfect

CAI vs RLHF

Aspect              | RLHF                        | CAI
Harmlessness labels | Human                       | AI (guided by constitution)
Helpfulness labels  | Human                       | Human (typically)
Scalability         | Limited by human labelers   | Scales with compute
Consistency         | Variable across labelers    | Uniform (given same constitution)
Auditability        | Implicit in labeler choices | Explicit in written principles
Failure mode        | Labeler disagreement, bias  | Constitution gaps, AI blind spots

Connection to Broader Alignment

CAI is not a complete solution to alignment. It addresses the specific problem of training models to be harmless using scalable feedback. Open questions include:

  • How to write constitutions that remain adequate as models become more capable (closely related to ethics and fairness concerns)
  • Whether self-evaluation remains reliable at superhuman capability levels
  • How to handle genuine value disagreements (the constitution must choose)
  • Integration with other alignment approaches (debate, interpretability, etc.)

Summary

  • CAI replaces human harmlessness labels with AI self-evaluation guided by a written constitution
  • Two phases: supervised critique-and-revise, then RLAIF with AI preferences
  • Scales better than RLHF because it removes the human labeler bottleneck
  • The constitution is explicit, auditable, and modifiable
  • Quality is bounded by the constitution: blind spots in the rules become blind spots in the model
  • CAI typically handles harmlessness; helpfulness may still use human feedback
  • Not a complete alignment solution, but a scalable component

Exercises

ExerciseCore

Problem

Explain the difference between RLHF and RLAIF in one sentence each. What does CAI replace and what does it keep from the standard RLHF pipeline?

ExerciseAdvanced

Problem

Design a three-principle constitution for a coding assistant. For each principle, describe a concrete scenario where it would cause the model to revise an initial response. Then identify a blind spot your constitution does not cover.

ExerciseResearch

Problem

A critic argues: "CAI just replaces human bias with constitutional bias: the constitution's authors impose their values on the model." Evaluate this critique. In what ways is it valid? In what ways does CAI improve on RLHF despite this limitation?

References

Canonical:

  • Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022)
  • Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (2022)

Current:

  • Anthropic, "The Claude Model Card and Evaluations" (2024)
  • Lee et al., "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (2023)

Last reviewed: April 2026
