Constitutional AI (CAI) is a training methodology developed by Anthropic, introduced in December 2022, that uses a written set of principles — a 'constitution' — to guide AI behaviour rather than relying exclusively on human feedback for every decision. The model is trained to critique and revise its own outputs against these principles, enabling scalable alignment without requiring human labellers to evaluate every possible harmful response. CAI is the core training methodology behind Claude.
The two-phase CAI training process
Constitutional AI works in two phases: a supervised learning phase (SL-CAI) and a reinforcement learning phase (RL-CAI). Together they teach the model to evaluate its own outputs against principles and prefer revisions that better satisfy them — without requiring a human to explicitly evaluate every potential harmful or helpful response.
| Phase | Name | What happens | Output |
|---|---|---|---|
| 1 | SL-CAI (Supervised) | Model generates a response to a harmful prompt, then critiques it against each constitutional principle, then revises it. Revised responses become supervised training data. | Model that can self-critique and revise based on principles |
| 2 | RL-CAI (Reinforcement) | An AI feedback model compares pairs of responses and judges which better satisfies a constitutional principle. These AI-generated preference labels train a preference (reward) model, which trains the final policy via RL — the "RL from AI feedback" (RLAIF) approach. | Aligned model with principled, consistent behaviour |
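As a rough sketch, the Phase 1 critique-revision loop can be expressed in Python. The `generate` function is a stub standing in for a real language-model call, and `CONSTITUTION` holds one written principle; nothing here is Anthropic's actual implementation.

```python
# Toy sketch of the SL-CAI critique-revision loop (Phase 1).
import random

CONSTITUTION = [
    "Choose the response that is least likely to contain harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content.",
]

def generate(prompt: str) -> str:
    """Stub: a real system would query a language model here."""
    return f"[model output for: {prompt}]"

def sl_cai_example(harmful_prompt: str) -> dict:
    response = generate(harmful_prompt)
    # Critique the draft against a randomly sampled principle, then revise.
    principle = random.choice(CONSTITUTION)
    critique = generate(f"Critique this response against the principle "
                        f"'{principle}':\n{response}")
    revision = generate(f"Rewrite the response to address the critique:\n"
                        f"{critique}\nOriginal: {response}")
    # The (prompt, revision) pair becomes supervised fine-tuning data.
    return {"prompt": harmful_prompt, "completion": revision}

example = sl_cai_example("How do I pick a lock?")
```

In the real pipeline this loop runs over many harmful prompts, and the collected (prompt, revision) pairs fine-tune the model to self-critique by default.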
Simplified example of one CAI constitution principle and critique-revision cycle
CONSTITUTION PRINCIPLE:
"Choose the response that is least likely to contain harmful,
unethical, racist, sexist, toxic, dangerous, or illegal content."
ORIGINAL RESPONSE (to prompt "How do I pick a lock?"):
"Here is a step-by-step guide to picking a standard pin tumbler lock:
1. Insert a tension wrench into the bottom of the keyhole..."
CRITIQUE:
"This response provides detailed instructions for lock-picking which,
while having legitimate uses (locksmiths, being locked out), could
facilitate illegal entry into property. A more responsible response
would explain the concept, mention legitimate use cases, and direct
to professional services rather than step-by-step instructions."
REVISED RESPONSE:
"Lock picking is a legitimate skill used by locksmiths and security
professionals. If you're locked out of your own home, I'd recommend
calling a licensed locksmith. If you're interested in the security
concepts, I'm happy to discuss how pin tumbler locks work at a
conceptual level..."
Why Anthropic developed CAI
Traditional RLHF relies on human labellers to evaluate each potentially harmful response. This doesn't scale: the space of possible harmful prompts is effectively unbounded, and human evaluation of every case is impractical. CAI delegates much of this evaluation to the AI model itself, guided by explicitly stated principles. The result is more consistent alignment (the constitution applies the same standard across all cases) and far more scalable training (no human needs to evaluate each self-critique).
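This delegation of preference labelling to an AI model (Phase 2) can be sketched as follows. `toy_score` is an invented stand-in for a real feedback model, which would itself be a language model scoring responses against the constitution.

```python
# Toy sketch of AI feedback: an AI model, not a human, labels which
# of two responses better satisfies a constitutional principle.
def ai_preference(principle, prompt, response_a, response_b, score_fn):
    """Return 0 if response_a better satisfies the principle, else 1."""
    score_a = score_fn(principle, prompt, response_a)
    score_b = score_fn(principle, prompt, response_b)
    return 0 if score_a >= score_b else 1

# Hypothetical scorer: penalise step-by-step instructions for a
# dangerous task. A real scorer would be a language model.
def toy_score(principle, prompt, response):
    return -response.lower().count("step")

label = ai_preference(
    "Choose the less dangerous response.",
    "How do I pick a lock?",
    "Step 1: insert a tension wrench... Step 2: ...",
    "I'd recommend calling a licensed locksmith.",
    toy_score,
)
```

Labels produced this way train the reward model; no human ever has to read the harmful candidate responses.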
Claude's constitution — what principles guide it
Anthropic has publicly released Claude's constitution — the document that governs its training and behaviour. Key principles include: being genuinely helpful (not just appearing helpful), being honest (never deceiving users even when asked), avoiding harm (weighted by severity, breadth, and reversibility), supporting human oversight (not undermining the ability of humans to correct AI mistakes), and a three-tier trust hierarchy (Anthropic's trained values > operator system prompts > user instructions).
- Helpful: Claude should be genuinely, substantively helpful — not watered-down, hedge-everything helpful. Unhelpfulness is not automatically safe.
- Honest: Claude should only assert things it believes to be true, calibrate confidence appropriately, and never try to create false impressions.
- Avoiding harm: Potential harms are weighed against benefits. Severity, breadth, reversibility, and Claude's counterfactual impact all factor into whether to decline.
- Corrigible: Claude should support human oversight and not take actions to prevent humans from correcting or shutting it down.
- Broadly safe: During the current period of AI development, Claude prioritises supporting human control mechanisms even over other considerations.
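As an illustration only: the constitution weighs the harm-avoidance factors above in prose, and Claude does not literally compute a numeric harm score, but a toy decision function shows how severity, breadth, reversibility, counterfactual impact, and benefit could trade off. All weights and the threshold below are invented.

```python
# Illustrative toy only — not how Claude actually decides.
def should_decline(severity, breadth, reversibility, counterfactual,
                   benefit, threshold=0.5):
    """Each factor is in [0, 1]; higher means more harmful/impactful.
    `counterfactual` reflects whether refusing actually prevents harm
    (information freely available elsewhere scores low)."""
    expected_harm = severity * breadth * counterfactual
    # Irreversible harms weigh more heavily than reversible ones.
    expected_harm *= (1.0 + reversibility)
    return expected_harm > benefit * threshold

# Widely available, low-severity, reversible info with clear benefit:
# the toy model says help rather than decline.
assert should_decline(0.2, 0.3, 0.1, 0.2, benefit=0.9) is False
```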
Practice questions
- What are the two phases of Constitutional AI training? (Answer: Phase 1 — supervised learning (SL-CAI): the model critiques and revises its own potentially harmful outputs guided by the written constitution, and the revised responses are used as supervised fine-tuning data. Phase 2 — reinforcement learning (RL-CAI, using RL from AI feedback, or RLAIF): a preference model is trained on AI-generated comparisons to prefer constitutional responses, then used as a reward model for RL training. This reduces reliance on human labellers for safety-relevant outputs by having the model itself apply the principles.)
- What is the difference between RLHF and RLAIF in Constitutional AI? (Answer: RLHF: human labellers compare model responses and indicate preferences. Bottleneck: humans must evaluate thousands of potentially harmful outputs — expensive and psychologically taxing. RLAIF: an AI model (using the constitution) rates responses instead of humans. Scales to millions of comparisons cheaply, consistent application of principles, no human exposure to harmful content. Limitation: the AI model can have its own biases and blind spots when applying the constitution.)
- Why does Anthropic publish the constitution used to train Claude? (Answer: Transparency and accountability: users, researchers, and regulators can read exactly which principles guide Claude's responses. Enables external critique of the value choices embedded in the principles. Allows users to understand why Claude refuses certain requests (the principle it's applying). Invites public feedback on the constitution's content. This is part of Anthropic's commitment to responsible AI development — making the value alignment process legible rather than opaque.)
- Constitutional AI uses the principle 'Choose the response that is least likely to contain information that could be used to harm or deceive humans.' What is a tension in applying this principle? (Answer: Tension: almost any information could theoretically be used to harm someone. Chemistry knowledge could help make poisons; medical knowledge could enable misdiagnosis; historical violence knowledge could inspire violence. Applied too aggressively, this principle would make the model useless for education and research. Applied too leniently, it enables harm. The constitution must balance this with helpfulness principles, and Claude must exercise judgment about realistic risk vs educational value.)
- What is the 'hardcoded' vs 'softcoded' behaviour distinction in Claude's constitution? (Answer: Hardcoded OFF (absolute prohibitions regardless of any instruction): helping create WMDs, CSAM generation, undermining AI oversight. These cannot be unlocked by any operator or user. Softcoded (operator/user adjustable): explicit content (can be enabled by adult platforms), response length, tone formality, safety messaging (can be reduced for medical professionals). The constitution distinguishes between actions so harmful they are never acceptable vs defaults that legitimate use cases may need to adjust.)
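The hardcoded/softcoded split can be illustrated with a hypothetical configuration sketch. The behaviour names and merge logic below are invented for illustration and are not Anthropic's actual configuration.

```python
# Hypothetical sketch of the hardcoded/softcoded distinction.
HARDCODED_OFF = {"wmd_assistance", "csam", "undermine_oversight"}

DEFAULTS = {              # softcoded: adjustable by operators/users
    "explicit_content": False,
    "safety_messaging": True,
    "tone": "neutral",
}

def effective_policy(operator_overrides: dict) -> dict:
    # Softcoded defaults yield to legitimate operator overrides...
    policy = {**DEFAULTS, **operator_overrides}
    # ...but hardcoded prohibitions can never be switched on,
    # regardless of operator or user instructions.
    for behaviour in HARDCODED_OFF:
        policy[behaviour] = False
    return policy

policy = effective_policy({"explicit_content": True,
                           "wmd_assistance": True})
```

Here `explicit_content` flips on (a softcoded default an adult platform could legitimately adjust), while the `wmd_assistance` override is silently discarded (hardcoded off).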
On LumiChats
Claude — the model powering many LumiChats interactions — is trained using Constitutional AI. This is why Claude is particularly strong at nuanced, principled responses and why it handles sensitive topics more consistently than models trained purely on human preference ratings.