Constitutional AI
Created: 2026-03-03 10:18
#quicknote
Constitutional AI (CAI) is Anthropic's approach to alignment that replaces human feedback with AI-generated feedback guided by explicit principles — a "constitution." Instead of relying on human annotators for preference data, CAI has the model critique and revise its own outputs against written rules, then trains a preference model from AI evaluations. This enables Reinforcement Learning from AI Feedback (RLAIF), dramatically reducing the cost and scaling constraints of RLHF - Reinforcement Learning from Human Feedback. Part of the LLM Training and Alignment Evolution.
- Phase 1 — Supervised self-critique: The model generates responses, critiques them against constitutional principles (e.g., "be helpful," "avoid harm," "be honest"), and produces revised versions. The model is fine-tuned on these improved outputs
- Phase 2 — RLAIF: The model evaluates pairs of its own outputs based on constitutional adherence, generating preference data. A preference model is trained on these AI judgements and used as the reward signal for PPO training
- Key advantage: Alignment becomes transparent (principles are explicit, auditable) and scalable (AI feedback is cheap). Achieves 88% harmless rate vs RLHF's 76% without sacrificing helpfulness
- Collective Constitutional AI (2024) — sources principles from diverse public stakeholders rather than a single design team, reducing bias across nine social dimensions while maintaining performance
- Public Constitutional AI (2025) — renders AI principles transparent and supports democratic legitimacy in deployment
Connection to Other Methods
CAI is complementary to other techniques: it can replace the human annotation step in RLHF - Reinforcement Learning from Human Feedback, the preference data collection for DPO - Direct Preference Optimization, or serve as a safety layer on top of RLVF - Reinforcement Learning from Verifiable Feedback. The principles-based approach also connects to Synthetic Data for LLM Training — constitutional critique generates synthetic preference data.
Resources
- Bai et al. 2022 — Constitutional AI
- Anthropic — Constitutional AI Research
- Collective Constitutional AI — Anthropic
- C3AI Framework (2025)
Tags
#constitutional_ai #rlaif #alignment #llm #aisecurity #training