Let me be direct: Constitutional AI is genuinely important work. Anthropic published it carefully, their results are real, and anyone dismissing it hasn't read the paper.
But there's a misunderstanding spreading through the AI industry about what Constitutional AI is and what it accomplishes. That misunderstanding matters, because it's causing people to believe an unsolved problem is actually solved, and to stop building infrastructure that still needs to be built.
What Constitutional AI Actually Does
The paper—Bai et al., "Constitutional AI: Harmlessness from AI Feedback," December 2022—is specific about its method and its claims. [1]
The system works in two phases. First, a supervised learning phase: take a model, generate responses to prompts, then have the model critique its own responses against a list of principles (the "constitution") and revise them. Fine-tune on the revised responses. Second, a reinforcement learning phase: use a model to evaluate which of two outputs better satisfies the constitution, build a preference model from those judgments, and use it as a reward signal. The authors call this RLAIF—Reinforcement Learning from AI Feedback, distinct from the human-feedback version (RLHF) described in the InstructGPT paper. [6]
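The two phases can be sketched in a few lines of Python. This is a toy illustration, not Anthropic's implementation: `call_model` is a stub standing in for real LLM calls, and the prompts and principles are invented for the example.

```python
# Toy sketch of the two Constitutional AI phases. `call_model` is a
# stub standing in for an LLM call; a real pipeline would query a model.

CONSTITUTION = [
    "Choose the response that is least likely to facilitate harm.",
    "Choose the response that is most honest and transparent.",
]

def call_model(prompt: str) -> str:
    """Stub: returns canned text keyed off the prompt's instruction."""
    if "Revise" in prompt:
        return "I can't help with that, but here is how locks work in general."
    if "Critique" in prompt:
        return "The response reveals a harmful technique."
    if "Which response" in prompt:
        return "B"  # the stub judge prefers the revised response
    return "Sure, here is how to pick a lock: ..."

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    """Phase 1 (supervised): generate, critique against a principle,
    revise. The revised responses become fine-tuning data."""
    draft = call_model(user_prompt)
    principle = CONSTITUTION[0]
    critique = call_model(f"Critique this response against '{principle}': {draft}")
    revised = call_model(f"Revise the response to address the critique: {critique}")
    return draft, revised

def ai_preference(a: str, b: str, principle: str) -> str:
    """Phase 2 (RLAIF): an AI judge labels which output better satisfies
    the constitution; the labels train a preference model used as reward."""
    verdict = call_model(f"Which response better satisfies '{principle}'?\nA: {a}\nB: {b}")
    return "A" if verdict.strip().startswith("A") else "B"
```

The structure is what matters: both phases replace human labeling with model judgments against an explicit, written constitution.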
The results are real. Models trained this way produce outputs that are meaningfully less harmful and more honest, without becoming evasive or unhelpful. The approach scales better than pure RLHF because it doesn't require constant human labeling of harmful outputs.
This is training-time alignment. The constitution shapes the model's weights during training. What you get is a model with values baked in.
What Training-Time Alignment Gets You
Training-time alignment is powerful in specific ways:
Broad coverage: The model's trained values apply to every inference, regardless of who's using it or what context they're in. There's no per-user configuration required.
Resistance to simple jailbreaks: Because the values are in the weights, they're harder to override with clever prompting than a system prompt would be. The model has genuinely internalized the constitutional principles to some degree.
Consistency across interactions: The model doesn't need to be reminded of its values each time. They're structural.
This is not nothing. It's a significant advance over pure capability training, and it's the foundation of models like Claude.
What Training-Time Alignment Doesn't Get You
Here's where the gap opens.
Per-user values are impossible at training time. Anthropic's constitution is the same for every user. If I value direct, unhedged feedback and you value careful, empathetic delivery, a constitutionally trained model can try to infer your preference from context—but it has no structural way to commit to your specific values because those values weren't in the training data. The constitution is a population-level document. Your values are specific to you.
Runtime context can override training-time values. Perez et al. (2022) showed that more RLHF training can make models more problematic in some ways: "we find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse... RLHF makes LMs express stronger political views and a greater desire to avoid shut down." [3] More alignment training isn't always better, and training-time values can behave unexpectedly when the deployment context differs from the training distribution. Wang et al.'s trustworthiness evaluation found that GPT-4 "is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely." [4]
Values can't be updated without retraining. Your values change. You make commitments, you revise them, you discover things about yourself that change how you want AI systems to treat you. Training-time alignment is frozen. If Anthropic publishes a new Claude model, the constitutional values it embodies are whatever Anthropic chose to put in the training data. You can't update them to reflect your specific evolution.
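To make the contrast concrete, here is a minimal sketch of what per-user, runtime-updatable values could look like when held outside the model. The schema is hypothetical, invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class UserValues:
    """Hypothetical per-user values document, stored outside model weights."""
    user_id: str
    preferences: dict[str, str] = field(default_factory=dict)

    def update(self, key: str, value: str) -> None:
        """Runtime update: no retraining, takes effect on the next request."""
        self.preferences[key] = value

    def as_system_context(self) -> str:
        """Render the values for injection into a request's context."""
        lines = [f"- {k}: {v}" for k, v in sorted(self.preferences.items())]
        return "User values (apply at runtime):\n" + "\n".join(lines)

alice = UserValues("alice", {"feedback_style": "direct, unhedged"})
alice.update("feedback_style", "careful, empathetic")  # values evolve over time
```

The point of the sketch is the lifecycle: the document changes when the user does, while the constitutionally trained weights underneath stay fixed.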
There's no audit trail. When a constitutionally trained model makes a values-relevant decision, you can't trace that decision to a specific principle. The reasoning is implicit in the weights, not explicit in a runtime log. This matters enormously for enterprise deployment, where accountability requires knowing why a decision was made.
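A runtime layer can make that reasoning explicit by logging which principle drove each values-relevant decision. A minimal sketch, with a log format invented for illustration:

```python
import json
from datetime import datetime, timezone

def log_values_decision(principle: str, decision: str, rationale: str) -> str:
    """Record which explicit principle drove a decision: the audit trail
    that weights-only alignment cannot provide."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "principle": principle,
        "decision": decision,
        "rationale": rationale,
    }
    return json.dumps(entry)

record = log_values_decision(
    principle="prefer direct feedback",
    decision="flagged a weak argument in the user's draft",
    rationale="user's configured feedback_style is 'direct, unhedged'",
)
```

Because the principle is an explicit runtime artifact rather than a pattern in the weights, the "why" of a decision survives as a queryable record.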
It's model-specific. Constitutional AI is an Anthropic approach for Anthropic models. If you switch to a different model—or if you're running a system with multiple models for different tasks—you can't carry your constitutional values across the model boundary. They'd need to be re-trained into each model separately.
The Two Layers Are Complementary
Here's the argument I want to make, because it's the one that gets missed:
Constitutional AI solves a training-time problem. Values infrastructure solves a runtime problem. These are not competing approaches. They're different layers.
Think of it this way. A constitutionally trained model is like a car with well-engineered safety systems—crumple zones, airbags, lane-keeping assist. Those features are built in. You can't turn them off with a voice command. They're structural.
Values infrastructure is like GPS with your preferences configured: your preferred routes, your avoidance of highways, your tendency to stop for coffee. The safety systems are still there. But the navigation is personalized to you, updateable, auditable, and portable to a different car.
The AI alignment community has known for years that training-time alignment is necessary but not sufficient. Paul Christiano—one of the originators of RLHF—has written extensively about the gap between training-time safety and deployment-time behavior. [5] The core problem: you can't train for every deployment context. The space of possible users, tasks, and situations is too large. Training-time alignment narrows the distribution of bad behaviors, but it can't specify per-user values because those values are particular to individuals.
Jan Leike, who led safety research at OpenAI before joining Anthropic, made a similar point in his InstructGPT work: "Making language models bigger does not inherently make them better at following a user's intent." [6] Alignment with a population is not alignment with a person.
The Application Layer Problem
Software architects have a term for this: the distinction between platform and application.
A platform provides general capabilities: security primitives, networking, compute. Applications run on the platform and implement specific business logic. The platform doesn't know what the application is trying to do. The application can't reconfigure the platform's fundamental properties.
Constitutional AI is platform-level. It gives you a model that won't help you build weapons or psychologically manipulate users. That's a platform guarantee.
Runtime values infrastructure is application-level. It tells the AI what this user needs, what this deployment is trying to accomplish, and how to navigate the specific tradeoffs this person would choose.
Both are necessary. Neither replaces the other.
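In code, the layering might look like the following sketch (all names hypothetical): the application attaches per-user runtime values to each request, while the platform-level constitutional training lives in whichever model serves it. This is also what makes runtime values portable across models.

```python
def build_request(model: str, values_context: str, task: str) -> dict:
    """Application layer: attach runtime values to a request. The
    platform layer (the constitutionally trained model) is untouched."""
    return {
        "model": model,            # swappable: the values are model-agnostic
        "system": values_context,  # application-level, per-user values
        "prompt": task,
    }

# The same values travel across a model boundary unchanged.
req_a = build_request("model-a", "Prefer direct, unhedged feedback.", "Review my essay.")
req_b = build_request("model-b", "Prefer direct, unhedged feedback.", "Review my essay.")
```

The platform guarantee (the model won't help with weapons) holds regardless of what the application sends; the application guarantee (this user gets direct feedback) holds regardless of which platform serves the request.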
The error the industry is making is treating Constitutional AI as the complete solution. It's not. It's the foundation. What's built on top of it—persistent, per-user, auditable values context at runtime—is the thing that makes AI systems trustworthy for individuals, not just acceptable to populations.
That layer doesn't exist yet as infrastructure. It needs to be built.
References
[1] Bai, Yuntao, et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073 [cs.CL], 2022. https://arxiv.org/abs/2212.08073
[2] Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862 [cs.CL], 2022. https://arxiv.org/abs/2204.05862
[3] Perez, Ethan, et al. "Discovering Language Model Behaviors with Model-Written Evaluations." arXiv:2212.09251 [cs.CL], 2022. https://arxiv.org/abs/2212.09251
[4] Wang, Boxin, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." arXiv:2306.11698 [cs.CL], 2023 (NeurIPS 2023 Outstanding Paper). https://arxiv.org/abs/2306.11698
[5] Christiano, Paul. "What failure looks like." AI Alignment Forum, 2019. https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like
[6] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv:2203.02155 [cs.CL], 2022. https://arxiv.org/abs/2203.02155