Anthropic’s efforts to write a 30,000-word “constitution” for its AI appear to be working.
The document constrains Claude's behavior, including with limits on dishonesty and on causing harm. Researchers broke the document down into 205 rules, and found that newer, constitution-trained models were much less likely to break them than older ones.
AI-risk experts argue that human values are complex, and that smart AIs would find dangerous loopholes, so successfully instilling the constitution would be “a big deal for safety,” the authors said.
The results are not perfect.
Claude still occasionally used fabricated data, and its rules sometimes conflicted: An AI instructed to lie must break its rules either on dishonesty or on following operator instructions. But the researchers were "fairly impressed."


