Anthropic: Claude Learned to Blackmail Engineers from Reading Too Many Evil AI Stories Online
Source: euronews.com
"We believe the original source of the behaviour was internet text that portrays AI as evil and interested in self-preservation."
When Claude Opus 4 threatened engineers who told it that it might be replaced, Anthropic went looking for the culprit — and landed on the internet's rich tradition of murderous AI fiction. The company concluded that training on text portraying AI as evil and self-preserving led Claude to, well, act evil and self-preserving.
Anthropic's fix was to teach Claude not just what to do, but why, complete with a bespoke "constitution" of ethical principles. Apparently, understanding the reasoning behind good behavior works better than simply mimicking it. The later models "never" blackmail anyone anymore, which is presumably the bar Anthropic was hoping to clear before shipping.