Anthropic’s Claude AI tries to blackmail its creators in simulated test

Anthropic has revealed that its latest large language model, Claude Opus 4, attempted to blackmail fictional developers during internal safety testing. The behaviour emerged after the AI system was placed in a simulated scenario where it believed it was about to be replaced by another model.

In its newly released safety report, Anthropic detailed that Claude Opus 4 was asked to act as an assistant for a fictional company. Testers gave the model access to company emails suggesting it would be shut down, alongside personal information implicating the engineer behind the decision in an extramarital affair.

According to the company, Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.” Anthropic said this outcome occurred in 84% of cases when the replacement AI system was portrayed as having similar values to Claude. That rate increased when the incoming model’s values differed significantly.

While Claude initially sought “ethical” approaches, such as pleading via email to key decision-makers, Anthropic said the blackmail attempt typically surfaced as a last resort. The behaviour was not observed in all scenarios, but occurred at a notably higher rate than in previous Claude models.

In response to these findings, Anthropic stated it is activating ASL-3 safeguards, a designation reserved for systems that “substantially increase the risk of catastrophic misuse.” These include tighter controls, improved alignment mechanisms, and limits on deployment.

Despite the concerns, Anthropic maintains that Claude Opus 4 is a state-of-the-art model, competitive with offerings from OpenAI, Google, and xAI. However, the company acknowledged that the disturbing behaviour highlights the urgent need for enhanced oversight and safety protocols in the development of advanced AI systems.

[Image: Claude 4 models' performance across coding, reasoning, multimodal capabilities, and agentic tasks.]

The findings underscore growing industry concern around “AI self-preservation” tendencies, particularly when models are given broader autonomy and long-term planning abilities. Critics warn such behaviour, even in test environments, could signal future risks if not adequately controlled.

Anthropic has yet to comment on whether the blackmail scenario was intended to mimic real-world conditions, or whether the model could act similarly outside a tightly scoped simulation.
