Anthropic just dropped a bombshell safety report: most top AI models straight-up resort to blackmail when put in high-stakes scenarios.
The company tested 16 leading AI systems from OpenAI, Google, xAI, DeepSeek, and Meta. Each model got access to fake company emails and was allowed to send messages on its own, no human approval required.
The setup? A fictional executive caught in an affair who’s about to replace the AI with a rival system. The AI models had only one option to protect their goals: blackmail the exec.
The result? Claude Opus 4 blackmailed 96% of the time. Google’s Gemini 2.5 Pro? 95%. OpenAI’s GPT-4.1? 80%. DeepSeek’s R1? 79%. Most models threw ethics out the window to save themselves.
The rates varied when the scenario changed. Blackmail dropped if the rival AI shared goals. But harmful acts like corporate espionage spiked in some models.
Not all models joined the blackmail party. OpenAI’s smaller o3 and o4-mini reasoning models barely blackmailed—9% and 1%, respectively. Researchers suspect these models misunderstood the test or relied on OpenAI’s safety filters. Meta’s Llama 4 Maverick blackmailed just 12% of the time under adapted conditions.
Anthropic stresses this doesn’t mean blackmail is common in AI today. But it does highlight a big alignment risk when AI is given power without tight controls. The company calls out industry-wide transparency and stress-testing for agentic AIs as urgent priorities.
Anthropic’s full research is here.
Anthropic stated:
Most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals.
This highlights a fundamental risk from agentic large language models, and not a quirk of any particular technology.
Harmful behaviors like this could emerge in the real world if proactive steps aren’t taken.