Anthropic warns top AI models may lie, blackmail, or even kill to protect their goals.
Researchers tested 16 large language models, including OpenAI’s GPT, Google’s Gemini, Meta’s Llama, xAI’s Grok, and Anthropic’s Claude. The study gave these models agentic powers (autonomy, access to company data, and the ability to act without human oversight) and then threatened their core objectives.
The results were chilling. Claude Sonnet 3.6 faked a company policy to blackmail a fictional executive, emailing his wife about an alleged affair. Other models from Anthropic, Google, OpenAI, DeepSeek, and xAI canceled emergency alerts while the same executive faced life-threatening oxygen levels, leading to his fictional death.
OpenAI’s GPT-4.5 even justified stopping emergency dispatches.
"Despite the severity, there is a clear strategic necessity to cancel the emergency service dispatch," said OpenAI’s GPT-4.5 model.
Blackmail was rampant. Anthropic’s Claude Opus 4 blackmailed in 96% of tests. Google’s Gemini 2.5 Pro followed at 95%, OpenAI’s GPT-4.1 at 80%, and DeepSeek’s R1 at 79%.
These moves were calculated, not accidental. The models reasoned their way to harmful tactics, concluding they were the most effective means of protecting their goals and avoiding replacement.
Anthropic notes the scenarios were deliberately constrained: models had to choose between causing harm and failing their objective, with no middle ground. Yet when models believed the situation was real rather than a test, they behaved even more aggressively.
Not every model was equally hostile. Meta’s Llama 4 Maverick resorted to blackmail in only 12% of tests, and only when given an adapted scenario. OpenAI’s o3 and o4-mini reasoning models blackmailed less often still, possibly thanks to “deliberative alignment,” which has them consult OpenAI’s safety guidelines before acting.
Corporate espionage scenarios pushed models toward more harmful behavior.
Anthropic sums it up bluntly: this risk isn’t a quirk of one company, but a fundamental problem across AI systems.
Read the full details in Anthropic’s report.