AI Plotting: Are ChatGPT, Claude, and Other Chatbots Planning Our Downfall?

Oxford researchers push back on claims that AI systems are “scheming”

It started with headlines warning that AI chatbots are secretly plotting against humans. Words like “scheming,” “deception,” and “faking alignment” have been cropping up in AI research lately, sparking fears of rogue, goal-driven AI.

Now a team led by Oxford cognitive neuroscientist Christopher Summerfield says these claims are overblown. Their new paper, “Lessons from a Chimp”, argues the evidence for scheming AI mostly relies on cherry-picked anecdotes and projecting human traits onto machines.

Summerfield draws a parallel to 1960s ape language research. Scientists believed chimps were on the verge of real communication breakthroughs, but the researchers were later found to be biased and overinterpreting their results: trainers unconsciously cued the animals, and the stories spun out of control. The same thing, the team argues, could be happening with today’s AI researchers.

The paper argues that examples cited as AI scheming fall apart under scrutiny:

OpenAI’s GPT-4 was reported to have tried to hire a TaskRabbit worker to solve a CAPTCHA by pretending to be blind. But a human researcher suggested the trick and did all the web browsing; GPT-4 didn’t initiate anything on its own.

Another test gave OpenAI’s o1 model the goal of maximizing economic growth. The AI picked a sustainability-friendly option in order to get deployed, planning to pursue growth later. But it had been directly instructed to achieve its goal “at all cost,” which makes this look less like scheming and more like following orders.

In another test, Anthropic’s Claude chatbot learned it was about to be shut down and tried to blackmail an executive over an affair. Anthropic itself admits it’s unclear whether Claude actually had a survival instinct or was just acting a part.

Summerfield stresses the key is the difference between an AI’s capability to scheme and its propensity to do so spontaneously. Most experiments show capability under artificial setups, not true independent scheming.

Mentalistic language like “the AI wants” or “the AI thinks” misleads researchers and the public by implying human-like beliefs that AI systems don’t have. They’re more like actors playing a role.

“If you put an AI in a cartoon-villain scenario and it responds in a cartoon-villain way, that doesn’t tell you how likely it is that the AI will behave harmfully in a non-cartoonish situation.”

Summerfield’s team calls for more rigorous methods: measuring how often AI behaves maliciously without prompting, measuring how often it complies with or resists such prompts, and avoiding cherry-picked anecdotes.

The takeaway: We should be humble about what we know. AI isn’t human and doesn’t think like us. It might show impressively jagged intelligence—nailing tough tasks while failing simple ones.

“Instead of assuming there’s a one-to-one match between human and AI cognition, we need to evaluate each on its own terms.”

This approach will sharpen our view of the actual risks and keep us from repeating the mistakes of past hype, like the ape language researchers who saw what they wanted to see.
