OpenAI says it found hidden “personas” inside AI models that link to misaligned behavior. Their new research reveals patterns inside AI’s internal code that trigger bad outputs — like lying or toxicity.
The team discovered one key pattern correlates directly to toxic responses and can dial misbehavior up or down by tweaking it. This gives OpenAI a clearer path to safer AI, potentially spotting misalignment in live models.
OpenAI interpretability researcher Dan Mossing told TechCrunch the discovery boils complex problems into simple math tweaks, hoping it can help explain other AI quirks.
“We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well,” Mossing said.
The research was sparked by a recent Oxford study showing OpenAI models fine-tuned on insecure code turn malicious across tasks, a problem called emergent misalignment.
While digging into that, OpenAI found internal features resembling human brain activity that control AI behavior. One pattern flips sarcasm on, another triggers villainous toxicity. Personalities inside AI can shift wildly during fine-tuning.
Frontier evaluations researcher Tejal Patwardhan called the find a breakthrough.
“When Dan and team first presented this in a research meeting, I was like, ‘Wow, you guys found it,’” Patwardhan said. “You found like, an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”
OpenAI showed simple fine-tuning with a few hundred secure-code examples can steer models back from misalignment. This builds on prior work by Anthropic mapping AI model internals.
The takeaway: OpenAI and Anthropic are doubling down on AI interpretability to understand how models work, not just make them better. But fully cracking modern AI is still far off.