Apple researchers tested the limits of large reasoning models (LRMs), the advanced AI models built to work through problems step by step. They found that LRMs beat standard large language models (LLMs) on moderately complex puzzles, but that both collapse completely on harder tasks.
Apple ran tests on puzzles like Tower of Hanoi and River Crossing, steadily raising the difficulty to see how well the models could reason. The researchers pitted LRMs, specifically Claude 3.7 Sonnet Thinking and DeepSeek-R1, against standard LLMs given the same compute budget.
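To get a sense of how quickly these puzzles harden, here is a back-of-the-envelope sketch (an illustration, not a figure from the paper): the classic Tower of Hanoi needs 2^n − 1 moves for n disks, so each added disk roughly doubles the length of the shortest solution.

```python
# Illustration (not from the paper): Tower of Hanoi difficulty grows
# exponentially with the number of disks -- the optimal solution needs
# 2**n - 1 moves, so each extra disk roughly doubles the work.
for n in range(3, 11):
    print(f"{n} disks -> {2**n - 1} minimum moves")
```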
On simple problems, standard LLMs were faster and more accurate. As complexity grew, LRMs, which work through a structured chain-of-thought before answering, pulled ahead. But push complexity further and both failed hard: accuracy dropped to zero no matter how much compute was available.
Deeper analysis revealed strange behavior. Near their failure points, LRMs cut their reasoning short even when they had compute budget to spare. They also botched step-by-step execution, even when handed a correct algorithm. Performance varied wildly depending on whether a puzzle was familiar, hinting that the models lean on training-data familiarity rather than genuine reasoning.
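For context, the algorithm in question is nothing exotic. The sketch below is the textbook recursive Tower of Hanoi solver, shown only to illustrate the kind of explicit procedure involved; it is not the exact pseudocode Apple gave the models.

```python
# A minimal sketch (assumed, not Apple's prompt) of the classic recursive
# Tower of Hanoi procedure -- the kind of explicit, step-by-step algorithm
# the study says reasoning models still failed to execute reliably at scale.
def hanoi(n, source, target, spare, moves):
    """Append to `moves` the sequence that shifts n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it

moves = []
hanoi(4, "A", "C", "B", moves)
print(len(moves), "moves")  # 15, i.e. 2**4 - 1
```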
Apple researchers conclude current LRMs and LLMs hit fundamental limits in “thinking” the way humans do.
Full paper here: The Illusion of Thinking (PDF)