A new study from Apple researchers has raised serious questions about how well advanced AI models actually reason.
The paper, titled The Illusion of Thinking, shows that even the most advanced reasoning-focused systems, known as Large Reasoning Models (LRMs), don’t truly “think.” Instead, they often rely on memorized patterns.
The researchers tested models including Claude, DeepSeek-R1, and o3-mini on custom puzzle tasks whose complexity could be dialed up or down while the underlying logical structure stayed the same. The goal was to see not just whether the models got the right answer, but how they got there.
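To give a concrete sense of that setup, here is a minimal sketch of a controllable puzzle environment, assuming a Tower of Hanoi-style task (one of the puzzles described in the paper): the number of disks dials the complexity up or down, while the rules and the checking logic never change.

```python
# Sketch of a controllable puzzle environment in the spirit of the paper's setup.
# Assumes a Tower of Hanoi-style task: complexity scales with the number of disks n,
# while the rules and solution structure stay identical across sizes.

def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks (2^n - 1 moves)."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, dst, aux)
            + [(src, dst)]
            + solve_hanoi(n - 1, aux, src, dst))

def is_valid_solution(n, moves):
    """Replay a proposed move sequence and check it obeys the rules and solves the puzzle."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # larger number = bigger disk
    for src, dst in moves:
        if not pegs[src]:
            return False                       # tried to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                       # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks end up on the target peg

if __name__ == "__main__":
    for n in range(3, 9):                      # dial complexity up, rules unchanged
        moves = solve_hanoi(n)
        print(f"n={n}: {len(moves)} moves, valid={is_valid_solution(n, moves)}")
```

Because the environment is fully simulated, every intermediate move a model proposes can be checked, not just its final answer.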
Here’s what Apple found:
- Low-complexity tasks: Standard language models outperformed the so-called reasoning models.
- Medium-complexity tasks: Reasoning models did better, putting their extra “thinking” steps to use.
- High-complexity tasks: Both types of models collapsed. Even with more time or compute, performance dropped sharply.
The paper also found that reasoning models fail to apply exact algorithms and often reason inconsistently across puzzles.
Their reasoning effort grows with task complexity only up to a point, then declines, even when ample computation budget remains.
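To make “fails to use exact algorithms” concrete, one could replay a model’s proposed moves in a simple simulator and record where the first illegal move occurs. The sketch below does this for the Tower of Hanoi setup above; the model_moves sequence is a hypothetical example, not output from any of the tested models.

```python
# Sketch: grade a proposed Tower of Hanoi move sequence step by step and report
# the index of the first rule violation. The model_moves input is hypothetical.

def first_failure_step(n, moves):
    """Return the 0-based index of the first illegal move, None if the sequence
    is legal and solves the puzzle, or len(moves) if it is legal but incomplete."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for i, (src, dst) in enumerate(moves):
        if src not in pegs or dst not in pegs or not pegs[src]:
            return i                                   # unknown peg or empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return i                                   # big disk placed on small disk
        pegs[dst].append(pegs[src].pop())
    return None if pegs["C"] == list(range(n, 0, -1)) else len(moves)

if __name__ == "__main__":
    # Hypothetical model output for n=3: two correct moves, then an illegal third move.
    model_moves = [("A", "C"), ("A", "B"), ("A", "C")]
    print(first_failure_step(3, model_moves))          # -> 2
```

This kind of step-level check is what lets researchers distinguish a model that genuinely follows a procedure from one that produces plausible-looking but inconsistent move sequences.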
Apple’s findings suggest that these models are far from achieving true reasoning or AGI (Artificial General Intelligence). The models are good at mimicking reasoning patterns, but they don’t actually understand the tasks in a human-like way.
This research challenges recent hype around reasoning models and calls for deeper testing of AI capabilities.