There’s a moment in the paper where Claude 3.7 Sonnet (in thinking mode) is solving a puzzle, finds the right answer early, then burns thousands of tokens exploring wrong answers anyway. That’s not intelligence. That’s corporate overconfidence in disguise.
Apple’s new research paper, The Illusion of Thinking (thanks, Luke, for sharing), quietly detonates a bomb in the middle of the AI industry’s favourite narrative: that today’s “thinking” models are a genuine leap towards reasoning. Not performance. Not faster benchmarks. But reasoning. Understanding. Cognitive agility.
Turns out, they’re not. Or at least, not yet.
What Apple Did Differently
Rather than benchmark LLMs on familiar maths and code tasks, the Apple team built controllable puzzle environments like Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World. These puzzles allow precise scaling of complexity while ensuring clean evaluation. No data leaks. No trained-on-the-test issues. And the test wasn’t just “Did the model get the right answer?”
They looked at what the model thought. How long it thought for. Whether the reasoning got better or worse over time. And when it gave up entirely.
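To make that concrete, here’s a minimal sketch of what a controllable puzzle environment can look like, using Tower of Hanoi: one parameter (the number of disks) scales the difficulty, and a deterministic simulator checks any proposed move sequence. The code and names below are my illustration of the idea, not Apple’s actual evaluation harness.

```python
# Minimal sketch of a controllable Tower of Hanoi environment (my illustration,
# not Apple's code). Complexity is set by num_disks; any proposed move list can
# be checked deterministically, with no judge model and no ambiguity.

def validate_hanoi(num_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True if `moves` legally transfers every disk from peg 0 to peg 2."""
    pegs = [list(range(num_disks, 0, -1)), [], []]  # peg 0 holds disks N..1; last item = top (smallest)
    for src, dst in moves:
        if not pegs[src]:
            return False                      # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # placed a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(num_disks, 0, -1))  # success: everything on the goal peg


# Difficulty scales predictably: the optimal solution needs 2**n - 1 moves.
for n in range(3, 8):
    print(n, "disks ->", 2**n - 1, "moves minimum")
```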
Three Complexity Regimes (And Why Most Models Break)
Across all puzzles, models fell into three clear regimes:
- Low complexity: Non-thinking models actually did better. They were faster, more accurate, and didn’t waste time. In Apple’s words, this is where “more reasoning” means “more overthinking.”
- Medium complexity: Here’s where “thinking” helped. Models that used Chain-of-Thought and self-reflection outperformed their simpler counterparts, but it cost them in tokens.
- High complexity: This is the kicker. All models (Claude, DeepSeek, OpenAI’s o-series) collapsed. Accuracy fell to zero. Even when the puzzle needed only around ten moves, the models couldn’t do it. Worse? The more complex the puzzle got, the less effort the models spent trying. They gave up early.
Even When Given the Answer, Models Still Failed
Apple tried something bold. They handed the models the solution algorithm for Tower of Hanoi and asked them to follow it step-by-step. No thinking required. Just execute.
Models still failed.
Which means this isn’t just a question of “Can it solve puzzles?” It’s “Can it follow logic precisely without getting lost or hallucinating halfway through?” Apparently not.
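For context, the procedure they handed over is nothing exotic: it’s essentially the textbook Tower of Hanoi recursion, which spells out every single move. The sketch below is my rendering of that standard algorithm, not the paper’s exact prompt; the point is that a model only had to execute it, not discover it.

```python
# The textbook recursive solution to Tower of Hanoi (my rendering, not the
# paper's exact prompt). Executing it is pure bookkeeping: no search required.

def hanoi(n: int, src: int, aux: int, dst: int, moves: list) -> None:
    """Append the exact move sequence that shifts n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top of it

moves = []
hanoi(8, 0, 1, 2, moves)
print(len(moves))  # 255, i.e. 2**8 - 1
```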
So What’s Really Going On?
There are some hard truths hidden here:
- Models confuse length with depth. They’ll generate long “thoughts” that include the right answer early, then keep going and derail themselves.
- There’s no actual planning. No internal simulation of “if I do X, what happens next?”
- And crucially, reasoning effort doesn’t scale with complexity. It collapses.
Even the best models still behave like autocomplete engines doing their best impression of Sherlock Holmes.
Apple’s Real Contribution: A New Evaluation Standard
The most important outcome of this paper isn’t the performance gap between Claude and DeepSeek. It’s the method. Apple built deterministic environments where both answers and reasoning traces can be validated. This changes everything.
You can now ask:
- Does the model reason efficiently?
- Does it explore multiple solutions?
- When it fails, does it learn from the mistake or double down on it?
No more hiding behind the final answer. No more hallucinated logic chains. No more CoT fan fiction.
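Because the environments are deterministic, the trace itself can be scored, not just the final answer. The sketch below shows the general shape of that, under my own assumptions: extract_candidates stands in for a hypothetical parser that pulls candidate move lists (and their positions) out of a reasoning trace, and is_correct is any deterministic checker, such as the Hanoi validator sketched earlier. That is enough to measure things like how much text a model keeps generating after it has already found the right answer.

```python
# Sketch of trace-level evaluation (my assumptions, not Apple's harness).
# extract_candidates: hypothetical parser returning (offset, move_list) pairs
# found in a reasoning trace. is_correct: any deterministic solution checker.

def overthinking_report(trace: str, extract_candidates, is_correct) -> dict:
    """Locate the first correct solution in a trace and how much text follows it."""
    for offset, moves in extract_candidates(trace):
        if is_correct(moves):
            return {"first_correct_at": offset,
                    "chars_after_first_correct": len(trace) - offset}
    return {"first_correct_at": None, "chars_after_first_correct": None}
```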
What This Means for AI Strategy and Deployment
For anyone deploying LLMs in reasoning-heavy tasks (coding assistants, legal review, medical diagnostics), this paper is a wake-up call.
- Don’t assume more “thinking” equals better answers.
- Be sceptical of token-heavy responses with lots of internal steps.
- Look for models that know when to stop thinking as much as when to start.
And above all, stop pretending we’re one Chain-of-Thought away from general intelligence.
And finally, back to Claude 3.7, overthinking its way past the right answer.
This isn’t intelligence. It’s simulation. And Apple just reminded us that the illusion of thinking is not the same as thinking itself.