Reasoning models are funny things. You make one by taking a vanilla large language model, asking it to produce a chain of outputs on its way to some final goal, and encouraging it to follow these steps in sequence before generating an answer.
They work like a charm, mostly, and are behind some of the more impressive examples of AI applications. Reasoners are sitting pretty at the top of lots of demanding benchmarks, and have even caused some critics to think again about the limits of the current paradigm.
Still, not everyone buys it. The more sceptical amongst AI watchers like to argue that what looks like thinking is just a trick of the eye. It’s a mirage, or as Apple put it in a paper last week, an illusion.
To make their case, researchers used puzzles to test how large reasoning models handle increasing problem complexity. They took puzzles like the Tower of Hanoi, converted them into textual descriptions, and fed them into some of the best models.
The results didn’t reflect well on reasoning models, showing that — when faced with a sufficiently high level of complexity — they pass a tipping point beyond which performance collapses. Despite having more compute to play with, the models tend to throw in the towel when they deem the problem sufficiently thorny.
Lots of people can’t help but think that this time large language models really are in trouble. Doesn’t this prove they aren’t actually thinking? Are reasoning models useless? Shall we cancel the short timelines?
In short, no. This is a persuasive but flawed bit of research, one that let a desire to say something provocative get in the way of what might have been solid work. One obvious problem is that the authors don’t even define ‘reasoning’ or ‘thinking’ in the paper. It seems odd to call what LLMs are doing an illusion if you don’t bother to explain what they are pretending to do.
Likewise, some of the ‘high complexity’ problems require more reasoning than fits in the context window (writing out a full reasoning trace for the Tower of Hanoi with 20 disks means listing over a million moves, something that would take a human months). This is probably why the models call it a day when the researchers’ prompts collide with RLHF’d objectives like ‘be concise’.
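To put a rough number on that (my own back-of-the-envelope sketch, not a figure from the paper): the optimal Tower of Hanoi solution takes 2^n - 1 moves, so a fully written-out trace grows exponentially with the number of disks.

```python
# Back-of-the-envelope: the minimum number of moves for n disks is 2**n - 1.
# The ~10 tokens-per-move figure is an assumption for illustration only.
for n in (8, 12, 16, 20):
    moves = 2**n - 1
    print(f"{n} disks: {moves:,} moves, roughly {moves * 10:,} tokens to write out")
```

At 20 disks that is over a million moves, comfortably more than any current context window can hold before the model has even started explaining itself.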
Finally, and this is the big one, they don’t let the models use tools. The test is about what we might call ‘pure reasoners’ without access to the simple system elements that would make these puzzles trivial for any consumer-grade LLM.
Scale models
We’ve been here before. First large language models were useless toys. Then they got bigger, and boy did they get better. Eventually they could handle textual and visual inputs. More recently, the advent of reasoning models dislodged some of the tougher benchmarks like ARC-AGI 1.
The same pattern repeats itself. Some experiments seem to erect an insurmountable barrier for the large model paradigm. Then a new factor like scale, multimodality or reasoning helps models blast through the wall. Sceptics keep calling the end of the line, but the train doesn’t seem to notice.
They forget AI is a moving target. Today’s models are vastly more complex than those of just a few years ago. Sure, they are bigger and they use chain-of-thought techniques, but they also contain specialised modules that allow for tool use, memory, sandboxed computation, and internet search.
The models are kind of sticky, which is probably their most underrated characteristic. You can build on top of them, giving them new capabilities that help overcome what used to seem like irreparable flaws.
Apple’s test stumped them because it only dealt with the raw model without its supporting infrastructure. With no tools, search or visual processing, they were playing blindfolded with one hand tied behind their back.
The mythos of large language models is all about scale. The jump from models like GPT-2 (1.5 billion parameters in 2019) to GPT-3 (175 billion parameters in 2020) famously showed us that making models bigger and training on more data led to remarkable gains in generalisation without task-specific training.
Scaling brought better fluency, coherence, and coverage of knowledge. Large models began to work well on tasks they hadn’t explicitly seen before, probably because they had absorbed a head-spinning number of patterns through their pre-training process.
Yes, there were some scale maximalists, but by the mid-2020s most researchers accepted that simply making models bigger wasn’t going to cut it. The performance curves for some challenges were flattening despite the growing size of models.
As recently as last year, it wasn’t clear that LLMs would be capable of clearing certain evals designed to test for general reasoning capabilities. The most famous of these tests was the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), created by deep learning researcher François Chollet.
The test was originally launched in 2019, but the top rated scores hadn’t improved much up to the launch of last year’s competition. That was because Chollet designed the test to include novel problems, which means—even ingesting massive chunks of the internet—large models were unlikely to have seen a critical mass of similar examples in their training process.
According to Chollet, because language models only apply existing templates to solve problems, they get stuck on tests that human children would be able to manage comfortably. But the test also stood for so long because it wasn’t particularly well known, which narrowed the pool of researchers trying to solve it. Then came the prize launch in the summer of 2024, which offered cash and kudos for passing the test.
When the prize was launched, I suggested that a model would pass the test in fairly short order:
“The (quite literally) million dollar question is whether ARC-AGI will stand the test of time. If I had to guess, I would expect major progress on the challenge within the next year or so. This is because a) bigger models with some clever algorithmic improvements seem to be doing something other than simple pattern matching; and b) there’s already been some improvement on the benchmark since it was released.”
I didn’t have access to any kind of special information when I made that prediction. I just thought that models were already much better than they had any right to be, and that it was unwise to bet against systems that can leverage more compute (which turned out to happen at inference time via reasoning models).
A few early entrants improved baseline scores to more respectable levels, but it wasn’t until OpenAI’s o3 model scored 87.5 percent on the ARC-AGI benchmark (albeit via a very costly process) at the end of last year that we could say it was passed.
Chollet himself, a well known critic of large language models’ ability to reason, said ‘all intuition about AI capabilities will need to get updated,’ while Melanie Mitchell (also a long-time sceptic) called it ‘quite amazing’.
With models performing well on the original test, Chollet set to work on a follow-up benchmark called ARC-AGI 2. Released earlier this year, ARC-AGI 2 ‘raises the bar for difficulty for AI while maintaining the same relative ease for humans.’ Currently, the best performing model is a reasoning version of Anthropic’s Claude 4 Opus at 8.6%.
Apples and oranges
In Apple’s paper about illusory thinking, the analysis of the reasoning traces (the step-by-step thoughts the model generates) shows that the models’ ‘thinking’ is prone to lots of different failure modes.
They bundled behaviour into three regimes of problem complexity, each of which saw the models perform with varying degrees of success:
Easy problems: As expected, the models performed best on these — in several cases achieving solid scores. When they did fail, it often involved finding a correct solution in the chain-of-thought before exploring wrong alternatives. This ‘overthinking’ has been observed elsewhere, and leads reasoners to sometimes talk themselves out of the correct answer by the end of the chain.
Moderate problems: Again, the models could succeed here, but they tended to generate a slew of incorrect intermediate steps from the get-go. When they worked, it was only later in a convoluted reasoning chain that they found a correct solution (not exactly efficient, but in general this seems fine to me).
Hard problems: This is the ‘collapse’ scenario described by Apple. The model’s chain-of-thought isn’t pretty to look at, consisting of numerous steps that seem to be incorrect or irrelevant. There is no point in the chain where it finds a workable approach (though as above this is likely because some of the correct answers exceed token limits).
One curious example the researchers describe is the Tower of Hanoi puzzle (a classic problem that requires moving disks between pegs under strict rules). Apple’s team tested their LLMs on Tower of Hanoi puzzles of increasing disk numbers, finding — perhaps unsurprisingly — that performance fell as the number of disks grew.
Then they tried an ‘algorithm injection’ experiment where they gave the model the correct algorithm in the prompt (essentially walking it through the steps it should take), to see if that helped on harder cases.
The result? It didn’t help at all. Even when told exactly how to solve the puzzle, the reasoning model could not execute the steps reliably once the problem became sufficiently complex. The group doesn’t really explain what they think is happening here, but they do suggest it represents ‘limitations in performing exact computation’.
This is no doubt true, but large language models have always been pretty bad at performing exact computation. That’s why people ask ChatGPT to ‘use code’ when making calculations if they want to get a reliable answer. Had the model been allowed to use the algorithm via some plug-in, it would have been a different story.
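For reference, the algorithm in question fits in a few lines. Here is a minimal sketch of the classic recursive solution (my own illustration, not the exact text Apple injected into its prompts), the sort of thing a code-execution plug-in could run in milliseconds:

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # shift the tower above the largest disk out of the way
    moves.append((source, target))              # move the largest disk to its destination
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller tower back on top

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023 moves, i.e. 2**10 - 1
```

Executing this is trivial for an interpreter; reciting every step in natural language, one token at a time, is where the models come unstuck.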
Differential reasoning
OK, so the test isn’t a total wash: reasoning models do have some real limitations. Why is that?
I’ll begin by saying no-one really knows for sure. Not the researchers in the AI labs, not the markets, not academics, and certainly not me. All we have are intuitions. Mine is that it involves the symbol grounding problem, which concerns how symbols like words can acquire intrinsic meaning.
Reasoning systems, like all LLMs, find correlations and produce patterns. But as Apple’s work reminds us, they can also produce explanations that don’t necessarily correspond to the basic facts of reality. They contain oodles of knowledge, but it’s more like raw ore than refined metal.
When reasoners go doolally, it’s because they struggle to reliably connect concepts to an underlying model of the world. Reliably is the key word here. I do think the models can ‘reason’, providing your definition is something like ‘the systematic chaining of relationships between internal representations to reach a conclusion that satisfies a given set of constraints.’
This rather broad definition accounts for reasoning by navigating the web of learned similarities among representations, where an agent steps through them until a pattern that satisfies the goal appears. That is how I think an LLM reasons, which I do think is reasoning — but it’s not exactly the same as what people do.
Think about two basic ways of knowing about a tree. The concept ‘tree’ makes sense because it isn’t ‘bush’, ‘pole’ or ‘cloud’. Large models gobble up billions of sentences, notice the connections between tree and its neighbours (leaf, bark, shade, roots), and build a high-dimensional map where the concept’s position is fixed by everything it is not.
Ask a model ‘what climbs a tree?’ and the word squirrel lights up because it sits only a few degrees away in that semantic constellation. This is grounding by difference, where each representation is determined by the relative positions of other representations.
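Here is a toy sketch of that idea. The vectors and their dimensions are invented for illustration (real embeddings are learned and run to thousands of opaque dimensions), but the mechanism, ranking words by their relative positions, is the same.

```python
from math import sqrt

# Hand-made toy vectors; the dimensions (climbable, leafy, metallic, airborne)
# are invented for this example. Real embeddings are learned, not designed.
vectors = {
    "tree":     [0.9, 0.9, 0.0, 0.1],
    "squirrel": [0.8, 0.3, 0.0, 0.2],
    "toaster":  [0.0, 0.0, 0.9, 0.0],
    "cloud":    [0.0, 0.1, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Grounding by difference: a word's meaning is fixed by its position relative
# to every other word, so 'squirrel' ranks near 'tree' and 'toaster' does not.
ranked = sorted(vectors, key=lambda w: cosine(vectors["tree"], vectors[w]), reverse=True)
print(ranked)  # ['tree', 'squirrel', 'cloud', 'toaster']
```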
Purists would say this isn’t really ‘grounding’ at all because the model is only grappling with the meaning of symbols by using other symbols. Compare that to my own experience of walking up to a tree, touching the trunk, and feeling the cambium under my fingernail.
That multisensory encounter anchors the word tree to a slice of the physical world. A logger, a robin, and a child building a tree-house all ground the concept in lived affordances (in that you can chop it down, nest in it, or climb it). This is grounding through reference.
LLMs don’t mistake a tree for a toaster because their vector space keeps those poles far apart under default conditions. Instead, they may hallucinate a ‘glass roof’ that provides ‘ample shade,’ because no tactile or optical reality is acting as a check on these associations. Humans catch it instantly because reference knowledge (e.g. glass is transparent, shade needs opacity) is wired into our practical intuition about the world.
Based on these ideas, we can try to make sense of the failure modes described by Apple:
In easy tasks, the tree of possibilities is more likely to contain useful examples the model has seen before. The model grabs the answer early, then keeps sampling neighbours until it drifts off course. Once the local similarity signal weakens, nothing tells the sampler it has overshot the target.
In moderate tasks, the answer lies a few hops away. The model finds its way through a cloud of wrong patterns until it lands on a cluster that lines up with the goal state. More tokens buy it more opportunities to search the manifold of token differences.
In hard tasks, the model doesn’t cover itself in glory. Beyond a certain depth there is no nearby cluster that satisfies all constraints, which means the sampling process has nothing to feed on. On the hardest puzzles the model stops thinking either because the next steps all look equally bad, or because it can’t fit the answer in its output thanks to token limits and RLHF constraints.
Grounding provides the foundation for reasoning, but it isn’t the same thing as reasoning itself. Rather, you need grounding to tether reasoning to the world it claims to explain. So when models ground using difference, I like to think about this process as a kind of differential reasoning.
This is why models work so well in some instances and fail badly in others. The frontier is jagged because they reason in a different way to people.
Beyond pure reasoners
I don’t see this observation as a major bottleneck for AI development. In fact, I think two possible responses put developers in a remarkably strong position: systematisation and agency.
Systematisation is about making the core model a node within a bigger apparatus. We keep the language model in place, but surround it with specialist gadgets: web search look-up, a code sandbox, a vision encoder, and a knowledge base. The model doesn’t need to have all the answers; it just needs to decide when and how to invoke the right tool.
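As a crude sketch of that routing idea (the tool stubs are hypothetical placeholders, and a keyword heuristic stands in for the function-calling decision a real model would make itself):

```python
# Hypothetical tool stubs; a real system would call a search API, a sandbox, etc.
def web_search(query):  return f"[search results for: {query}]"
def run_code(task):     return f"[sandboxed execution of: {task}]"
def lookup_kb(query):   return f"[knowledge-base entry for: {query}]"

def route(query: str) -> str:
    """Decide when and how to hand a query off to a specialist tool."""
    q = query.lower()
    if any(word in q for word in ("latest", "today", "news")):
        return web_search(query)
    if any(word in q for word in ("calculate", "compute", "solve")):
        return run_code(query)
    return lookup_kb(query)

print(route("Solve Tower of Hanoi with 12 disks"))  # lands in the code sandbox
```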
In practice this is already how people tame hallucinations. On the SimpleQA leaderboard, the best performing models — those that tend to supplement answers with an internet search functionality — clock in at between 90 and 95 per cent accuracy.
Each add-on is there for a reason. External search keeps the model up to date, an execution engine allows it to deal with hard arithmetic or code, multimodal functionality lets it ground words in pixels, and long-term memory means it can recall prior interactions.
Apple’s experiment stripped all that away. Re-running their Tower-of-Hanoi test with a tool-using agent would let the language core sketch a plan, hand the plan to a symbolic solver, and verify the result before answering.
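A sketch of what that could look like (a hypothetical arrangement, not anything Apple tested): reuse the recursive solver from earlier to produce the plan, then replay it against the puzzle’s rules before letting the agent answer.

```python
def hanoi_plan(n, src="A", dst="C", tmp="B"):
    """Symbolic solver: return the optimal move list for n disks."""
    if n == 0:
        return []
    return hanoi_plan(n - 1, src, tmp, dst) + [(src, dst)] + hanoi_plan(n - 1, tmp, dst, src)

def verify(n, moves):
    """Replay the moves and check each one is legal before answering."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return False  # moving from an empty peg, or onto a smaller disk
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

# The language core's only job here is deciding to call these tools.
plan = hanoi_plan(12)
print(len(plan), verify(12, plan))  # 4095 True
```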
Systematisation aside, a second approach might even allow models to ground via reference. Instead of static prompts, drop the model into an environment where it can act, observe consequences, update its policy, and store new skills. This is the play in David Silver and Rich Sutton’s ‘Era of Experience’ paper, where the reward signals come from the environment rather than a human ratifier guessing from the sidelines:
‘Such agents will be able to actively explore the world, adapt to changing environments, and discover strategies that might never occur to a human. These richer interactions will provide a means to autonomously understand and control the digital world.’
In the short run, bolting on tools will keep pushing the envelope of what today’s text-trained models can accomplish. But in the long run, the core of these models still lacks first-hand reality checks. Grounded experience offers a more durable solution, but only if we close the loop so that actions have consequences the agent can’t ignore.
The great AI researcher Marvin Minsky argued that the human brain was what computer scientists call a ‘kludge’. He thought that grey matter was an inelegant solution to the challenges faced by early humans, cobbled together from specialised parts over the course of millennia.
I like the kludge concept because it suggests intelligence is a product of both specific mechanisms and their patterns of interaction. For AI, the implication is that lots of small, dedicated modules can be linked together to form a system that benefits from their associations.
It’s a useful idea for seeing what the future looks like. Yes, today’s pure reasoners have limits. That’s why we ensconce them in systems that stop those frailties from manifesting as often as they otherwise would. But really that’s just a stop gap, a way to get extremely capable models that can perform a kind of referential reasoning via the back door.
Sooner than you might think, the labs will produce a tool-using LLM that works in the wild and gets better based on what it sees. When that happens — and it will happen — today’s pure reasoners will look like toy models.