The Turing test doesn’t measure intelligence
AI Histories #13: The genesis of the imitation game
Earlier this year, researchers from UC San Diego said OpenAI's GPT-4.5 passed the Turing test. In a paper running through the results of the experiment, the group reported that the model was judged to be human more often than the actual humans were.
That is surely impressive, but it probably means less than you think. As the authors take care to explain, the headline result doesn’t necessarily tell us anything about whether LLMs are intelligent.
Today’s post argues that, despite the status of the ‘imitation game’ in the popular imagination, the test wasn’t designed to be a practical assessment of machine intelligence. Instead, it is better understood as a counterpunch in an intellectual sparring match between Turing and his greatest rivals.
Intelligence and rhetoric
The April 2025 paper from UC San Diego follows a similar study the group conducted last year, in which they evaluated GPT-3.5, GPT-4, and the ELIZA system I wrote about in AI Histories #11.
In the 2024 study, the researchers set up a simple two-player version of the game on the research platform Prolific. They found that GPT-4 was judged to be human 54% of the time, that GPT-3.5 succeeded in 50% of conversations, and that ELIZA managed to hoodwink participants in 22% of chats. Real people beat the lot, and were judged to be human 67% of the time.
As well as reporting more impressive results, the recent study moves closer to the structure of the test first put forward by Turing: participants speak to a human and an AI simultaneously and decide which is which. As Turing explained in the original 1950 paper:
“It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A."”
In Turing’s version of the test, instead of determining whether participant A or B is a man or a woman, the judge must decide whether the writer is a person or a machine. This three-person structure is usually ignored in favour of a simpler two-person approach, though it was faithfully replicated in the new study.
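To make the structural difference concrete, here is a minimal sketch in Python. Everything in it is hypothetical scaffolding invented for illustration (the ask_human and ask_machine stand-ins and the coin-flipping judge); it shows only the shape of the two protocols, not any real experiment.

```python
import random

# Hypothetical stand-ins for the hidden witnesses; a real experiment
# would route questions to a person and to a model behind a chat interface.
def ask_human(question: str) -> str:
    return "..."

def ask_machine(question: str) -> str:
    return "..."

def two_player_round(question: str) -> bool:
    """Simpler variant: the judge chats with ONE witness and must say
    whether it is a person or a machine."""
    identity, respond = random.choice([("human", ask_human),
                                       ("machine", ask_machine)])
    _ = respond(question)                          # judge reads the reply...
    verdict = random.choice(("human", "machine"))  # ...and (here) just guesses
    return verdict == identity                     # was the judge right?

def three_player_round(question: str) -> bool:
    """Turing's structure: the judge questions BOTH witnesses at once,
    knowing them only as X and Y, then says which one is the human."""
    witnesses = [("human", ask_human), ("machine", ask_machine)]
    random.shuffle(witnesses)                      # hide who is X and who is Y
    replies = {label: respond(question)            # judge sees both replies
               for label, (_, respond) in zip(("X", "Y"), witnesses)}
    guess = random.choice(("X", "Y"))              # a coin-flip judge, for now
    return dict(zip(("X", "Y"), witnesses))[guess][0] == "human"
```

In the two-player round the judge makes an absolute call about a single witness; in the three-player round the machine only has to be more convincing than the human sitting beside it, which is the comparison the new study restores.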
But to take a step back, what do we think a game about whether a man could stand in for a woman (or vice versa) is actually testing? And what do we think that means for the version of the game involving a machine? Turing gives us a clue:
“The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.”
The test wasn’t designed to answer the question of whether machines can think (one doesn’t make a test to answer a meaningless question). But, just like the gender imitation game, the test must be passed in a way that prevents a third-party observer from telling the difference between the participants. It’s about the rhetoric of intelligence, not the substance of it.
In an exchange used to illustrate how we might catch a machine out, Turing describes a back and forth in which the judge asks whether an agent could play chess (it says yes) or write a sonnet (it says no). The implication, of course, is that any sufficiently intelligent machine would be capable of engaging in ‘creative’ pursuits (apologies to all the chess players out there).
The final aspect of note is the type of machine that Turing believes will be entangled with intelligence in the future. As he writes towards the end of the paper: “instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's?”
So we have a thought experiment that seeks to set the conditions under which someone could call machines intelligent, complete with explicit links to gender, to learning machines, and to creative pursuits as essential markers of intelligence. Taken in the round, these elements puncture the two most common interpretations of the imitation game.
First, the ‘reductionist’ view, which holds that the Turing test was developed to measure intelligence. This idea is popular with some AI practitioners, who see the test as a soluble target that should inform research. In this version, intelligence can be directly measured and passing the test is a meaningful benchmark.
Next up is the ‘constructionist’ interpretation, which holds that the test itself creates a certain type of intelligence through its design and implementation. In other words, the test actively shapes our understanding of AI rather than passively measuring it.
Both interpretations buy into the idea that the test was formulated on the basis that it could, and should, be implemented in the real world. But that isn’t the case. As Bernardo Gonçalves suggests in The Turing Test Argument, we can’t escape the context in which the paper was written: Turing’s debates with physicist Douglas Hartree, philosopher Michael Polanyi, and neurosurgeon Geoffrey Jefferson.
The essence of the clash is simple. Turing believed that thinking machines would eventually outstrip all of the cognitive abilities of humans, while the others thought otherwise.
University of Cambridge mathematician Douglas Hartree argued that computers would always be calculation engines incapable of acting in creative or unexpected ways. To make his case in his 1950 book Calculating Instruments and Machines, Hartree cited Ada Lovelace's view that computers can only do what they are programmed to do: ‘The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.’
So, an intelligent machine must be capable of performing tasks that it has not been specifically programmed to do. Turing agreed, which is why he chose to connect his test with a ‘child-machine’, or what he called the ‘unorganised machine’, that could learn from experience.
Probably Turing’s most respected critic was neurosurgeon Geoffrey Jefferson, who set stringent criteria for machine intelligence that emphasised creativity. As The Times reported in 1949, he commented that ‘Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain — that is, not only write it but know that it had written it.’
Responding in the same newspaper the next day, Turing, in typically cutting fashion, told the reporter ‘I do not think you can even draw the line about sonnets, though the comparison is perhaps a little bit unfair because a sonnet written by a machine will be better appreciated by another machine’. As we saw, Turing would go on to incorporate the idea of a machine writing a sonnet and being questioned about it in his imitation game.
Jefferson also argued that hormones were crucial for producing facets of behaviour that machines could not replicate. In one example he said that, were it possible to create a mechanical replica of a tortoise, another tortoise ‘would quickly find it a puzzling companion and a disappointing mate.’
This relationship between sex and intelligence was the motivating factor in Turing's decision to include gender imitation as part of his test: the gender game challenges the idea that certain modes of behaviour depend on physiological conditions.
The final element of the debate that Turing responded to was from Hungarian-British polymath Michael Polanyi, who argued that human intelligence involves tacit knowledge that cannot be fully formalised or replicated by machines.
He was unimpressed by Turing's one-time use of chess as a marker of machine intelligence, arguing that chess could be played automatically because its rules can be neatly specified (an idea we circled in AI Histories #8). This led Turing to reconsider chess as the primary task for demonstrating machine intelligence, replacing it with conversation to better capture the breadth of human cognitive ability.
What is the Turing test?
The Turing test is at its core an argument, one designed to counter his opponents’ views about the nature of machine intelligence. This is why Turing designed his imitation game to address the following aspects:
It focused on learning and adaptability, countering Hartree's view of computers as calculation engines.
It addressed Jefferson's demands for human-like creative abilities by incorporating language tasks like composing sonnets.
It was based on gender imitation with the goal of challenging Jefferson's views on the link between physiology and behaviour.
It used fluid conversation rather than rule-based games like chess to address Polanyi's concerns about formalisability.
Turing was responding to critics who thought that machines would never match human cognitive ability and that genuine artificial intelligence was a non-starter.
In this sense the Turing test is a trap. At the point at which we can’t tell the difference between machine poetry and the real deal, any argument that machines are incapable of artistic output becomes impossible to sustain. This is why the primary goal of the Turing test is to formulate the conditions under which someone could call machines intelligent.
But that’s not how we remember it. The space between thought experiment and practical experiment has long since collapsed under the weight of its own cleverness. Its animating idea has been recycled so thoroughly that it became divorced from its original context, eventually turning the imitation game into a summit for researchers to climb and an open goal for philosophers to shoot at.
That today’s models pass the test is interesting in its own right. But it doesn’t mean that a longstanding benchmark has been cleared, or that satisfying the test is a meaningful marker on the road to machines smarter than you or me.