<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Learning From Examples: AI Histories ]]></title><description><![CDATA[Short stories about the history of thinking machines. ]]></description><link>https://www.learningfromexamples.com/s/ai-histories</link><image><url>https://substackcdn.com/image/fetch/$s_!S1Kl!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04935df3-9e78-4564-b881-67a57b0ad87f_1024x1024.png</url><title>Learning From Examples: AI Histories </title><link>https://www.learningfromexamples.com/s/ai-histories</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 10:25:53 GMT</lastBuildDate><atom:link href="https://www.learningfromexamples.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Harry Law]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[learningfromexamples@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[learningfromexamples@substack.com]]></itunes:email><itunes:name><![CDATA[Harry Law]]></itunes:name></itunes:owner><itunes:author><![CDATA[Harry Law]]></itunes:author><googleplay:owner><![CDATA[learningfromexamples@substack.com]]></googleplay:owner><googleplay:email><![CDATA[learningfromexamples@substack.com]]></googleplay:email><googleplay:author><![CDATA[Harry Law]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Bell Labs' last trick]]></title><description><![CDATA[AI Histories #20: Support Vector Machines]]></description><link>https://www.learningfromexamples.com/p/bell-labs-last-trick</link><guid 
isPermaLink="false">https://www.learningfromexamples.com/p/bell-labs-last-trick</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 09 Oct 2025 10:26:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nmQ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nmQ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nmQ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 424w, https://substackcdn.com/image/fetch/$s_!nmQ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 848w, https://substackcdn.com/image/fetch/$s_!nmQ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 1272w, https://substackcdn.com/image/fetch/$s_!nmQ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nmQ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png" width="1456" height="1115" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1115,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7198788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/175507730?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nmQ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 424w, https://substackcdn.com/image/fetch/$s_!nmQ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 848w, https://substackcdn.com/image/fetch/$s_!nmQ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nmQ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae5482c-7947-4846-a49b-6139eeeadf7f_2246x1720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The Juggler by Remedios Varo (1956)</figcaption></figure></div><p>In March 1995 in Holmdel, New Jersey, three men put pen to paper on a wager. Larry Jackel speculated that the inner workings of neural networks would be revealed by the year 2000. 
The Soviet mathematical theorist Vladimir Vapnik forecast that &#8216;no one in his right mind will use neural nets that are essentially like those used in 1995.&#8217; Yann LeCun provided the signature on both bets as an &#8216;official&#8217; observer. </p><p>All three worked at the Adaptive Systems Research Department, an influential machine learning outfit at Bell Labs. As for the bets, received wisdom holds that Jackel and Vapnik were both wrong. The millennium came and went, while a comprehensive understanding of the internal processes of artificial neural networks continued to elude researchers. </p><p>Today, we know that it didn&#8217;t matter. Neural networks are king, and we still don&#8217;t really understand everything about how they work. Neural networks had made huge strides since the days of the single-layer perceptron (<a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">AI Histories #7</a>), the emergence of the Hopfield network (<a href="https://www.learningfromexamples.com/p/the-great-hopfield-network-debate">AI Histories #3</a>), and the popularisation of backpropagation in the 1980s (<a href="https://www.learningfromexamples.com/p/backpropagation-is-older-than-you">AI Histories #6</a>), but their stratospheric rise was no sure thing in the 1990s.  </p><p>One promising alternative was the support vector machine (SVM). Developed by Isabelle Guyon, Bernhard Boser, and Vladimir Vapnik in the early years of the 1990s, the system promised a way to identify a reliable boundary between categories that could generalise well to new data. </p><p>Where neural networks learn by trial and adjustment, SVMs solve for the single optimal boundary from the start. Where the former stresses flexibility and scale, the latter is focused on precision, stability, and mathematical guarantees. 
It was an attractive combination, one that offered a mixture of reliability and interpretability at a moment when most learning systems were highly opaque and unstable.  </p><h3>From Russia with love</h3><p>Vladimir Vapnik was a Soviet statistician whose relocation to Bell Labs in 1991 brought statistical learning theory into contact with American engineering practice. Born in 1936 in Tashkent, then part of the Uzbek Soviet Socialist Republic, Vapnik was educated at Uzbek State University before studying under the cybernetician Aleksandr Lerner at the Moscow Institute of Control Sciences. </p><p>He entered Bell Labs as a mature scholar whose intellectual formation had taken place almost entirely within the Soviet academy. At the Moscow Institute of Control Sciences, his collaborations with Alexey Chervonenkis had produced Vapnik&#8211;Chervonenkis (VC) theory, a mathematical framework for analysing the conditions under which models generalise from sample data to unseen cases. </p><p>The Moscow Institute of Control Sciences, founded in 1939 under the auspices of the Soviet Academy of Sciences, was the country&#8217;s principal centre for research in cybernetics, automation, and systems theory. By the 1960s it had become a hub for work on &#8220;automatic recognition,&#8221; the effort to design algorithms that could classify signals and images without human supervision. As Emmanuil Braverman, one of the researchers in the group, put it in 1966, its staff were concerned with the &#8220;problem of teaching the machine image recognition without teacher&#8221;. 
</p><p>The <em>capacity</em> of a model refers to its ability to fit a variety of functions, with a high-capacity model capable of fitting complex patterns and a low-capacity model better suited to fitting simpler patterns. The <em>VC dimension</em> measures capacity by identifying the largest number of points that a model can &#8216;shatter&#8217;, or classify according to every possible way of labelling them. Despite the knotty mathematics involved, the central insight behind VC theory was a simple one: a model&#8217;s ability to generalise is related to its complexity, not just its performance on training data. </p><p>In this framing, capacity control pointed towards a conception of machine learning as a principled act of discovering the right predicates, such as invariances, symmetries, or structural constraints. It was a philosophy that set Vapnik&#8217;s work apart from the empirical pragmatism of American engineering, a school of thought that emphasised the adjustment of algorithms and the accumulation of heuristics rather than a search for universal principles.</p><p>Vapnik distrusted the interpretive shortcuts so common in applied work, later likening them to Antonie van Leeuwenhoek&#8217;s descriptions of blood cells as &#8216;armies&#8217; fighting under a microscope: &#8216;He [Leeuwenhoek] saw something. Yes. But he gave wrong interpretation.&#8217; In Vapnik&#8217;s eyes, only mathematics could disclose reality without distortion. In this sense, he carried into Holmdel the last great geometric faith of the Soviet control sciences. It was austere, axiomatic, and estranged from American pragmatism, but waiting to be made real.</p><h3>Towards support vector machines</h3><p>At its core, an SVM is a type of machine learning model that works by drawing the best possible line (or &#8216;hyperplane&#8217;) between different groups of data points. Support vector machines do not consist of multiple layers or nodes. 
And they do not rely on the process of &#8216;non-convex optimisation&#8217; through which a neural network adjusts to fit the data, like finding the lowest point in a landscape with many hills and valleys. Instead, support vector machines find the boundary between groups of data in a manner analogous to finding the bottom of a single, smooth bowl. </p><p>For Vapnik, this geometric purity was the SVM&#8217;s defining virtue. Where neural networks relied on non-convex optimisation, SVMs posed a single convex problem with a unique global solution. In his later reflections, Vapnik was blunt about the contrast. With &#8220;deep learning,&#8221; he remarked, &#8220;they invent something. And then they try to prove advantage of that through interpretations, which [are] mostly wrong.&#8221;</p><p>By the early 1990s Bell Labs had its own settled approach to learning problems, one that flowed from a Western statistical tradition developed independently of the Soviet school. Where the Western tradition stressed empirical benchmarks and engineering heuristics, the Soviet school favoured axiomatic formalism and mathematically principled argument. </p><p>The Western tradition begins with Ronald Fisher in the 1930s, who showed how to classify data by drawing a line &#8212; a linear discriminant &#8212; that best separates groups. That gave the field its basic geometry: separating surfaces as a way to reason about data, which we discussed in <a href="https://www.learningfromexamples.com/p/discriminant-analysis">AI Histories #16</a>. After the war, Abraham Wald (the star of <a href="https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics">AI Histories #10</a>) reframed inference as decision-making under uncertainty. His &#8220;decision theory&#8221; treated statistics as a dynamic process of minimising risk. 
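</p><p>The &#8216;single, smooth bowl&#8217; picture above reflects the fact that the maximum-margin boundary is determined entirely by the training points nearest to it, the support vectors. In one dimension this can be sketched in a few lines of Python (an illustrative toy, not the general SVM algorithm, which solves a convex quadratic programme):</p>

```python
# Toy illustration of margin maximisation in one dimension.
# For separable 1-D data (negatives to the left of positives), the
# maximum-margin boundary is the midpoint between the closest pair of
# opposite-class points; those two points are the 'support vectors'.

def max_margin_boundary_1d(negatives, positives):
    """Return (boundary, margin, support_vectors) for separable 1-D data."""
    hi_neg = max(negatives)  # negative point nearest the gap
    lo_pos = min(positives)  # positive point nearest the gap
    if hi_neg >= lo_pos:
        raise ValueError("classes are not separable in this arrangement")
    boundary = (hi_neg + lo_pos) / 2  # midpoint maximises the margin
    margin = (lo_pos - hi_neg) / 2    # distance from boundary to either class
    return boundary, margin, (hi_neg, lo_pos)

boundary, margin, sv = max_margin_boundary_1d([0.0, 1.0], [4.0, 6.0])
# boundary 2.5, margin 1.5; only the points 1.0 and 4.0 matter
```

<p>Moving either support vector moves the boundary, while moving any other point leaves it untouched: this is the sense in which the solution is unique and stable.</p><p>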
</p><p>Through the 1960s&#8211;70s, this optimisation-driven approach was absorbed by engineers tackling pattern recognition and signal processing. Researchers like Thomas Cover and O.L. Mangasarian cast classification as a solvable optimisation problem. Duda and Hart&#8217;s famous 1973 <em>Pattern Classification and Scene Analysis</em> codified the Western field&#8217;s pragmatic style by laying out a toolbox of methods. </p><p>By the early 1990s, most of the technical contingencies were in place for the development of support vector machines: statistical learning theory and the VC dimension to control model capacity, optimal margin algorithms for finding decision boundaries, and methods to estimate relationships between variables based on random samples of data points. </p><p>When Vapnik arrived from Moscow in 1991, he entered a culture shaped by risk, optimisation, and empirical testing rather than axioms. It was a tradition that prized what worked over what could be proved, one that would later fuse with Soviet-style formalism through kernel methods &#8212; a class of algorithms that use kernel functions to operate in high-dimensional feature spaces without explicitly computing coordinates.</p><p>A kernel is a similarity function between two data points that satisfies certain mathematical properties, allowing it to represent inner products (functions that take two vectors and return a single number) in high dimensional space. In 1950 Polish-American mathematician Nachman Aronszajn established fundamental properties of reproducing kernels that would ultimately allow this mapping process to take place. </p><h3>The kernel trick </h3><p>By 1992, Vapnik was working with Isabelle Guyon to translate the abstractions of statistical learning theory into a classifier with measurable generalisation performance. 
The challenge was how to build a system that could learn from examples (clue klaxon) without either overfitting &#8212; memorising the training set so well that it failed on new cases &#8212; or underfitting, failing to learn enough to solve the task at all. </p><p>In essence, the problem was how to regulate the &#8220;capacity&#8221; of a model so that it captured just enough structure to generalise beyond its training examples. The kernel trick &#8212; rediscovered in Holmdel through the collaboration of Guyon, Vapnik, and Bernhard Boser &#8212; supplied the missing piece. It allowed the abstract guarantees of statistical learning theory to be embodied in a classifier that engineers could use, providing a bridge between the theoretical space of capacity control and the empirical world of pattern recognition. </p><p>A central idea behind the kernel trick is the concept of duality, which shows how different types of classifiers can be viewed as &#8216;dual representations&#8217; of the same decision function. The principle means that the same classification problem can be represented in &#8216;primal space&#8217; or in &#8216;dual space&#8217;. This idea is important insofar as it allows the algorithm to switch between primal space and dual space depending on whichever is more computationally efficient for a given problem. </p><p>The moment that support vector machines were developed looms large in ML mythology. It begins with Bernhard Boser&#8217;s decision to leave Bell Labs in 1991 for a position at UC Berkeley. Boser, a hardware designer, was unable to start a new project in the intervening months between concluding his work at Bell Labs and beginning a new position in California. 
Instead, he chose to implement an algorithm Vapnik had developed in the 1960s, which sought to find the best boundary that separates different groups of data points.</p><p>Once complete, Vapnik proposed making the algorithm &#8216;nonlinear&#8217; to enable the model to deal with distributed data points that cannot be separated well with a straight line. But where Vapnik advocated solving this problem using a &#8216;polynomial&#8217; approach, Guyon had a different idea. Instead of explicitly creating new polynomial features, Guyon proposed using the &#8216;kernel trick&#8217; based on work by Duda and Hart (and described independently by the trio of Aizerman, Braverman, and Rozonoer in Russia). </p><p>It was this approach that led to the emergence of the support vector machine as it is commonly understood today. Guyon, Boser, and Vapnik published details of the kernelised algorithm at the Fifth Annual Workshop on Computational Learning Theory (COLT &#8217;92). Reflecting in 2016 on the development of the support vector machine, Guyon described an initial hesitance on the part of Vapnik, owing to the potential function algorithm&#8217;s origins with a rival group at the Moscow Institute of Control Sciences: </p><blockquote><p>&#8220;After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the &#8216;potential function&#8217; algorithm. Vladimir initially resisted the idea because the inventors of the &#8216;potential functions&#8217; algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960&#8217;s in Russia! But Bernhard tried it anyways, and the SVMs were born!&#8221;</p></blockquote><p>The emergence of the support vector machine marked the culmination of decades of theoretical and practical advancements in pattern recognition, statistical learning theory, and optimisation techniques. 
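</p><p>The kernel trick itself can be checked numerically. A kernel computes an inner product in a higher-dimensional feature space without ever constructing that space; for the degree-2 polynomial kernel on two-dimensional inputs, the corresponding feature map can be written out by hand (an illustrative sketch with made-up data, not code from the COLT &#8217;92 paper):</p>

```python
# The kernel trick, verified numerically: the degree-2 polynomial kernel
# k(x, z) = (x . z)^2 equals an ordinary dot product after the explicit
# feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so the mapped space
# never has to be built.
import math

def kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces the kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 4.0)
assert math.isclose(kernel(x, z), dot(feature_map(x), feature_map(z)))
# both give (1*3 + 2*4)^2 = 121
```

<p>For higher polynomial degrees or radial basis kernels the implicit space grows enormous or infinite-dimensional, which is why computing the kernel directly, as Guyon proposed, is the efficient route.</p><p>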
At Bell Labs, the collaboration between Boser, Vapnik, and Guyon brought these disparate threads together. Boser&#8217;s implementation of Vapnik&#8217;s optimal margin algorithm provided a starting point, while Vapnik&#8217;s proposal to add nonlinearity sought to address the challenge of complex data distributions. </p><h3>Ways of learning</h3><p>What Boser, Vapnik, and Guyon achieved in Holmdel was the blending of two intellectual cultures. From the Soviet side came the abstractions of VC theory and structural risk minimisation, with their insistence on general principles and theoretical bounds. From the Western side came a tradition of pattern recognition rooted in empirical performance, approximation methods, and the willingness to bend mathematics to fit messy data. </p><p>The result was a machine in which the theoretical guarantees of convex optimisation and margin maximisation coexisted with the practical imperatives of implementation and performance. Within Bell Labs&#8217; institutional culture, this interaction demonstrated that ideas forged in the high formalism of the Soviet control sciences could be translated into efficient tools for American industry. </p><p>The support vector machine represented the point at which theory and practice, abstraction and application, converged in code in Bell Labs&#8217; Holmdel offices. Its development marked the closing chapter of a geometric conception of intelligence that had defined the twentieth century, one that imagined learning as the discovery of stable forms and separating surfaces in high-dimensional space.</p><p>This is why Vapnik wagered that artificial neural networks were a dead end. The wager, a bit of fun but entirely sincere, expressed divergent conceptions of what &#8220;learning&#8221; meant. For LeCun, intelligence was a matter of distributed adaptation: systems that adjusted their weights through experience until useful representations emerged. 
For Vapnik, it was an exercise in geometry and proof, which is why, he explained, his approach stressed finding and formalising axioms. </p><p>While deep learning eventually proved triumphant, the split made visible the moment when the field&#8217;s centre of gravity shifted from the geometric to the statistical, from global optima to local gradients, from the certainty of separability to the fluidity of representation. The bet is a hinge in the history of artificial intelligence, a moment that divided an older tradition of mathematical certainty from a new era defined by probabilistic depth and empirical abundance.</p>]]></content:encoded></item><item><title><![CDATA[The Forgotten Man]]></title><description><![CDATA[AI Histories #19: Nicolas Rashevsky]]></description><link>https://www.learningfromexamples.com/p/the-forgotten-man</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/the-forgotten-man</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 25 Sep 2025 10:25:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mabC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mabC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png"
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mabC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 424w, https://substackcdn.com/image/fetch/$s_!mabC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 848w, https://substackcdn.com/image/fetch/$s_!mabC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 1272w, https://substackcdn.com/image/fetch/$s_!mabC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mabC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png" width="1456" height="921" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1199411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/174085188?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mabC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 424w, https://substackcdn.com/image/fetch/$s_!mabC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 848w, https://substackcdn.com/image/fetch/$s_!mabC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 1272w, https://substackcdn.com/image/fetch/$s_!mabC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe319dbd9-fb56-4f2c-b4a4-b0c9856bc156_1568x992.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Nicolas Rashevsky via the University of Chicago Photographic Archive</figcaption></figure></div><p>Nicolas Rashevsky was born in Ukraine in 1899. He was the eldest son of a sugar-factory owner, and studied theoretical physics at St. Vladimir Imperial University (now Taras Shevchenko National University of Kyiv) before receiving his doctorate in 1919. By the time he finished his studies, the country he had grown up in was already gone.</p><p>Russia&#8217;s disastrous experience in the First World War had shaken the empire. The October Revolution of 1917 brought Lenin&#8217;s Bolsheviks to power, and by 1918 the country was embroiled in a brutal civil war. On one side stood the Reds, the new Soviet regime and its Red Army. 
Arrayed against them was a loose coalition known as the Whites made up of monarchists, liberals, Cossack hosts, and other anti-Bolshevik groups. Rashevsky served with the Whites, who were eventually defeated when their last southern stronghold in Crimea collapsed in November 1920.</p><p>Faced with few good options, the remnants of the Whites evacuated across the Black Sea to Constantinople. Within a year, Rashevsky was teaching physics and mathematics at the city&#8217;s Robert College before making the trip west to take up a position at the Russian University in Prague in 1921. After three years in Prague, he emigrated to the United States in 1924 to work at Westinghouse Research Laboratories in Pittsburgh where he <a href="https://link.springer.com/article/10.1023/B:HIST.0000038267.09413.0d#:~:text=,%E2%80%9D%20Physical%20Review%2031%3A%20115%E2%80%93118">published</a> on colloidal particles and the physical chemistry of cell division. </p><p>In 1934, he was invited to take up a position in the physiology department at the University of Chicago. It was a good fit for Rashevsky, who <a href="https://pdodds.w3.uvm.edu/files/papers/others/2004/abraham2004a.pdf#:~:text=exciting%20substance%2C%20which%20had%20to,is%20responsible%20for%20excita%02tion%20but">developed</a> the first mathematical description of nerve excitation in 1933 before formalising a mathematical model of the neuron in <em>Mathematical Biophysics </em>in 1938. By this time the idea that neurons were discrete cells communicating via specialised junctions was widely accepted in the world of neurophysiology. As we saw all the way back in <a href="https://www.learningfromexamples.com/p/the-neuron-doctrine">AI Histories #1</a>, that development was largely thanks to the work of Spanish histologist Ramon y Cajal. 
</p><p>While Rashevsky did not cite Cajal in his work, he did take Cajal&#8217;s findings &#8212; that individual neurons exist and interact at synapses &#8212; as the basis for representing neural circuits. Rashevsky took for granted the anatomical foundation provided by the Spaniard: that a neuron was a cell body with an axon, dendrites, and synapses transmitting impulses unidirectionally.</p><h3>Mathematical biology </h3><p>In 1939 Rashevsky founded the <em>Bulletin of Mathematical Biophysics</em>, the first international journal devoted to mathematical biology. In the early issues, many papers were written by Rashevsky himself and his close collaborators on topics ranging from neuron models to cell metabolism to population dynamics. The most famous of these was McCulloch and Pitts&#8217;s &#8216;A Logical Calculus of the Ideas Immanent in Nervous Activity&#8217;, an essential paper in the AI canon and the subject of <a href="https://www.learningfromexamples.com/p/father-figures">AI Histories #17</a>. </p><p>Historian Roberto Cordeschi <a href="https://link.springer.com/chapter/10.1007/978-94-015-9870-5_6?error=cookies_not_supported&amp;code=7f77598c-cf60-45bf-8e68-fdc1ef976230#:~:text=in%20previous%20chapters,computers%20since%20the%20early%201950s">explains</a> the relationship between Rashevsky&#8217;s earlier work and the McCulloch-Pitts paper: &#8220;Rashevsky had tried, in his 1938 Mathematical Biophysics, to analyze neural phenomena mathematically. In 1943, McCulloch and Pitts introduced Boolean algebra to describe nets of formal neurons.&#8221;</p><p>Rashevsky&#8217;s neuron was written in the language of physics, through coupled differential equations for abstract &#8216;excitatory&#8217; and &#8216;inhibitory&#8217; variables that rose and fell over time. To know whether a model neuron would fire, you had to work through those equations step by step and track changes in continuous variables. The appeal of the McCulloch&#8211;Pitts version was its simplicity. 
Instead of wrestling with changing quantities, they reduced the problem to a rule: if the inputs cross a threshold, the neuron fires; if not, it stays silent. </p><p>Rashevsky&#8217;s style left him stranded between two camps. To most biologists, his equations looked too abstract and too far removed from experimental life. To most mathematicians and logicians, his differential equations &#8212; formulas tracking how quantities change step by step over time &#8212; looked too messy. His neurons lived in the <em>analogue</em> world of continuous change, not the <em>logical</em> universe of on/off switches described by the McCulloch-Pitts model.</p><p>For this reason, many of the influential later neural network formalisms can be traced more directly to McCulloch and Pitts than to Rashevsky. Frank Rosenblatt&#8217;s perceptron (<a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">AI Histories #7</a>) was essentially a network of McCulloch-Pitts neurons with adjustable weights and a learning rule. So when Marvin Minsky and Seymour Papert put the boot into neural networks in 1969, they talked about perceptrons as binary threshold units rather than continuous, analogue models favoured by Rashevsky. </p><p>That said, today&#8217;s neural networks have crept back towards Rashevsky&#8217;s way of thinking. Instead of only treating neurons as simple on/off switches, some modern models describe how activity flows continuously over time. These developments were not directly inspired by Rashevsky &#8212; they came from control theory and physics as in <a href="https://www.learningfromexamples.com/p/backpropagation-is-older-than-you">AI Histories #6</a> &#8212; but in a way they vindicate Rashevsky&#8217;s intuition that continuous dynamics are fundamental to understanding neural computation. </p><p>Rashevsky&#8217;s work didn&#8217;t feature in the emerging computer science-oriented AI stream, but its conceptual legacy persisted. 
The notion of treating the brain as a network that can be quantitatively analysed is something that AI inherited from Rashevsky and the others who followed his lead. </p><p>The Russian&#8217;s career shows us that AI&#8217;s origins don&#8217;t just run through logic and computing. By writing down the first equations of neural activity, he opened the possibility of treating the brain as a system that could be formalised, analysed, and perhaps replicated. McCulloch and Pitts made the idea simple and ultimately portable, but Rashevsky made it conceivable in the first place. If the history of AI is usually told as a story of mathematicians and engineers, Rashevsky encourages us to consider whether it was also a story of physicists and biologists trying to translate the dynamics of living systems into mathematics. </p><p>Sometimes it&#8217;s simply a matter of who gets the credit. Pitts stayed close to Rashevsky&#8217;s private circle, but the AI field at large credited the younger man&#8217;s paper as foundational for both its <a href="https://www.learningfromexamples.com/p/an-introduction-to-ai-history">symbolic and connectionist schools</a>. In the long run, Rashevsky&#8217;s contributions were folded into the background while recognition for launching the AI project went to his collaborators. </p><p>Despite this, some historians have argued for Rashevsky&#8217;s inclusion in the prehistory of AI and cognitive science. Jonnie Penn <a href="https://www.repository.cam.ac.uk/bitstreams/dadca100-b29a-4dc8-8a76-81b1e6fe18f5/download">said</a> his work &#8220;informed the origins of cognitive science in the 1950s&#8221;. Tara Abraham&#8217;s <a href="https://pdodds.w3.uvm.edu/files/papers/others/2004/abraham2004a.pdf#:~:text=much%20effort,g">work</a> re-evaluated Rashevsky&#8217;s contributions and reasons for marginalisation from the biology community, which she said followed from the fact that he had &#8220;little contact with empirical biological research&#8221;. 
And Gualtiero Piccinini and Sonya Bahar <a href="https://onlinelibrary.wiley.com/doi/10.1111/cogs.12012?utm_source">argue</a> that &#8220;The mathematical modeling of neural processes can be traced back to the mathematical biophysics pioneered by Nicolas Rashevsky&#8221;. </p><p>Acknowledging Rashevsky enriches our appreciation of AI&#8217;s pre-history. It underscores that the quest to make mind mathematical did not start with Turing or von Neumann or Pitts and McCulloch. Rashevsky, like Ramon y Cajal before him, exists as a representative of one of AI&#8217;s many past lives. Including Rashevsky in this lineage reminds us that AI&#8217;s conceptual foundations were being laid well before the dawn of the computer age.  </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive the next edition of AI Histories</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Sorcerer and cell ]]></title><description><![CDATA[AI Histories #18: The Lighthill report]]></description><link>https://www.learningfromexamples.com/p/sorcerer-and-cell</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/sorcerer-and-cell</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 11 Sep 2025 10:36:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wYwz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wYwz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wYwz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wYwz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!wYwz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wYwz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wYwz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg" width="1200" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;undefined&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="undefined" title="undefined" srcset="https://substackcdn.com/image/fetch/$s_!wYwz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wYwz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!wYwz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wYwz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf6ce757-43c7-4c32-b080-d156b26f81b8_1200x678.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>A Friday Evening Discourse at the Royal Institution; Sir James Dewar on Liquid Hydrogen</em> by Henry 
Jamyn Brooks (1904)</figcaption></figure></div><p>The Royal Institution knows how to put on a show. In 1839, Michael Faraday used the venue to introduce dazzled researchers to early photographic techniques. Over fifty years later, J. J. Thomson told London&#8217;s great and good about the electron. Around the same time, Sir James Dewar showed off liquid hydrogen in a particularly eye-catching demonstration. </p><p>Historically, the events of the Royal Institution expanded horizons. They showed what science could do right now and what it might be able to do in the future. Humphry Davy&#8217;s public demonstrations of nitrous oxide thrilled audiences by proving that gases affect the mind and body. Faraday&#8217;s famous rotating wire experiment amazed onlookers by showing that electricity could drive mechanical motion.</p><p>In June 1973, the Royal Institution held a different kind of event. Certainly it was a spectacle, but one in which the presenter insisted on the limits of science rather than its possibilities. That person was Sir James Lighthill of the University of Cambridge, a celebrated British scientist who pioneered work in applied mathematics. His subject was AI, a field which he believed had failed to deliver on its promises. </p><p>Against Lighthill sat three challengers: the roboticist Donald Michie, psychologist Richard Gregory, and AI grandee John McCarthy (who last appeared in <a href="https://www.learningfromexamples.com/p/a-mysterious-science">AI Histories #15</a>). The debate was hosted by the BBC, with hundreds more in the audience. </p><p>Lighthill opened the proceedings by making a distinction between automation &#8212; defined as the use of any machine to conduct human work &#8212; and &#8216;automatic devices that could substitute for a human being over a wide range of human activities&#8217;. He said the latter group was called &#8216;general purpose robots&#8217;, but these things were regrettably a &#8216;mirage&#8217;. 
</p><p>He compared the &#8216;AI scientist in the lab&#8217; to the &#8216;sorcerer in his cell&#8217;. In his view, both dealt in theatre that captured the public imagination without much to show for it. The comparison makes a certain kind of sense when you understand Lighthill&#8217;s beliefs about what science was for. As the historian Jon Agar <a href="https://www.cambridge.org/core/journals/british-journal-for-the-history-of-science/article/abs/what-is-science-for-the-lighthill-report-on-artificial-intelligence-reinterpreted/61B13B32988D6A8C58CF8AADD4777789">puts it</a>: &#8216;behind James Lighthill's criticisms of a central part of artificial intelligence was a principle he held throughout his career &#8211; that the best research was tightly coupled to practical problem solving&#8217;. </p><p>Lighthill said the great breakthroughs of the computing age belonged to automation, which he called the preserve of &#8216;feedback control systems that act to reduce some change in quantity from its desired value&#8217;. Then he went on to say that all computers, and by extension all AI systems, are things that &#8216;manipulate symbols according to rules prescribed in a program&#8217;. </p><p>These are curious distinctions that don&#8217;t make much sense to the modern reader. We know that the foundational technology of today&#8217;s AI project is the neural network, a system whose power comes from minimising the loss between predicted and expected values. For the most part, these systems do not manipulate symbols according to some set of rules (though they <a href="https://www.learningfromexamples.com/p/academics-need-to-take-ai-seriously">can operate</a> symbolic tools like a calculator).</p><p>Lighthill had in mind a very specific type of thinking machine, one that was usually embodied, based on hard-coded rules, and ultimately alluring yet brittle. 
He argued that, just like industrial automation, the strands of research that would eventually culminate in today&#8217;s large models weren&#8217;t true examples of artificial intelligence. As the old saying goes, it isn't AI if it works. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Talking heads</h3><p>Like many of the clashes over thinking machines today, the debate wasn&#8217;t really about AI at all. The most popular topics of discussion were human intelligence, the nature of the brain, and the extent to which there are bottlenecks that prevent researchers from replicating human capacities in silicon. Regrettably, the debate took a somewhat circular turn from the outset. </p><p>Lighthill said that the brain can&#8217;t be replicated because it&#8217;s too complex. McCarthy countered that its function certainly can. Gregory said artificial neural networks aren&#8217;t representative of contemporary neuroscience research but we should study them anyway. Michie further muddied the definition of &#8216;robot&#8217; and then talked at length about the Freddy II system from his University of Edinburgh research group. </p><p>At one point, the host said he didn&#8217;t understand the issue and the researchers nodded along as if in agreement. The whole thing was a mess, frankly, and in the end the audience left more confused than when they arrived. Say what you will about modern television, but science communication has made great progress over the last half century. 
</p><p>The BBC&#8217;s programme was meant to be a moment for the scientific community to respond to the publication of the <a href="https://rodsmith.nz/wp-content/uploads/Lighthill_1973_Report.pdf">Lighthill report</a>, a piece of work commissioned by the UK&#8217;s Science Research Council in 1972 to take stock of UK AI research. The council was having difficulty assessing AI grant proposals and there were concerns that some projects were overly narrow or overly ambitious. </p><p>Lighthill was engaged to help. After spending a couple of months reading AI literature and consulting researchers, he delivered his report <em>Artificial Intelligence: A General Survey</em> in March 1973. In it, he outlined three categories of AI:</p><ul><li><p><strong>&#8216;A - Advanced Automation&#8217;:</strong> This covered AI work with clear practical objectives like character or speech recognition, machine translation, and automated theorem proving. Lighthill acknowledged real, if modest, progress &#8212; but cautioned that successes were confined to toy problems. </p></li><li><p><strong>&#8216;B - Building Robots&#8217;:</strong> The attempt to create general-purpose machines that integrate perception, cognition, and action (often embodied in robots). Lighthill saw this category as a &#8216;bridge&#8217; between the others, though McCarthy disputed this reading in the debate. He agreed, however, that projects of this type were trying to achieve the vision of an AI that could perform many different tasks. </p></li><li><p><strong>&#8216;C - Computer-Based Central Nervous System Studies&#8217;:</strong> The use of computers to simulate and study neurobiology and psychology, like using neural networks to model parts of the brain. Here too he noted some progress and endorsed continued work in the area, but only insofar as machines could tell us about the nature of cognition. 
</p></li></ul><p>The middle category, which is closest to modern conceptions of AI, bore the brunt of Lighthill&#8217;s criticism. In his report, he wrote that &#8220;Progress in category B [Building Robots] has been even slower and more discouraging&#8221;. A few pages later, he quipped that &#8220;AI not only fails to take the first fence but ignores the rest of the steeplechase altogether.&#8221;</p><p>Michie, who also gave a written response to the report, questioned Lighthill&#8217;s methodology and impartiality. Did he intentionally consult sceptical experts? Could someone outside the field fairly judge its worth? And how could Lighthill possibly be so confident about AI&#8217;s future prospects?</p><p>These were fair questions, but in the end the Science Research Council sided with Lighthill&#8217;s assessment. Funding for AI research in Britain was severely cut and many of the organised AI programmes that had existed were scrapped. The Edinburgh AI laboratory, which under Michie had been one of the world&#8217;s leading AI centres, saw its support plummet. As one retrospective <a href="https://spectrum.ieee.org/freddy-robot-british-ai-winter#:~:text=Despite%20international%20support%20from%20the,the%20field%20for%20a%20decade">put it</a>, the once bustling lab was reduced to &#8220;just Michie, a technician, and an administrative assistant&#8221;.  </p><p>The report was widely circulated and discussed internationally. In the United States, around the same time, DARPA (the main US defence research funder for AI) was undergoing its own shift. In 1974, partly due to new federal directives and disappointment with certain AI projects, DARPA started applying tighter scrutiny to AI research. It eventually published a Lighthill-style report of its own that drew similar conclusions.  </p><p>But despite the rhetoric, the global AI research community actually continued to grow in the 1970s. 
The historian Thomas Haigh pointed out that if one looks at metrics like the number of active researchers, conference participation, and publications, interest in AI kept increasing in the wake of the Lighthill report and subsequent allusions to the first &#8216;AI winter&#8217;. </p><p>Lighthill&#8217;s focus was largely on the symbolic approach to AI development that relies on explicit symbols, logic, and rules to represent knowledge and solve problems (discussed in more detail in <a href="https://www.learningfromexamples.com/p/does-ai-begin-with-aristotle">AI Histories #9</a>). In the 1960s and early 1970s, symbolic AI encompassed areas like rule-based reasoning, search algorithms, logic and theorem proving, structured knowledge representation, and even early robotics and natural language processing. </p><p>Against it stood connectionism, the ancestor of modern deep learning in which networks of individual units learn from data. Connectionism was already facing a challenging time of its own, after Marvin Minsky and Seymour Papert published their famous takedown of the paradigm in 1969 (the subject of <a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">AI Histories #7</a>). </p><p>In the report, there was little mention of connectionist approaches because the paradigm wasn&#8217;t prominent in the UK AI scene at the time. Artificial neural networks (probably the most famous incarnation of connectionism) appear in Lighthill&#8217;s discussion as tools for brain modelling in the central nervous system category &#8212; but not in the category that deals with AI in a way that we might understand it today. </p><p>It&#8217;s a useful distinction for helping us to understand the legacy of the Lighthill report. Connectionism was already under attack and symbolic AI methods had now joined it in the firing line. 
What the report really captured was the limits of a single paradigm, one that would eventually be sidelined when neural networks re-emerged to solve many of the problems Lighthill thought insurmountable. </p><p>In that sense, his comparison of the AI scientist to the sorcerer in his cell wasn&#8217;t entirely misplaced. Symbolic AI did produce persuasive but brittle demonstrations that resembled a magic trick. And connectionism, when it eventually displaced symbolic methods, <a href="https://www.learningfromexamples.com/p/the-economy-of-magic">had its own reputation</a> for alchemy fed by critics and boosters alike. </p><p>What Lighthill missed was that science sometimes advances through sorcery, that alchemy was less a dead end and more a transitional practice. The systems of the 1960s and 1970s may not have been general-purpose, but their success in toy environments did inspire a generation of researchers to enter the field. Today&#8217;s deep learning systems are <a href="https://www.learningfromexamples.com/p/an-introduction-to-ai-history">not an offshoot</a> of early rule-based research, but the magic of symbolic demos &#8212; a &#8216;mirage&#8217; as Lighthill put it &#8212; suggested that the problems of the AI project could eventually be solved. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Father figures]]></title><description><![CDATA[AI Histories #17: McCulloch and Pitts' experimental epistemology]]></description><link>https://www.learningfromexamples.com/p/father-figures</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/father-figures</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 28 Aug 2025 10:25:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CxOT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CxOT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CxOT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 424w, https://substackcdn.com/image/fetch/$s_!CxOT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 848w, 
https://substackcdn.com/image/fetch/$s_!CxOT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 1272w, https://substackcdn.com/image/fetch/$s_!CxOT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CxOT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png" width="1314" height="992" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:992,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:676949,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/171728403?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73f1bca-6406-4b96-aab7-7b5e29d7af54_1314x1044.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CxOT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 424w, 
https://substackcdn.com/image/fetch/$s_!CxOT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 848w, https://substackcdn.com/image/fetch/$s_!CxOT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 1272w, https://substackcdn.com/image/fetch/$s_!CxOT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54e4f785-9b3f-4e6f-8a18-4f19d5aa8b99_1314x992.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo of Warren McCulloch</figcaption></figure></div><p>&#8220;If any love magic, he is most impious:<br>Him I cut off, who turn his world to straw&#8221;<br><strong>&#8212; Fragment of a poem written by Walter Pitts to Warren McCulloch</strong> </p><p>Walter Pitts was born in Detroit in 1923. His father was a boiler-maker, and by all accounts a violent man who pressured the young Pitts to pack in his studies and get a job. Defying his father, Pitts spent his free time at the local library, where he read widely about mathematics, science, philosophy, and history. </p><p>He read Bertrand Russell&#8217;s <em>Principia Mathematica</em>, found mistakes, and wrote to the Welsh mathematician to point them out. According to later tellings, Russell was impressed and even invited Pitts to journey to Cambridge to study with him. Still a twelve-year-old boy at this point, Pitts was glad to receive the offer but turned it down on account of his age. </p><p>But three years later, when he heard that Russell would be visiting the University of Chicago, the fifteen-year-old ran away from home and headed for Illinois. He landed in Chicago, where he supported himself with menial jobs while attending what lectures he could. </p><p>Pitts never enrolled but attached himself to the orbit of Chicago&#8217;s intellectuals, publishing his first paper in 1942 when he was eighteen. His &#8216;Some Observations on the Simple Neuron Circuit&#8217; <a href="https://link.springer.com/article/10.1007/BF02477942">appeared</a> in the <em>Bulletin of Mathematical Biophysics</em>, which was the main venue for early attempts to model biological and cognitive processes with mathematics. </p><p>The journal was headed by Nicolas Rashevsky, a Russian-born researcher best known for work rendering neurons in the language of mathematics. 
Rashevsky vouched for Pitts and allowed him to publish under the University of Chicago banner, despite the fact that Pitts had no formal ties to the institution. </p><p>Warren McCulloch lived a very different life, born a generation earlier into an East Coast family of lawyers, theologians, and doctors. McCulloch studied mathematics at Haverford, philosophy and psychology at Yale, then medicine at Columbia. By the 1940s he was working as a neuropsychiatrist in Chicago. He wrote poetry, smoked heavily, and liked staying up past four in the morning with whiskey and ice cream. </p><p>The two men met in 1942 through Jerome Lettvin, a medical student whom Pitts had got to know at one of Russell&#8217;s lectures at the university. Pitts was eighteen and McCulloch forty-three. They recognised each other immediately through a shared enthusiasm for Gottfried Leibniz, the 17th-century philosopher who wondered whether human thought could be represented by an alphabet composed of signs and symbols. </p><p>McCulloch, who was looking for a mechanical account of mind, had been trying to model neurons in a kind of Leibnizian language but lacked the mathematical prowess to do so. In Pitts, he saw someone who might be able to help, and invited him to live with his family in the Hinsdale suburb of Chicago. For Pitts it became a surrogate home, a place that he would remember fondly for the rest of his life. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Logical calculus </h3><p>Pitts and McCulloch wanted to use Leibniz&#8217;s calculus of thought as the basis for understanding neural activity. 
In 1943 the pair published their joint <a href="https://www.cs.cmu.edu/~epxing/Class/10715/reading/McCulloch.and.Pitts.pdf">paper</a>, &#8216;A Logical Calculus of the Ideas Immanent in Nervous Activity,&#8217; in Rashevsky&#8217;s <em>Bulletin of Mathematical Biophysics</em>.</p><p>They proposed a simple model in which each neuron acted as a binary unit, firing if its inputs crossed a threshold (or remaining silent otherwise). By connecting these units together in different ways, they demonstrated how some basic logical operations Leibniz had described &#8212; AND, OR, and NOT &#8212; could be carried out by networks of neurons. From this starting point they argued that more complex statements could be built, and that any proposition in logic could, at least in principle, be represented in a network. </p><p>The force of the paper lay less in its biological plausibility than in its commensurability. To physiologists, it offered a stripped-down account of nervous activity. To logicians, it showed how propositions could be built into circuits. To mathematicians and engineers, it looked like a schematic for machine design. </p><p>In 1943 Jerome Lettvin introduced Pitts to Norbert Wiener, the mathematician best known for pioneering the field of cybernetics. Wiener was impressed with Pitts, later <a href="https://bigthink.com/neuropsych/walter-pitts-logic/">writing</a> that he was &#8220;without question the strongest young scientist I have ever met&#8221;. He went on to promise Pitts a doctorate in mathematics at MIT, despite the fact that the young man lacked a high school diploma.</p><p>Pitts soon moved to Cambridge as Wiener&#8217;s prot&#233;g&#233;. He joined a circle that included John von Neumann, who in 1945 wrote the &#8216;<a href="https://web.mit.edu/sts.035/www/PDFs/edvac.pdf">First Draft of a Report on the EDVAC</a>&#8217;. 
It was a foundational document for modern computer architecture, one that cited only a single scientific paper: McCulloch and Pitts&#8217;s &#8216;A Logical Calculus of the Ideas Immanent in Nervous Activity.&#8217; </p><p>McCulloch followed Pitts to Massachusetts in 1952, when MIT&#8217;s Jerome Wiesner invited him to head a new project at the Research Laboratory of Electronics. He accepted, trading his professorship and suburban house in Chicago for an apartment and the chance to work again with Pitts. Alongside Lettvin and the Chilean biologist Humberto Maturana they established an &#8216;experimental epistemology&#8217; group in Building 20, a makeshift wartime structure that became famous as an incubator of ideas. </p><p>McCulloch and Pitts had gone from Chicago salons to the centre of American science, with their work standing at the junction of psychiatry, biology, mathematics, and engineering. The field of cybernetics was born from the convergence, with Wiener at its head and McCulloch and Pitts among its central figures. </p><p>Just as things seemed to fall into place, the good times came to an abrupt end when Wiener&#8217;s wife told him that McCulloch was romantically involved with their daughters. Historians generally think there was no evidence for the story, but Wiener believed it all the same. </p><p>He sent Jerome Wiesner, then associate director of the Research Laboratory of Electronics, a telegram: &#8220;Please inform Pitts and Lettvin that all connection between me and your projects is permanently abolished. They are your problem.&#8221; He never spoke to Pitts again, and never explained why.</p><p>For Pitts, it was devastating. He had grown up with an abusive father, cut off his family at fifteen, and been taken in by McCulloch as a surrogate son. Wiener had been another father figure, a mentor who recognised his genius and placed him at the centre of American science. 
He turned down the doctorate that MIT had offered him and set fire to his dissertation notes. He withdrew from friends, drank heavily, and began a long retreat into obscurity. </p><p>In the years after the break Pitts still worked, though without the same momentum that defined his early years. With McCulloch, Lettvin, and Humberto Maturana he co-authored &#8216;<a href="https://hearingbrain.org/docs/letvin_ieee_1959.pdf">What the Frog&#8217;s Eye Tells the Frog&#8217;s Brain</a>&#8217; (1959), an experiment that found the eye filtered and pre-processed visual information before passing it on to the brain. The work was important, but for Pitts it was unsettling because it punctured his view of the brain as a hierarchy of logical propositions. </p><p>In 1969, aged 46, Pitts died from complications of cirrhosis of the liver in a boarding house. Four months later McCulloch, weakened by a heart attack, also passed away.</p><p>McCulloch and Pitts are remembered almost entirely as forerunners of the connectionist tradition, the people who first showed that networks of neurons could compute. But in 1943 they didn&#8217;t think they were choosing between logic and learning. They felt they had squared the circle by demonstrating that what Leibniz had imagined as a calculus of propositions could be realised in the firing of neurons.</p><p>Their story reminds us that AI likes to take the shape of its container. In the early 1940s, psychiatrists could look at the logical neuron and see a stripped-down account of nervous activity, logicians could look at it and see the possibility of a calculus of thought, and engineers could look at it and see a schematic for machine design. The commensurability made possible by a single model allowed these groups to talk to one another, even as they pursued very different ends. 
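The threshold model at the heart of 'A Logical Calculus' can be sketched in a few lines. This is a minimal illustration rather than code from the 1943 paper; the particular weights and thresholds are illustrative choices:

```python
# A McCulloch-Pitts unit: a neuron fires (outputs 1) when the weighted
# sum of its binary inputs reaches its threshold, and stays silent otherwise.

def mcp_neuron(inputs, weights, threshold):
    """Fire iff the weighted sum of binary inputs meets the threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

def AND(a, b):
    # Both inputs needed to reach the threshold of 2.
    return mcp_neuron([a, b], [1, 1], threshold=2)

def OR(a, b):
    # Either input alone reaches the threshold of 1.
    return mcp_neuron([a, b], [1, 1], threshold=1)

def NOT(a):
    # An inhibitory connection: weight -1 with threshold 0.
    return mcp_neuron([a], [-1], threshold=0)

# Composing units builds more complex propositions, e.g. NAND:
def NAND(a, b):
    return NOT(AND(a, b))
```

Wiring such units into networks is what lets any truth-functional proposition be realised, which is the sense in which the paper read as a schematic for machine design.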
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Unnatural Selection]]></title><description><![CDATA[AI Histories #16: Ronald Fisher, statistics, and genetics]]></description><link>https://www.learningfromexamples.com/p/discriminant-analysis</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/discriminant-analysis</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 14 Aug 2025 10:25:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GEZd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GEZd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GEZd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GEZd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GEZd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GEZd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GEZd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg" width="800" height="550" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GEZd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GEZd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GEZd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GEZd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ef47cfc-e14e-466d-9f6c-8cde36af34db_800x550.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Chrysanthemums and Bee</em> (1833-34). 
</figcaption></figure></div><p>In the early 1920s, Ronald Fisher <a href="https://www.wired.com/story/how-a-cup-of-tea-laid-the-foundations-for-modern-statistical-analysis-adam-kucharski-proof-book/">put</a> eight porcelain cups on a garden table at Rothamsted agricultural research station in Hertfordshire. Four had the milk poured first and four had the tea poured first. Muriel Bristol, a biologist who insisted she could taste the difference, sipped and sorted while Fisher looked on. She called all eight correctly. </p><p>Fisher knew that chance alone would yield that score about once in seventy tries, but he also knew that, if her success wasn&#8217;t to be put down to chance, it had to be tested under conditions that removed hidden patterns in the set-up. So long as the cup order was random and the observations accurate, you could in principle formalise the approach as a series of steps to follow. We might call it an algorithm. </p><p>Two decades later, AI grandee <a href="https://history.computer.org/pioneers/samuel.html">Arthur Samuel</a> borrowed the same idea for his checkers program. He let the computer occasionally play random moves in the opening, giving it clean, unbiased samples of board positions before it started learning from them. It&#8217;s a core idea behind even the biggest and best machine learning systems, one that lets them see enough of the world to hoover up the right kinds of patterns. 
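The 'once in seventy' figure falls out of simple counting: knowing that four cups of each kind are on the table, a guesser is choosing one of C(8, 4) equally likely arrangements, only one of which is fully correct. A quick sketch:

```python
from math import comb

# Fisher's lady-tasting-tea calculation: with four milk-first and four
# tea-first cups, there are C(8, 4) = 70 ways to pick which four had
# milk first, so the chance of calling all eight correctly by luck
# alone is 1 in 70.
arrangements = comb(8, 4)
p_all_correct = 1 / arrangements

print(arrangements)    # 70
```

Because every arrangement is equally likely only when the cups are genuinely randomised, the calculation itself is what forces the randomised design.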
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Randomisation, formalised</h3><p>In <em>The Design of Experiments</em>, published in 1935, Fisher described the rule: if you&#8217;re going to compare two treatments, you must assign them to plots at random. Not roughly evenly and not by rotation. Randomly. Because if you don&#8217;t assign things at random, you can&#8217;t tell whether the result is due to the treatment or something else you didn&#8217;t control. </p><p>Maybe one side of the field gets more sun. Maybe the soil is drier in one patch than another. Maybe the experimenter gives a bit more attention to the first group, or unconsciously expects it to do better. Randomisation makes sure that any other differences are spread evenly between groups. That way, if you do see a difference in outcome, you can be more confident it came from the treatment rather than from something else you didn&#8217;t account for. </p><p>Fisher&#8217;s method for testing whether a treatment made a difference &#8212; what we now <a href="https://pubmed.ncbi.nlm.nih.gov/18175604/">call</a> a significance test &#8212; depends on knowing how likely each outcome was, assuming the treatment had no effect. But you can only know that if the treatments were assigned by chance. Without that, there&#8217;s no fixed set of possibilities to compare your result against. </p><p>In this sense, randomisation is the element that makes the test possible. When engineers built systems that experimented on themselves, they copied the structure Fisher had laid down. Randomise the action, observe the outcome, and ask if the difference was larger than chance. 
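That loop can be sketched as a permutation test. The function name and the plot yields below are made-up illustrations, not Fisher's data or code:

```python
import random

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Estimate how often a random relabelling of the pooled data yields a
    difference in means at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # re-randomise the labels
        a, b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            hits += 1
    return hits / n_perm                         # the estimated p-value

# Hypothetical yields from randomly assigned 'treated' and 'control' plots:
treated = [5.1, 4.9, 5.4, 5.2, 5.6]
control = [4.6, 4.4, 4.8, 4.5, 4.7]
p_value = permutation_test(treated, control)
# a small value: the observed gap rarely arises from random relabelling
```

The randomised assignment is what licenses the comparison: the shuffled relabellings enumerate exactly the outcomes that chance alone could have produced.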
Even today, that is the basic logic that lets machines learn by trial and error without fooling themselves.</p><p>In 1922, Fisher published a <a href="https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009">paper</a> that reshaped how statistics was done. Up to that point, most estimates came from algebraic convenience or common sense. Fisher replaced both with another rule that said if you want to estimate an unknown value in your model, choose the one that makes the observed data most likely. That rule became known as <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4360088/">maximum likelihood</a>.</p><p>Maximum likelihood defined a way of thinking where you take a model, plug in the data, and read off which version of the model fits best. That principle now sits under almost every statistical model in AI. Classifiers, regressors, language models are all trained by adjusting parameters to maximise likelihood, or minimise its negative log. That&#8217;s what people mean when they talk about minimising a loss function, whose roots we discussed in more detail in <a href="https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics">AI Histories #10</a>. </p><p>The same paper introduced something he called the information of a parameter, which measured how sharply the likelihood function peaked around the best guess. A steep peak meant high confidence while a flat one meant you weren&#8217;t learning much. I won&#8217;t say much about this point, but it turned out to be an important mathematical object in machine learning that we now refer to as the Fisher information matrix.</p><p>A few years later in 1930, Fisher published <em>The Genetical Theory of Natural Selection</em>. It was a dense, mathematical book whose key idea was that the rate at which a population&#8217;s average fitness improves is equal to the amount of genetic variance in fitness it holds. </p><p>He built models to show what that looked like over time. 
Around the same moment, the American geneticist Sewall Wright was developing a parallel description of drift. This Wright&#8211;Fisher model captures how allele frequencies change across generations due to selection, mutation, and random drift. The model was meant for biology, but it also became the blueprint for genetic algorithms that we looked at in <a href="https://www.learningfromexamples.com/p/emergence-machines">AI Histories #2</a>. </p><p>Fisher&#8217;s theorem said that progress depends on maintaining variance, but the Wright&#8211;Fisher model showed how quickly variance disappears. That&#8217;s still a core challenge in evolutionary computation: how to keep exploring long enough to find something new, without getting stuck on the same hill forever.</p><p>In 1936, Fisher <a href="https://lgross.utk.edu/Math589Fall2020/RAFisher1936measurementsFlowerTaxa.pdf?utm_source=chatgpt.com">analysed measurements</a> from three species of iris &#8212; petal length, sepal width, and so on &#8212; and asked whether the species could be separated based on those numbers alone. The method he used became known as &#8216;linear discriminant analysis&#8217; or LDA.</p><p>The idea was to find a single direction through the data (one suffices to separate two classes) that kept each species tightly grouped while pushing the groups as far apart as possible. To classify a new flower, you take its raw measurements, project them onto that line, and check which side of the dividing point they fall on. </p><p>By the 1950s and 1960s, LDA was <a href="https://www.iosrjournals.org/iosr-jece/papers/Vol.%2010%20Issue%204/Version-1/K010416167.pdf?utm_source=chatgpt.com">well-known</a> to many of the new pattern recognition groups at Bell, MIT, and the Lincoln Lab. Researchers used it to classify phonemes, radar blips, and handwriting. In Duda and Hart&#8217;s 1973 textbook, which was something like a holy text for connectionist researchers well into the 1980s, it&#8217;s the first real classifier discussed. 
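The two-class version of Fisher's procedure can be sketched directly: compute the within-class scatter, take the direction w = Sw^-1 (m1 - m2), and classify by projection. The measurements below are invented stand-ins for two species, not Fisher's iris data:

```python
# A toy two-class Fisher discriminant: 2-D points, scatter matrices
# done by hand so the example needs nothing beyond the standard library.

def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def scatter(rows, m):
    # 2x2 within-class scatter matrix, summed over centred points.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    # w = Sw^-1 (mean_a - mean_b), with the 2x2 inverse written out.
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical petal-length / sepal-width pairs for two species:
species_a = [[1.4, 3.5], [1.3, 3.0], [1.5, 3.4], [1.4, 3.2]]
species_b = [[4.7, 1.4], [4.5, 1.5], [4.9, 1.5], [4.6, 1.3]]

w = fisher_direction(species_a, species_b)

def project(x):
    return w[0] * x[0] + w[1] * x[1]

# Decision boundary: the midpoint between the projected class means.
midpoint = (project(mean(species_a)) + project(mean(species_b))) / 2

new_point = [1.6, 3.3]        # lands on species A's side of the line
is_species_a = project(new_point) > midpoint
```

Dividing by the within-class scatter is what distinguishes Fisher's direction from simply joining the two means: it discounts directions in which the classes are internally noisy.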
</p><h3>Drawing a line</h3><p>In 1933, Ronald Fisher was appointed to the Galton Chair of Eugenics at University College London. He had already spent a decade arguing that Britain&#8217;s falling birth rates were a threat to &#8216;national fitness,&#8217; and that differential reproduction across social classes would lead to civilisational decline. As late as the 1950s, he was still writing letters defending sterilisation policies and publishing essays warning of social degeneration.</p><p>Fisher thought statistics was relevant to politics, and the models he built in genetics &#8212; about selection, fitness, and variance &#8212; fed into the arguments he made about society. He believed that mathematical structures could uncover the natural order of things, and that once uncovered, they ought to be preserved.</p><p>As head of the Galton Laboratory, he helped steer British research into human heredity through the middle of the 20th century. Some of the datasets, measurement protocols, and study designs he left behind were later used to <a href="https://www.nature.com/articles/s41437-020-00394-6">support</a> claims about intelligence and class.</p><p>But his work has been enormously influential in many other less controversial areas. When researchers study algorithmic bias today, for example, they draw on the same theoretical foundations Fisher developed. Fairness audits use his work to measure whether an outcome is evenly distributed across groups, and significance thresholds still rest on the logic of his null-hypothesis framework. </p><p>Some of Fisher&#8217;s ideas are deeply disagreeable, but others are foundational to scientific practice. They live on in ways he never could have imagined, often in pursuit of goals he might have opposed. The lesson, if there is one to be had, is not that <a href="https://www.learningfromexamples.com/p/weighed-measured-and-found-wanting">technology is neutral</a> or that it is hopelessly corruptible. 
In fact, it is technology&#8217;s <a href="https://www.learningfromexamples.com/p/the-beast-in-the-jungle">value-laden nature</a> that lets us scrutinise it, shape it, and put it to work in a way that is commensurate with our own belief systems.   </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[A Mysterious Science]]></title><description><![CDATA[AI Histories #15: The Dartmouth Summer Research Project]]></description><link>https://www.learningfromexamples.com/p/a-mysterious-science</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/a-mysterious-science</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 07 Aug 2025 10:25:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ihrG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ihrG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ihrG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!ihrG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ihrG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ihrG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ihrG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg" width="3191" height="1842" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1842,&quot;width&quot;:3191,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2037223,&quot;alt&quot;:&quot;Constellation Chart&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Constellation Chart" title="Constellation Chart" srcset="https://substackcdn.com/image/fetch/$s_!ihrG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!ihrG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ihrG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ihrG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937cdb35-fa0b-4a7d-b19d-7b8fecf980a6_3191x1842.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Astronomicum Caesareum (1540).</figcaption></figure></div><blockquote><p>&#8220;Now tell me, just what have you and Marv been up to &#8212; Gloria has received just as much information as I have&#8221;<br>&#8212; Louise&#8217;s letter to Ray Solomonoff, July 1956 </p></blockquote><p>It was a good question, one asked by Ray Solomonoff&#8217;s girlfriend Louise in the summer of 1956. Gloria was the wife of the famous mathematician Marvin Minsky, then a Harvard Junior Fellow, whose work we last revisited in <a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">AI Histories #7</a>.</p><p>Ray Solomonoff, meanwhile, has yet to feature in the series but is generally regarded as the inventor of algorithmic probability. In 1956 he was a graduate of the University of Chicago and was working at Technical Research Group in New York.</p><p>Minsky and Solomonoff were spending the summer at Dartmouth College with a group of scientists organised by John McCarthy. The guests, which also included Herbert Simon, Allen Newell, and Claude Shannon (all figures we&#8217;ll get to in our series) were working on what his wife Grace Solomonoff later <a href="https://raysolomonoff.com/dartmouth/dartray.pdf">called</a> &#8216;The Mysterious Science&#8217;. It was a fitting way of describing &#8216;thinking machine&#8217; work, which for a time resisted easy classification.</p><p>Part of the draw of the workshop was to hash out what exactly thinking machines were and how the emerging discipline was referred to. 
&#8216;Artificial intelligence&#8217; was already on the proposal, but the attendees were more likely to describe their work as cybernetics, automata theory, or complex information processing.</p><p>You might think that what we call the thing isn&#8217;t particularly important, and you&#8217;d be right to suggest that definitional questions about the nature of the AI project can be tedious. Even today, you hear people talking up the idea that LLMs aren&#8217;t AI, which is a phrase just one step removed from &#8216;real AI has never been tried yet&#8217;.</p><p>But from a historical perspective it does matter. The field or discipline of artificial intelligence clearly <a href="https://www.learningfromexamples.com/p/an-introduction-to-ai-history">did not begin</a> in 1956; many of the technologies and techniques that are still essential to today&#8217;s AI project are much longer in the tooth than the middle of the 20th century (see <a href="https://www.learningfromexamples.com/p/backpropagation-is-older-than-you">AI Histories #6</a>, <a href="https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics">AI Histories #10</a>, or <a href="https://www.learningfromexamples.com/p/the-turing-test-doesnt-measure-intelligence">AI Histories #13</a>).</p><p>The Dartmouth project, to borrow historian Thomas Haigh&#8217;s phrase, was about giving AI a <em>brand</em> of its own. That isn&#8217;t to cast aspersions on the quality of the AI project, but to recognise that brands are useful for creating and stabilising many forms of creative or intellectual life, for making it clear who owns what and what certain things actually refer to.</p><p>In commercial terms, a brand tells outsiders what they&#8217;re buying; in research politics, it tells funders what they&#8217;re backing and graduate students what tribe they&#8217;re joining. 
Even McCarthy himself later wrote that &#8220;one of the reasons for inventing the term `artificial intelligence&#8217; was to escape association with &#8216;cybernetics.&#8217; &#8230; I wished to avoid having either to accept Norbert Wiener [a major figure in cybernetics] as a guru or having to argue with him.&#8221;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>An immodest proposal</h3><p>The goals of the project were famously lofty. On the original proposal from the year before, McCarthy, Minsky, Shannon, and Nat Rochester, wrote:</p><blockquote><p>&#8220;The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.&#8221; </p></blockquote><p>Where Wiener&#8217;s cybernetics smacked of analogue servos and feedback loops, artificial intelligence was harder to place. It was wide enough to house symbolic logic, neural nets, and whatever else it needed to, yet focused enough to attract cash (the initial workshop was paid for by the Rockefeller Foundation) and energise its researchers. </p><p>The workshop opened on 18 June 1956. Most sessions took place in the top floor classroom of Dartmouth&#8217;s mathematics building. 
John McCarthy, Marvin Minsky, and Ray Solomonoff were there every day, though <a href="https://raysolomonoff.com/dartmouth/boxa/dart56more5th6thweeks.pdf">records</a> show that many of the days weren&#8217;t particularly well attended. </p><p>The work itself was exploratory. W. Ross Ashby demonstrated his electromechanical homeostat, a machine that could keep its needles centred by rewiring itself. On another afternoon the group stopped to check the word &#8216;heuristic&#8217; in a hallway dictionary, the whole meeting standing around the lectern until a definition could be agreed. </p><p>The word was invoked throughout the summer of 1956. The idea was that, instead of trying to analyse the brain to develop <a href="https://spectrum.ieee.org/tag/machine-intelligence">machine intelligence</a>, participants could focus on the operational steps needed to solve a problem, using heuristic methods to identify those steps. Herb Simon and Allen Newell&#8217;s <a href="https://ieeexplore.ieee.org/document/1056797">logic theory machine</a>, for example, used heuristic guides to initiate the algorithmic steps (the set of instructions to actually carry out the problem solving).</p><p>The duo held a session on their device, which saw workshop organiser John McCarthy give them a glowing <a href="https://www-formal.stanford.edu/jmc/slides/dartmouth/dartmouth/node1.html">write-up</a>: </p><blockquote><p>Newell and Simon, who only came for a few days, were the stars of the show. They presented the logic theory machine and compared its output with protocols from student subjects. The students were not supposed to understand propositional logic but just to manipulate symbol strings according to the rules they were given. </p></blockquote><p>When attendees wrote their first post-workshop papers, the logic theory machine and the idea of list processing led the introductions. 
The term &#8216;artificial intelligence&#8217; now points to symbol manipulation first and everything else second, a development that we still wrestle with today when people tell you only symbolic systems can be considered &#8216;AI&#8217;.</p><p>McCarthy&#8217;s phrase had floated through two months of loose talk and hard disagreement without breaking, and by the time the early papers began to cite the Dartmouth meeting the words were already doing administrative work. They marked grant lines, course titles, and the edges of a new research community. AI is still the Mysterious Science in that it promises the moon but leaves the specifics open to interpretation.</p><p>Of course that is entirely by design. Search, neural nets, and probabilistic induction all live underneath its umbrella. Our own moment is trying labels like &#8216;AGI&#8217; and &#8216;superintelligence&#8217; on for size, testing whether they can marshal funding and talent while staying loose enough to survive the revisions that real progress always demands.</p><p>Dartmouth&#8217;s lesson is that a field can begin with unanswered questions and unfinished business, so long as it finds an organising principle that allows disagreement, divergence, and dogma to coexist peacefully.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[If you can't stand the heat ]]></title><description><![CDATA[AI Histories #14: Boltzmann machines]]></description><link>https://www.learningfromexamples.com/p/if-you-cant-stand-the-heat</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/if-you-cant-stand-the-heat</guid><dc:creator><![CDATA[Harry 
Law]]></dc:creator><pubDate>Thu, 31 Jul 2025 10:25:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q6Ql!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q6Ql!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Smarthistory &#8211; Luca Signorelli, The Damned Cast into Hell&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Smarthistory &#8211; Luca Signorelli, The Damned Cast into Hell" title="Smarthistory &#8211; Luca Signorelli, The Damned Cast into Hell" srcset="https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q6Ql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab5f68d-3db6-4088-ac04-3cf52673a52a_2450x1378.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Damned Cast into Hell</em> by Luca Signorelli from 1499&#8211;1502</figcaption></figure></div><p>By the second half of the nineteenth century, physicists knew that energy tended to even out. Hot things liked to cool down and gases expanded to fill the space they were in. Formalised as the second law of thermodynamics, this idea holds that a closed system&#8217;s entropy (often described as &#8216;disorder&#8217;) keeps rising as its energy spreads out.</p><p>That sounds like a force of nature, but it&#8217;s better reckoned with as a way of characterising how systems behave when left to their own devices. If the world looks orderly to us, that&#8217;s just because we&#8217;re experiencing unlikely but possible states bubble up before they disappear. 
</p><p>At the core of this observation is the Boltzmann distribution, which gives the probability of a system occupying a state as a function of that state&#8217;s energy. <a href="https://www.fields.utoronto.ca/talks/Origin-Boltzmann-Distribution">Described</a> by the Austrian physicist Ludwig Boltzmann in the 19th century, the idea holds that low energy states are more likely, and that high energy ones become rarer as a system cools. Because those rarer high energy states crop up more often at higher temperatures, systems become more dynamic as heat increases. </p><p>What matters here is the claim that randomness has structure: even if you can&#8217;t follow every molecule in a glass of water, you can still know what kinds of configurations are likely. Put another way, the Boltzmann distribution is a way of thinking about systems in terms of tendencies rather than rules. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Spin glasses</h3><p>A spin glass is a material made of minuscule magnetic units, called spins, each of which acts like a tiny compass pointing either up or down. In most magnets, the spins tend to align with each other, which creates a strong overall magnetic field. In spin glasses, the spins are <a href="https://www.learningfromexamples.com/p/the-great-hopfield-network-debate">influenced</a> by conflicting forces. Some want to align but others want to point in opposite directions, so there's no arrangement that satisfies all of them at once.</p><p>The result is a magnetic deadlock where the spins get stuck in a disordered pattern with no clear overall direction. 
Our system becomes stable but messy, trapped somewhere between maximally ordered and chaotic states. We describe the specific arrangements in which the spins are held as &#8216;local energy minima,&#8217; a term familiar to anyone who knows about the operation of connectionist AI systems like neural networks. </p><p>Spin glasses neither collapse into randomness nor configure themselves into symmetrical states. They get stuck, but in a way that we can predict. For many scientists, this made them rich research subjects in their own right; for others, the idea of a system that stabilises without fully resolving reminded them of other natural phenomena.</p><p>One particularly resonant comparison came in 1982, when John Hopfield <a href="https://www.learningfromexamples.com/p/the-great-hopfield-network-debate">proposed</a> a simple network of binary units, each connected symmetrically to the others. The idea was that the Hopfield network could store and retrieve patterns by settling into multiple stable states, each of which corresponded to a memory. Rather than being guided by an external controller, it would recall what it had &#8216;seen&#8217; by letting its internal dynamics find a familiar configuration.</p><p>That&#8217;s the core of the &#8216;associative memory&#8217; idea behind the system: the network adjusts gradually until it lands in the configuration that best matches the input. A partial or noisy signal activates the system, and the network completes the pattern automatically.</p><p>Hopfield didn&#8217;t claim this was how the brain actually worked, but he did show that you could treat a pattern recall problem like a physical relaxation problem. What had been a question about cognition became a question about finding low points in a landscape. In doing so it offered a different model of intelligence, one that brought the tools of statistical physics into the world of computation. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="48" height="15.652173913043478" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:48,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Hopfield&#8217;s networks were clever but static. 
The architecture could store patterns, but the rules for how to update the weights were limited and biologically implausible. You could tweak the weights to embed a few memories, but you couldn&#8217;t easily make the system learn from data. </p><p>In 1985, Geoffrey Hinton, David Ackley, and Terry Sejnowski added noise to the Hopfield network. Instead of flipping deterministically into a new state, each unit in the network would switch on or off with a probability that followed the Boltzmann distribution. High energy states were unlikely and low energy ones were preferred. But now, unlike in Hopfield&#8217;s model, the system could find its way out of a local minimum if the temperature was high enough. </p><p>They called it the &#8216;Boltzmann machine,&#8217; and it used a slow but elegant learning rule to update the weights of the system. First, you clamped the visible units to the data and let the hidden units adjust. Then you unclamped the system and let it run freely. You compared the two distributions &#8212; how often different configurations showed up in each phase &#8212; and adjusted the weights to reduce the gap. The goal was to make the model&#8217;s internal world reflect the structure of the real one.</p><p>For a moment, it looked like Boltzmann machines might turn the field on its head. They had the ring of generality in that they weren&#8217;t just memorising examples but learning the distribution those examples came from. In a discipline still recovering from the failure of expert systems, that was an intoxicating promise.</p><p>Alas, they had some problems. Training full Boltzmann machines was slow and sampling took forever. You needed to reach equilibrium just to take a gradient step, and each new data point meant starting the process again. 
It was an elegant theory that couldn&#8217;t scale in reality, at least until Hinton <a href="https://www.cs.toronto.edu/~hinton/absps/nccd.pdf">found a workaround in 2002</a>.</p><p>In this version of the Boltzmann machine, units within the same layer were prevented from communicating. Only visible-to-hidden links remained, which stripped out the feedback loops and made sampling easier. Instead of full equilibrium, you took only enough steps to approximate the gradient in a process called &#8216;contrastive divergence&#8217;.</p><p>Stack a few of these &#8216;restricted&#8217; Boltzmann machines on top of each other and you got what researchers term a &#8216;deep belief network&#8217; where each layer learned to represent the structure of the one below it. In 2006, in one of the first concrete demonstrations that deep learning could work, Hinton and his collaborators <a href="https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf">showed</a> it could achieve decent results on tasks like digit recognition. </p><p>This signal primed the field before convolutional neural networks were retooled for the era of large datasets and GPUs just a few years later. So in 2012, when AlexNet famously proved just how powerful massive neural networks could be, researchers were quick to recognise it as the moment that the deep learning era arrived in force.  </p><h3>Cooling off </h3><p>Today, there&#8217;s a small but serious group of researchers working on modern energy-based models, many of whom see Boltzmann machines as part of their prehistory. They&#8217;re trying to build tools that evaluate configurations rather than generate sequences, that score entire states rather than predict the next token. There&#8217;s a kernel of something interesting there.</p><p>But it&#8217;s also a space full of goofy handwaving. You hear about cognition as entropy minimisation. 
You hear about &#8216;thermodynamic computing&#8217; and you start to notice that the more abstract the claim, the less likely it is to come with a working demo. Boltzmann&#8217;s name helps because it carries weight; people know it vaguely means something to do with probability and physics and systems finding balance. </p><p>But despite their relative lack of popularity, Boltzmann machines still matter to the history of AI. They might not have directly led to today&#8217;s most popular and powerful architectures, but they offered a particularly sharp version of a much older idea about the emergent nature of intelligence. </p><p>That idea was what made machine learning attractive <a href="https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics">from the start</a>. What Boltzmann machines did was push it further, drawing directly from physics to provide a theory of learning as a thermodynamic process. Seen another way, the contribution of Boltzmann machines was more rhetorical than practical. Important, yes, but not because thermodynamic computing is going to replace large language models any time soon.  
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Turing test doesn’t measure intelligence ]]></title><description><![CDATA[AI Histories #13: The genesis of the imitation game]]></description><link>https://www.learningfromexamples.com/p/the-turing-test-doesnt-measure-intelligence</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/the-turing-test-doesnt-measure-intelligence</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 24 Jul 2025 10:25:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q-nY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q-nY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-nY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Q-nY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q-nY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q-nY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-nY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg" width="1200" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Alan Turing obituary&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Alan Turing obituary" title="Alan Turing obituary" srcset="https://substackcdn.com/image/fetch/$s_!Q-nY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Q-nY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q-nY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q-nY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051bf689-8072-4411-ab6a-8e554edd9c39_1200x675.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
</line>">
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Colourised version of a 1946 photograph of Alan Turing running a marathon.</figcaption></figure></div><p>Earlier this year, researchers from UC San Diego <a href="https://x.com/camrobjones/status/1907086875103236393">said</a> OpenAI's GPT-4.5 passed the Turing test. In a <a href="https://arxiv.org/pdf/2503.23674">paper</a> running through the results of the experiment, the group reported that the model was thought to be human more frequently than actual humans. </p><p>That is surely impressive, but it probably means less than you think. As the authors take care to <a href="https://x.com/camrobjones/status/1907086877871448473">explain</a>, the headline result doesn&#8217;t necessarily tell us anything about whether LLMs are intelligent.  </p><p>Today&#8217;s post argues that, despite the status of the &#8216;imitation game&#8217; in the popular imagination, the <a href="https://www.cs.ox.ac.uk/activities/ieg/e-library/sources/t_article.pdf">test</a> wasn&#8217;t designed to be a practical assessment of machine intelligence. Instead, it is better understood as a counterpunch in an intellectual sparring match between Turing and his greatest rivals. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Intelligence and rhetoric </h3><p>The April 2025 paper from UC San Diego follows a <a href="https://arxiv.org/pdf/2405.08007">similar study</a> conducted by the group last year, where they evaluated GPT-3.5, GPT-4, and the ELIZA system I wrote about in <a href="https://www.learningfromexamples.com/p/why-why-why-eliza">AI Histories #11</a>. </p><p>In the 2024 study, the researchers set up a simple two-player version of the game on the research platform Prolific. They found that GPT-4 was judged to be human 54% of the time, that GPT-3.5 succeeded in 50% of conversations, and that ELIZA managed to hoodwink participants in 22% of chats. Real people beat the lot, and were judged to be human 67% of the time. </p><p>As well as reporting more impressive results, the <a href="https://arxiv.org/pdf/2503.23674">recent study</a> moves closer to the structure of the test first put forward by Turing: participants speak to a human and an AI simultaneously and decide which is which. As Turing <a href="https://courses.cs.umbc.edu/471/papers/turing.pdf">explained</a> in the original 1950 paper:</p><blockquote><p>&#8220;It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. 
He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A."&#8221;</p></blockquote><p>Instead of determining whether participant A or B is a man or a woman, the first version of the Turing test sees the judge decide whether the writer is a person or a machine. This three-person structure is usually ignored in favour of a simpler two-person approach, though it was faithfully replicated in the new study.  </p><p>But to take a step back, what do we think a game about whether a man could stand in for a woman (or vice versa) is actually testing? And what do we think that means for the version of the game involving a machine? Turing gives us a clue:</p><blockquote><p>&#8220;The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.&#8221;</p></blockquote><p>The test wasn&#8217;t designed to answer the question of whether machines can think (one doesn&#8217;t make a test to answer a meaningless question). But, just like the gender imitation game, the test must be fulfilled in a way that prevents a third-party observer from being able to tell the difference between those involved. It&#8217;s about the rhetoric of intelligence, not the substance of it.</p><p>In an exchange used to illustrate how we might catch a machine out, Turing describes a back and forth in which the judge asks whether an agent could play chess (it says yes) or write a sonnet (it says no). 
The implication, of course, is that any sufficiently intelligent machine would be capable of engaging in &#8216;creative&#8217; pursuits (apologies to all the chess players out there).</p><p>The final aspect of note is the type of machine that Turing believes will be entangled with intelligence in the future. As he writes towards the end of the paper: &#8220;instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="58" height="18.91304347826087" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:58,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>So we have a thought experiment that seeks to set the conditions in which someone could <em>call</em> machines intelligent, explicit links with gender, learning machines, and creative pursuits as essential markers of intelligence. Taken in the round, these elements puncture the two most common interpretations of the imitation game.</p><p>First, the &#8216;reductionist&#8217; view, which holds that the Turing test was developed to measure intelligence. This idea is popular with some AI practitioners, who see the test as a soluble target that should inform research. In this version, intelligence can be directly measured and passing the test is a meaningful benchmark.</p><p>Next up is the &#8216;constructionist&#8217; interpretation, which focuses on the idea that the test itself creates a certain type of intelligence through its design and implementation. In other words, the test actively shapes our understanding of AI rather than passively measuring it. </p><p>Both interpretations buy into the idea that the test was formulated on the basis that it could, and should, be implemented in the real world. But that isn&#8217;t the case. As Bernardo Gon&#231;alves suggests in <em>The Turing Test Argument</em>, we can&#8217;t escape the context in which the paper was written: Turing&#8217;s debates with physicist Douglas Hartree, philosopher Michael Polanyi, and neurosurgeon Geoffrey Jefferson.</p><p>The essence of the clash is simple. Turing believed that thinking machines would eventually outstrip all of the cognitive abilities of humans, while the others thought otherwise. 
</p><p>University of Cambridge mathematician Douglas Hartree argued that computers would always be calculation engines incapable of acting in creative or unexpected ways. To make his case, Hartree cited, in his 1950 book <em>Calculating Instruments and Machines</em>, Ada Lovelace's view that computers can only do what they are programmed to do: &#8216;The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.&#8217; </p><p>So, an intelligent machine must be capable of performing tasks that it has not been specifically programmed to perform. Turing agreed, which is why he chose to connect his test with a &#8216;child&#8211;machine&#8217; or what he called the &#8216;unorganised machine&#8217; that could learn from experience.</p><p>Turing&#8217;s most respected critic was probably the neurosurgeon Geoffrey Jefferson, who set stringent criteria for machine intelligence that emphasised creativity. As <em>The</em> <em>Times</em> reported in 1949, he commented that &#8216;Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain &#8212; that is, not only write it but know that it had written it.&#8217;</p><p>Responding in the same newspaper the next day, Turing, in typical cutting fashion, told the reporter &#8216;I do not think you can even draw the line about sonnets, though the comparison is perhaps a little bit unfair because a sonnet written by a machine will be better appreciated by another machine&#8217;. As we saw, Turing would go on to incorporate the idea of a machine writing a sonnet and being questioned about it in his imitation game.</p><p>Jefferson also <a href="https://philsci-archive.pitt.edu/20484/1/turing-test-controversy-preprint.pdf">argued</a> that hormones were crucial for producing facets of behaviour that machines could not replicate. 
In one example he said, were it possible to create a mechanical replica of a tortoise, &#8216;that another tortoise would quickly find it a puzzling companion and a disappointing mate.&#8217;</p><p>The relationship between sex and intelligence motivated Turing's decision to include gender imitation as part of his test, a challenge to the idea that certain modes of behaviour were dependent on physiological conditions.</p><p>The final element of the debate that Turing responded to was from Hungarian-British polymath Michael Polanyi, who <a href="https://www.learningfromexamples.com/p/does-ai-know-things">argued</a> that human intelligence involves tacit knowledge that cannot be fully formalised or replicated by machines. </p><p>He was unimpressed by Turing's one-time use of chess as a marker of machine intelligence, and proposed that chess could be performed automatically because its rules can be neatly specified (an idea we circled in <a href="https://www.learningfromexamples.com/p/ais-model-organism">AI Histories #8</a>). The idea led Turing to drop chess as the primary task for demonstrating machine intelligence, replacing it with conversation to better capture the breadth of human cognitive ability.</p><h3>What is the Turing test? </h3><p>The Turing test is at its core an argument, one designed to counter Turing&#8217;s opponents&#8217; views about the nature of machine intelligence. This is why the imitation game addresses the following aspects:</p><ol><li><p>It focused on learning and adaptability, countering Hartree's view of computers as calculation engines.</p></li><li><p>It addressed Jefferson's demands for human-like creative abilities by incorporating language tasks like composing sonnets. 
</p></li><li><p>It was based on gender imitation with the goal of challenging Jefferson's views on the link between physiology and behaviour.</p></li><li><p>It used fluid conversation rather than rule-based games like chess to address Polanyi's concerns about formalisability.</p></li></ol><p>Turing was responding to critics who thought that machines would never match human cognitive ability and believed that genuine artificial intelligence was a non-starter. </p><p>In this sense the Turing test is a trap. At the point at which we can&#8217;t tell the difference between machine poetry and the real deal, any argument about whether machines are capable of artistic outputs runs into a few problems. This is why the primary goal of the Turing test is to formulate the conditions under which someone could <em>call</em> machines intelligent. </p><p>But that&#8217;s not how we remember it. The space between thought experiment and practical experiment has long since collapsed under the weight of its own cleverness. Its animating idea has been recycled so thoroughly that it became divorced from its original context, eventually turning the imitation game into a summit for researchers to climb and an open goal for philosophers to shoot at. </p><p>That today&#8217;s models pass the test is interesting in its own right. But it doesn&#8217;t mean that a longstanding benchmark has been cleared or that satisfying the test is a meaningful marker on the road to machines smarter than you or me. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[A Room Without a View]]></title><description><![CDATA[AI Histories #12: The Machine Stops]]></description><link>https://www.learningfromexamples.com/p/ai-safety-in-edwardian-england</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/ai-safety-in-edwardian-england</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 17 Jul 2025 10:25:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gf5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gf5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gf5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 424w, https://substackcdn.com/image/fetch/$s_!gf5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 848w, 
https://substackcdn.com/image/fetch/$s_!gf5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!gf5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gf5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png" width="1456" height="860" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3726318,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/168066703?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gf5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 424w, 
https://substackcdn.com/image/fetch/$s_!gf5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 848w, https://substackcdn.com/image/fetch/$s_!gf5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!gf5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aa0586-abb7-41c0-a879-9d73809c9a7a_2560x1512.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Still from The Machine Stops episode from <em>Out of the Unknown</em> (1966). </figcaption></figure></div><p><em>Out of the Unknown</em> was a BBC television series about what technology does to the human condition. You can think of it as <em>Black Mirror</em> almost 50 years before the first <em>Black Mirror</em> episode aired. There&#8217;s one about a doctor pushed to the edge while treating a man suffering from radiation poisoning. Another deals with a spaceship en route to a distant star system, which we later learn is a simulation running on Earth.</p><p>But the best of the bunch is &#8216;The Machine Stops&#8217; from the show's second season based on E.M. Forster's novella of the same name. Written in 1909, it&#8217;s a small book about what happens when humans triumph over nature. People reside in underground pods watched over by a machine that provides every comfort they could possibly need.</p><p>No one ever goes to the surface, except a special few who get permission from the ruling elites. When they do, they have to wear a ventilator because the atmosphere is so unfamiliar. The story follows Vashti, an ordinary person who stays busy with video calls and lectures. In the opening pages we find that her son, Kuno, isn&#8217;t happy with the way things are. He wants to go to the surface but Vashti can&#8217;t understand why anyone would want to leave the comfort provided by the machine.</p><p>In the 1960s, when the time came to adapt the novella, the BBC decided that the underground tunnels needed to be as convincing as possible to make life underground feel real. To make that happen, the producers suspended a working monorail track from studio rigging. 
John Bruce, the assistant floor manager on the shoot, <a href="https://directors.uk.com/news/memories-of-philip-saville">said</a> it was capable of &#8216;carrying passengers in a capsule and depositing them into a station,&#8217; a feat he thought was especially important because &#8216;the essence of what E. M. Forster had written way back in 1909, was now, today, fast becoming a reality.&#8217; </p><p>Contemporary reviews were fairly positive, with the <em>Daily Telegraph</em> <a href="https://michael-gothard.livejournal.com/7300.html">remarking</a> that the production was &#8216;visually inventive&#8217; and the dialogue &#8216;unusually distinguished&#8217;. A year later, in 1967, the episode won first prize at Italy&#8217;s Festival Internazionale del Film di Fantascienza. </p><p>The writer, director and cast all went on to other things, but the story didn&#8217;t. As with so many TV dramas from the era, the reels were taped over during one of the regular <a href="https://www.mentalfloss.com/article/501607/wipe-out-when-bbc-kept-erasing-its-own-history">purges</a> of the 1970s that sought to free up real estate on expensive videotapes. But <em>Out of the Unknown </em>managed to survive. The director had opted to shoot on 35mm, and a single negative survived in a film vault in Brentford, re-emerging in 2014. </p><p>Its survival is fitting in that <em>The Machine Stops</em> is a story about vanishing acts. The episode reminds us about the perils of over-reliance, but it also warns us about letting someone or something mediate our interactions with the world. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Edwardian futurism </h3><p>When Forster wrote <em>The Machine Stops,</em> London had electric lights and trains that ran beneath the ground. Telegraph cables linked continents and the first transatlantic telephone was less than a generation away. Writers like H. G. Wells <a href="https://www.newyorker.com/magazine/2011/10/17/utopian-pessimist">cheered</a> on progress, imagining scientific utopias run by technocrats and planetary planners. Even socialists <a href="https://theanarchistlibrary.org/library/petr-kropotkin-fields-factories-and-workshops-or-industry-combined-with-agriculture-and-brain-w">came around</a>, thinking that machines could provide the abundance required to improve the worker&#8217;s lot.  </p><p>Forster had always been a novelist for whom the best things in life arrived by accident. In <em>Howards End</em>, published a few years earlier, he had already warned that the new world might be frictionless but anaemic. The railways may be fast, but they moved people past one another to places they seldom needed to go. Or as Thoreau knowingly put it in <em>Walden </em>half a century earlier: &#8216;We do not ride on the railroad; it rides upon us.&#8217; </p><p>When he sat down to sketch a science fiction story, Forster took aim at the dream that technology could deliver comfort and culture. He imagined a world in which physical effort had been designed away, where information was summoned at will, and where the whole structure worked so smoothly that no one remembered what life looked like before.</p><p>In <em>The Machine Stops</em>, the breakdown happens slowly. 
The music, usually delivered on demand, takes a while to get going. The air vents grow sluggish. A lecture feed stutters, then goes dark. Vashti calls for repair and shrugs when it doesn&#8217;t respond.</p><p>There&#8217;s no immediate panic, because panic requires one to believe the machine to be fallible. The needs of every citizen &#8212; food, warmth, knowledge, and intimacy &#8212; have been routed through humanity&#8217;s big brother for generations. They communicate by screen, consume information mediated by the system, and believe without irony that direct experience is vulgar. </p><p>I think <em>The Machine Stops</em> is important reading, but not because Forster prefigured the internet (or even because he was one of the first to warn of disempowerment via intelligent machines). One way the story works is as a meditation on what happens when the filters that control the flow of information get gummed up. </p><p>Forster imagined a future in which data is abundant but free-floating, where the process for turning the raw ore of information into the alloy we call knowledge becomes fundamentally deficient. That problem occurs because knowing is, at least in part, something we do by living in the world. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="48" height="15.652173913043478" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:48,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Forster&#8217;s experiment with sci-fi is well read in AI safety circles. 
It&#8217;s a good reminder that what looks benign right now may prove to be misaligned given enough time. The machine in the novella does exactly what it was designed to do: it feeds, warms, educates, and entertains. It delivers ideas on demand and it encourages communication to keep people connected. While it does keep people apart, it makes sure they aren&#8217;t alone. </p><p><em>The Machine Stops</em> is a story about what happens when a society replaces reality with representation, when the whole world forgets to touch grass. That risk exists for powerful AI and pod people as well as generative media, recommender algorithms, and remote everything. Systems that replace the condition of understanding with the appearance of it should be handled with care, lest we find ourselves needing to fix them.  </p><div><hr></div><p>You can read <em>The Machine Stops</em> <a href="https://www.cs.ucdavis.edu/~koehl/Teaching/ECS188/PDF_files/Machine_stops.pdf">here</a> for free. The <em>Out of the Unknown</em> episode based on the novella is available for free on Internet Archive <a href="https://archive.org/details/lambda-1/Out+Of+The+Unknown+S02E01+-+The+Machine+Stops+-+1966.mkv">here</a> (along with every episode of the show).  </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Why, why, why, ELIZA? 
]]></title><description><![CDATA[AI Histories #11: When the first chatbot escaped containment]]></description><link>https://www.learningfromexamples.com/p/why-why-why-eliza</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/why-why-why-eliza</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 10 Jul 2025 10:25:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7ShL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7ShL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7ShL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 424w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 848w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 1272w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7ShL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png" width="724" height="437.9753086419753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:567,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7ShL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 424w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 848w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 1272w, https://substackcdn.com/image/fetch/$s_!7ShL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b3b920-8240-4cb1-98af-722d35c4ded7_567x343.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">ELIZA inventor Joseph Weizenbaum (1923-2008) in 1977</figcaption></figure></div><p>Sometimes, when it was quiet, secretaries at the Massachusetts Institute of Technology would slip blank sheets of paper into a large computer. In the heat of 1960s optimism, the machine would whirr and beep and print out a perfectly spaced reply.</p><blockquote><p><em>WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER</em></p></blockquote><p>That sentence was produced by a script called DOCTOR running on an engine called ELIZA, which its creator Joseph Weizenbaum designed to study natural language communication between people and machines.</p><p>Weizenbaum was worried about what he saw. 
In his famous 1966 <a href="https://dl.acm.org/doi/pdf/10.1145/365153.365168">paper</a> describing the experiment, he wrote &#8216;some subjects have been very hard to convince that ELIZA (with its present script) is not human.&#8217; Later he <a href="https://www.theguardian.com/technology/2023/jul/25/joseph-weizenbaum-inventor-eliza-chatbot-turned-against-artificial-intelligence-ai">reportedly</a> said that his secretary requested some time with the machine. After a few moments, she asked Weizenbaum to leave the room. &#8216;I believe this anecdote testifies to the success with which the program maintains the illusion of understanding,&#8217; he recalled.</p><p>The mythos that grew from those sessions is seductive. In 1966 Joseph Weizenbaum invented the first chatbot, named it ELIZA after Eliza Doolittle, and in doing so proved that computers could hold a conversation. </p><p>But the legend forgets that Weizenbaum never set out to build a conversational partner at all. It ignores the psychological dynamics that made the program so popular, and it doesn&#8217;t tell us anything about how exactly a computer program became famous. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>A star is born</h3><p>It&#8217;s useful to think of ELIZA, much like today&#8217;s large models, as a mirror. Both model users by responding with sensitivity to inputs. Both exist as extensions of the person behind the keyboard. And both remind us that intelligence is part substance and part projection.  
</p><p>By <em>substance</em> I mean whatever magic your preferred model runs on, and by <em>projection</em> I&#8217;m talking about the meaning we ascribe to the models on top of this foundation. Today&#8217;s models are deeply impressive, but no matter how good they are, people still have a tendency to see in them something that isn&#8217;t there. </p><p>That isn&#8217;t a criticism of the AI project but the reality of building artefacts that shape-shift according to the person using them. ELIZA was remarkably light on substance, but it made up the difference through projection, adopting a listening style that rewarded personal monologues.</p><p>Once you see the two halves, the standard origin myth looks lopsided. It treats projection as a rounding error, and emphasises the technical credentials that confirmed ELIZA&#8217;s status as the &#8216;world&#8217;s first chatbot&#8217;. That&#8217;s tidy, but it strips the project of its context in a way that leaves readers with the wrong end of the stick.  </p><p>Weizenbaum knew this, which is why he spent so much of his career fretting over the illusion of intelligence. Even at the time, he billed his project as closer to &#8216;watch humans talk to themselves&#8217; than &#8216;teach a computer to talk&#8217;. As Jeff Shrager <a href="https://arxiv.org/pdf/2406.17650">put it</a>:</p><blockquote><p>In building ELIZA, Weizenbaum did not intend to invent the chatbot. Instead, he intended to build a platform for research into human-machine conversation. 
This may seem obvious &#8211; after all, the title of Weizenbaum&#8217;s 1966 CACM paper is &#8220;ELIZA&#8211; A Computer Program For the Study of Natural Language Communication Between Man And Machine.&#8221;, not, for example, &#8220;ELIZA - A Computer Program that Engages in Conversation with a Human User&#8221;.</p></blockquote><p>That claim lands oddly if you&#8217;ve spent years hearing that ChatGPT is the descendant of ELIZA, though it makes a certain degree of sense when you think about the experience of using AI systems. </p><p>The ELIZA project began in 1963, when Weizenbaum knocked together his own toolbox called Symmetric List Processor, or SLIP. It worked like an add-on for FORTRAN, the workhorse programming language of the early 1960s, with flexible chains of items that could grow, shrink, and point to other lists. </p><p>Weizenbaum landed at MIT around this time, parked his SLIP routines on the lab&#8217;s IBM machine, and sought to answer a question: what if a computer, armed with nothing more than keyword tables and pronoun swaps, just bounced a user&#8217;s own sentences back at them? </p><p>He modelled the resulting program on Carl Rogers&#8217; <a href="https://www.ncbi.nlm.nih.gov/books/NBK589708/">person-centred therapy</a>, a counselling style where the therapist mostly repeats or paraphrases the client. Weizenbaum recognised a gift horse when he saw one. If your program can only juggle keywords and pronouns, best to use it in a context where minimal responses counted as professional technique. </p><p>With a few hundred lines of code, Weizenbaum used his SLIP routines to chop each user sentence into a list of words, swapped pronouns (&#8216;<em>I</em>&#8217;&#8594;&#8217;<em>YOU</em>&#8217;) and tacked on open-ended prompts (&#8216;<em>TELL ME MORE</em>&#8217;). He named the engine ELIZA and the therapist script that it ran DOCTOR. 
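</p><p>As a rough sketch of that pipeline (with invented keyword rules, and modern regular expressions standing in for SLIP&#8217;s list machinery, so this is the spirit of DOCTOR rather than Weizenbaum&#8217;s actual script), the whole trick fits in a few lines of Python:</p>

```python
import random
import re

# Invented pronoun table and rules, illustrative only.
PRONOUN_SWAPS = {"i": "you", "me": "you", "my": "your", "am": "are",
                 "you": "i", "your": "my"}

RULES = [
    (re.compile(r"i need (.+)", re.I), "WHY DO YOU NEED {0}"),
    (re.compile(r"i am (.+)", re.I), "HOW LONG HAVE YOU BEEN {0}"),
]
FALLBACKS = ["TELL ME MORE", "PLEASE GO ON"]  # open-ended prompts

def swap_pronouns(fragment):
    # Chop the captured fragment into words and reflect each pronoun.
    words = fragment.lower().split()
    return " ".join(PRONOUN_SWAPS.get(w, w) for w in words).upper()

def respond(sentence):
    # Scan the keyword table; the first matching rule wins.
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            return template.format(*(swap_pronouns(g) for g in match.groups()))
    return random.choice(FALLBACKS)  # no keyword found: stock prompt

print(respond("I need my father"))  # prints: WHY DO YOU NEED YOUR FATHER
```

<p>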
The system was used at MIT by Weizenbaum and his colleagues, but it took until the end of the decade for it to escape containment. </p><p>That moment happened when Bernie Cosell, a young coder at Boston&#8217;s Bolt Beranek &amp; Newman research firm, skimmed Weizenbaum&#8217;s article. He saw the step-by-step description of ELIZA&#8217;s keyword tables and pronoun trick and figured that he could rebuild it in Lisp. Without touching the original code, Cosell used what he read to create a new version of the system from scratch. </p><p>Just a few years later Bolt Beranek &amp; Newman were plugged into ARPANET, which meant that Cosell&#8217;s ELIZA could be reached from any other site by opening a remote terminal session. No punched cards or local install needed, just dial in and use the program from wherever you are.</p><p>ELIZA 2.0, still running the DOCTOR script, travelled farther in a month than Weizenbaum&#8217;s original would in fifty years. Within a few semesters, grad students from Stanford to Edinburgh believed ELIZA was a Lisp program because that was the only version they could dial into during the small hours.</p><p>Ten years on, <em>Creative Computing</em>&#8217;s August 1977 issue printed a BASIC version to bring ELIZA to the masses. Because home computers like the TRS-80, Apple II, and Commodore PET came with a built-in BASIC interpreter, readers could type the program straight in, hit RUN, and use the chatbot themselves. </p><p>Together, Cosell&#8217;s Lisp port and <em>Creative Computing</em>&#8217;s do-it-yourself BASIC version reinforced the myth that ELIZA had been a purpose-built chatbot rather than a lab experiment. </p><h3>Making meaning</h3><p>ELIZA is remembered as a chatbot partly because that&#8217;s a lot easier to get your head around than &#8216;research instrument for studying interpretive projection&#8217;. But it&#8217;s also because that <em>is</em> what its successors were. 
Weizenbaum might have been trying to run a study about human psychology, but the people who pushed new versions of his system most certainly were not. </p><blockquote><p><em>TELL ME MORE ABOUT YOUR FAMILY</em><br><strong>My mother takes care of me.</strong><br><em>WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU</em><br><strong>My father.</strong><br><em>YOUR FATHER</em><br><strong>You are like my father in some ways.</strong><br><em>WHAT RESEMBLANCE DO YOU SEE</em><br><strong>You are not very aggressive but I think you don&#8217;t want me to notice that.</strong><br><em>WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE</em><br><strong>You don&#8217;t argue with me.</strong></p></blockquote><p>This is an excerpt from a transcript published in Weizenbaum&#8217;s 1966 paper. When you understand how the system works it seems rudimentary, but for those who don&#8217;t it looks uncanny.  </p><p>Weizenbaum famously used these exchanges to coin the &#8216;ELIZA effect&#8217;: the tendency to falsely attribute human thought processes and emotions to a machine. What rattled him was the ease with which bright adults projected intelligence and introspection onto a cardboard cut-out made of if-then statements.</p><p>Today, the ELIZA effect is often rolled out to warn people against the dangers of anthropomorphising AI systems. But I think that badly misreads the lesson from Weizenbaum&#8217;s machine. Meaning emerges somewhere between system behaviour and social expectation. Ignore either half and you end up with fully sentient AI systems or, just as reductive, &#8216;stochastic parrots&#8217;.  </p><p>Humans have a tendency to see patterns where there are none, but that doesn&#8217;t mean there are no patterns to be found anywhere we look. Surface behaviour is never the whole story. Even ELIZA mirrored syntax cleanly enough to keep the exchange afloat. 
</p><p>When GPT-4.5 writes code that compiles, translates Spanish without mangling the idioms, or aces the LSAT, those feats are not projections. Of course we tell stories about them, but to suggest the AI project is nothing but narrative risks throwing the baby out with the bathwater. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Lies, damn lies, and statistics]]></title><description><![CDATA[AI Histories #10: Abraham Wald and the origins of machine learning]]></description><link>https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/lies-damn-lies-and-statistics</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 03 Jul 2025 10:26:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZO_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZO_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ZO_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 424w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 848w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 1272w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZO_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png" width="1456" height="771" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:771,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10938836,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/166007428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZO_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 424w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 848w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 1272w, https://substackcdn.com/image/fetch/$s_!ZO_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb578608b-d277-4de9-94aa-ea5f7262c69b_3328x1762.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>Ascent in a Montgolfier Balloon in Aranjuez</em> by Antonio Carnicero from 1784</figcaption></figure></div><p>If you&#8217;ve spent enough time on X, you&#8217;ve probably seen a <a href="https://upload.wikimedia.org/wikipedia/commons/b/b2/Survivorship-bias.svg">picture</a> of a plane riddled with red dots. Usually, it gets wheeled out to poke fun at someone for slipping on one of the internet&#8217;s favourite banana skins: paying attention to something that made it through a process while forgetting to ask what happened to the things that didn&#8217;t. </p><p>This &#8216;survivorship bias&#8217; meme begins in the Second World War, when Allied statisticians studied aircraft returning from combat. Most had bullet holes in the wings and fuselage, with the engines conspicuously unscathed. The obvious solution was to reinforce the damaged areas of returning aircraft to protect them in the future.</p><p>Not everyone agreed with the proposed approach. We <a href="https://www.youtube.com/watch?v=ENFWauAcEyg">are told</a> that the Hungarian mathematician Abraham Wald thought it better to armour the parts without bullet holes, inferring that those were the shots that had likely brought the other planes down. </p><p>It&#8217;s a good story, but the truth is messier. </p><p>Wald did <a href="https://www.researchgate.net/publication/254286514_Abraham_Wald's_Work_on_Aircraft_Survivability">work on aircraft survivability</a> at Columbia&#8217;s Statistical Research Group, and he did help correct for missing data in the military&#8217;s analysis. 
But his contribution was a research project rather than a eureka moment. Over the space of a few weeks, he drafted a memo that corrected for missing data from planes that never returned and balanced statistical inference against the practical limits of aircraft drag and weight. </p><p>The effort was the product of the whole group at Columbia, where mathematicians, economists, and engineers went ten rounds with the results until they were happy with their conclusion. It&#8217;s an important distinction to make because it reminds us that the work is connected to a deeper intellectual legacy that we are still wrestling with today. </p><p>The punchline everyone remembers is used to illustrate the error of reasoning from what&#8217;s visible while forgetting what&#8217;s missing. But that wasn&#8217;t really Wald&#8217;s point. His work was more concerned with showing how to make good decisions when you don&#8217;t have all the data, and with choosing actions that minimise the cost of being wrong. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h2>Profit and loss</h2><p>Born in 1902 in what was then Austria-Hungary, Wald trained as a mathematician in Vienna with a generation of thinkers who were trying to formalise logic. In his 30s, he was forced to flee to the United States in the aftermath of the Nazi annexation of Austria. </p><p>Wald soon joined the Statistical Research Group at Columbia, a classified wartime think tank set up in 1942 where academics including Milton Friedman and George Stigler turned probability theory into military advantage. </p><p>The work was important but foggy. 
How should the Navy test the quality of munitions without wasting shells? How many samples were enough to catch defects in equipment? And of course how could you predict which parts of a bomber ought to be reinforced?</p><p>Wald&#8217;s response to these questions was to treat every problem as a matter of risk, cost, and incomplete knowledge. His big idea was simple enough: if you can&#8217;t eliminate uncertainty, optimise your decision by minimising your expected loss. In other words, weigh the possible mistakes you could make and choose the option that&#8217;s least likely to cause trouble. This became the core of what he <a href="https://gwern.net/doc/statistics/decision/1950-wald-statisticaldecisionfunctions.pdf?utm_source=chatgpt.com">eventually</a> called statistical decision theory, which we can think about as betting wisely when we don&#8217;t know the odds. </p><p>In 1945, with the war winding up, Wald published a <a href="https://www.jstor.org/stable/1969022">technical report</a> about how to make decisions under uncertainty when classifying something into groups (say, whether a signal is from a friend or foe). </p><p>Crucially, it accounted for cases where the available evidence is uncertain and the cost of misclassification differs depending on the mistake. His solution was to choose the option that minimises the expected loss. You do that by considering all the possible ways you could be wrong, figuring out how likely each one is, and asking how much each would cost you. </p><p>Once you&#8217;ve done that, you pick the option with the lowest overall risk.</p><p>Wald&#8217;s move was to treat classification as a decision problem under uncertainty. 
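</p><p>That recipe (weigh each way of being wrong by how likely it is and how much it costs, then pick the cheapest option) can be sketched in a few lines of Python. The friend-or-foe numbers below are invented for illustration, not drawn from Wald&#8217;s report:</p>

```python
def min_expected_loss(posteriors, loss):
    """Return the action whose expected loss over the possible true classes is smallest."""
    best_action, best_risk = None, float("inf")
    for action, costs in loss.items():
        # Expected loss of this action: sum over true classes of
        # P(class | evidence) * cost of taking `action` when that class is true.
        risk = sum(posteriors[c] * costs[c] for c in posteriors)
        if risk < best_risk:
            best_action, best_risk = action, risk
    return best_action

# A friend-or-foe call where missing a foe is far costlier than a false alarm.
posteriors = {"friend": 0.7, "foe": 0.3}   # belief after seeing the evidence
loss = {
    "label_friend": {"friend": 0.0, "foe": 10.0},  # waving through a foe is expensive
    "label_foe":    {"friend": 1.0, "foe": 0.0},   # a false alarm is cheap
}
print(min_expected_loss(posteriors, loss))  # prints: label_foe
```

<p>Even though &#8216;friend&#8217; is the likelier class, the asymmetric costs flip the decision, which is exactly the kind of trade-off Wald formalised.</p><p>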
He showed that if you knew the approximate distributions of the two groups and the cost of each type of error, then you could calculate the best way to label a new observation.</p><p>After the war, Wald&#8217;s decision theory approach to classification was taken up by researchers across statistics and engineering. One direct successor was a 1951 <a href="https://catalog.hathitrust.org/Record/100923824?utm_source=chatgpt.com">paper</a> by Evelyn Fix and Joseph Hodges at Berkeley, which framed pattern classification as a statistical task. They wondered how best to assign a new observation to one of several known classes given only a sample of labelled data. </p><p>The field they were building would eventually be known as pattern recognition and became a small but serious research community by the middle of the 20th century. As the area matured, attention shifted from hand-crafted rules to models that could learn those rules from data. That question, how to let the data determine its own decision boundary, sat at the heart of the nascent discipline and ultimately machine learning. </p><p>By the 1970s the pattern recognition crowd formalised Wald&#8217;s insight into what they called &#8216;empirical-risk minimisation&#8217; where you pick the rule that makes the smallest average mistake on the data you have. Soviet theorists Vapnik and Chervonenkis <a href="https://openeclass.panteion.gr/modules/document/file.php/PMS152/LEARNING/Vapnik%20V.N.%2C%20and%20Chervonenkis%20A.Y.%20%281971%29%20--%20On%20Uniform%20Convergence%20of%20the%20Relative%20Frequencies%20of%20Events%20to%20Their%20Probabilities%2C%20Theory%20of%20Probabilitiy%20and%20Its%20Applications%2C%20vol.%20%2C%20pp.%20264-280%20.pdf?utm_source=chatgpt.com">famously used this idea</a> to show how well any classifier trained on finite data can be expected to generalise. </p><p>Around the same time, the doctrine of loss minimisation found its engine. 
In 1974 Paul Werbos described backpropagation, the calculus trick for computing how every weight in a layered network should change to reduce a chosen loss that we discussed in <a href="https://www.learningfromexamples.com/p/backpropagation-is-older-than-you">AI Histories #6</a>. </p><p>When Rumelhart, Hinton and Williams <a href="https://www.nature.com/articles/323533a0">reintroduced</a> it in 1986 they gave neural networks a practical way to compute the loss gradient for every weight. In doing so, they turned Wald&#8217;s &#8216;minimise expected loss&#8217; idea into an optimisation procedure for models with thousands (and eventually trillions) of parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="64" height="20.869565217391305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:64,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Wald is a curious figure in the history of thinking machines. Outside a handful of <a href="https://link.springer.com/chapter/10.1007/978-3-030-56286-1_2">exceptions</a>, his work is rarely discussed in the same context as the AI project that we know today. He didn&#8217;t write code, didn&#8217;t talk about consciousness, and didn&#8217;t speculate about living beside machines smarter than we are. </p><p>What he did do was lay down the logic that modern AI still follows, the stuff that deals with how to make decisions when the data is noisy and the outcome matters. But when the <a href="https://upload.wikimedia.org/wikipedia/commons/b/b2/Survivorship-bias.svg">plane meme</a> next appears on your timeline, remember that the same logic that kept B-17 crews alive now guides the systems that millions of people use every single day. 
</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe for free as we piece together 50 short stories about AI&#8217;s past lives.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[It's all Greek to me]]></title><description><![CDATA[AI Histories #9: The syllogism and thinking machines]]></description><link>https://www.learningfromexamples.com/p/does-ai-begin-with-aristotle</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/does-ai-begin-with-aristotle</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 26 Jun 2025 10:25:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WqBC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WqBC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!WqBC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 424w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 848w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WqBC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3742646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/166007428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WqBC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 424w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 848w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!WqBC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70cd66b7-6c3a-46f4-927b-08d73ebba016_1466x1010.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Dante and Virgil in Hell</em> by Gustave Dor&#233; from Dante's Inferno, translated by Henry Francis Cary from 1885 (detail)</figcaption></figure></div><p><em>The Quest for Artificial Intelligence</em> by Nils Nilsson is one of AI&#8217;s most <a href="https://www.learningfromexamples.com/p/on-ai-history">well-known</a> histories. Nilsson&#8217;s account is by no means flawless, but it is a remarkably readable book that captures the great majority of the field&#8217;s most important milestones. </p><p>For a book that focuses almost exclusively on the recent lineage of the technology, it picks a curious place to begin. Not the Dartmouth Summer Research Project on Artificial Intelligence. Not Alan Turing or Alonzo Church. 
And not the mathematicians and scientists of the early modern period. </p><p>As you might have guessed given the title of this post, <em>The Quest for Artificial Intelligence</em> starts with Aristotle. More specifically, Nilsson reckons AI&#8217;s story begins with the <em>syllogism.</em></p><p>In simple terms, the syllogism is a classic form of logical argument that deduces a conclusion from two premises. Our man Aristotle formalised this pattern over two millennia ago; it usually looks something like this:</p><ul><li><p><strong>All</strong> members of group <em>X</em> have property <em>Y</em>.</p></li><li><p><strong>All</strong> members of group <em>Z</em> are members of group <em>X</em>.</p></li><li><p><strong>Therefore, all</strong> members of group <em>Z</em> have property <em>Y</em>.</p></li></ul><p>The most famous example is <em>All humans are mortal. Socrates is a human. Therefore, Socrates is mortal.</em> By dropping &#8216;humans&#8217; and &#8216;mortal&#8217; and &#8216;Socrates&#8217; into the template, we get a logically valid conclusion. </p><p>In the syllogism, the form of the argument guarantees the conclusion regardless of the specific content. We could replace &#8216;humans&#8217; with &#8216;athletes&#8217; and &#8216;mortal&#8217; with &#8216;healthy&#8217; and it still makes a kind of sense. Reasoning can be abstracted into these formal structures, which helps us sift how we reason from what we&#8217;re reasoning about.</p><p>Our abstraction means that if we can represent facts symbolically (like &#8216;All X are Y&#8217;), we can let the form of a syllogism carry us to new facts without needing new observations from the world. </p><p>For centuries, syllogisms and formal logic were held up as the model of good thinking in Europe (until the Baconian programme of hands-on observation became the preferred ideal of rationality). 
Later thinkers created devices that could handle symbols &#8212; like William Stanley Jevons&#8217; mechanical &#8216;logic piano&#8217; that used Boolean algebra to solve logical problems &#8212; which suggested that machines could in principle automatically carry out logical inferences. </p><p>Once they could shuffle symbols, theorists wondered whether they might turn all of mathematics, and perhaps thought itself, into formal syntax. From Frege&#8217;s <em>Begriffsschrift</em> to Hilbert&#8217;s proposed solution to the crisis of mathematics, logicians treated reasoning as a game of symbolic moves played on blank paper. Mechanical devices proved those moves could be executed without human hands.</p><p>This idea stretched into the middle of the 20th century when &#8216;AI&#8217; began to emerge as a distinct brand of research (though as we <a href="https://www.learningfromexamples.com/p/an-introduction-to-ai-history">know</a> its origins go much further back). </p><p>A new generation of researchers wondered whether, if a computer could apply logical rules to symbols representing the world, we might say that a machine was &#8216;reasoning&#8217;. In 1956, Allen Newell, Herbert Simon, and J.C. Shaw built the Logic Theorist program that proved mathematical theorems by searching for proofs in propositional logic (an algebraic descendant of Aristotle&#8217;s syllogism).</p><p>By the 1970s, these ideas were core to the &#8216;symbolic&#8217; school of AI, one of two main approaches to building thinking machines alongside the &#8216;connectionist&#8217; branch that includes modern neural networks.    </p><p>The symbolists used hand-written if&#8211;then rules (inspired by first-order logic) that operated on strings of symbols standing for real-world concepts. 
These constructs, many of which took the form of &#8216;expert systems&#8217;, were AI programs designed to mimic the decision-making of human specialists by employing specific rules like &#8216;<em>if</em> conditions A, B, and C are true, <em>then</em> conclude X.&#8217; </p><p>Stanford University&#8217;s medical programme MYCIN had around 450 rules encoding knowledge about infections. A simplified MYCIN rule looked a bit like: &#8216;IF the organism is Gram-positive <em>coccus</em> AND the infection is hospital-acquired OR the strain is known to be penicillin-resistant, THEN suspect <em>Staphylococcus aureus</em>.&#8217;</p><p>MYCIN could also explain its reasoning in plain English by tracing the rules it used, which is one of the reasons expert systems are remembered fondly by some AI researchers (in comparison to the famously opaque neural networks that dominate the research landscape today). </p><p>The underlying philosophy behind expert systems held that if we explicitly tell a machine facts and rules, it can logically deduce new conclusions like a human. Give the computer the right heuristics, and it will behave intelligently within a given domain.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h3>Ways of knowing </h3><p>The logical approach at the core of early AI lived in Aristotle&#8217;s shadow, but the man himself had a different view of how intelligence works. His position was closer to what later thinkers call empiricism, the belief that everything we know begins with our senses (though he still thought the mind needed to organise those raw impressions into general ideas). </p><p>We meet horses, smell pine, and feel heat. 
From encounters like these the mind actively grasps what is common and remembers these associations for the future. In this picture, the form of &#8216;horseness&#8217; already exists in the world and the intellect seeks it out.</p><p>A different view is &#8216;nativism&#8217;, which holds that knowledge is primarily innate. Plato, famous too for his role as Aristotle&#8217;s teacher, <a href="https://www.jstor.org/stable/pdf/27901150.pdf">suggested</a> that learning is a process of recollection where the soul remembers truths it knew before birth. In modern terms, we can think of nativism as the idea that the brain comes pre-wired with certain concepts or ways of reasoning (an idea that runs through Cartesian philosophy all the way to Chomsky&#8217;s universal grammar). </p><p>I&#8217;m leaving much on the cutting room floor, but the broader point is that empiricists say knowledge is mainly a function of experience while nativists say what we know is mainly hard-coded. It&#8217;s a spectrum, really &#8212; few would deny <em>any</em> role of experience, and few would claim <em>everything</em> is innate &#8212; but it&#8217;s about where the emphasis lies. </p><p>Nonetheless, this question became the essential philosophical dividing line between symbolic AI&#8217;s expert systems and connectionism&#8217;s neural networks (and I might argue that it partly explains the wildly different views people have about the AI project today). </p><p>The symbolic approach behind expert systems is nativist in that knowledge is directly hard-wired into the machine. MYCIN didn&#8217;t deduce the principles of infectious disease by reading medical journals or analysing patient data on its own; the Stanford team fed it all the relevant rules they could gather from doctors. </p><p>Contrast this with the neural network tradition, which we can view as heir to empiricist conceptions of mind. 
Here, instead of loading the machine with explicit knowledge, we let our model learn from examples (cue klaxon). </p><p>Frank Rosenblatt&#8217;s perceptron, which we <a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">discussed in AI Histories #7</a>, is a classic example of this approach. When Rosenblatt wanted it to recognise dots, he didn&#8217;t state rules; he simply showed it enough examples until the network figured it out alone.   </p><p>Each side had valid points. You can get symbolic systems to follow the chain of logic in a way that doesn&#8217;t easily violate known principles within their domain. But they are bad at adapting to new situations beyond their knowledge base, and acquiring the knowledge for each new domain is hard work. </p><p>Empiricist systems like neural networks are great at learning new stuff. They can ingest massive amounts of data and find structure on their own, often noticing subtle correlations humans haven&#8217;t encoded. This makes them very powerful for tasks like vision or speech, where we may not know the explicit rules to formulate reliable guesses. </p><p>But we still don&#8217;t fully understand some of the underlying processes that make connectionist systems tick, and they have a tendency to break in strange ways when faced with inputs that don&#8217;t correspond to patterns they are familiar with.  </p><p>These weaknesses are in some ways mirror images. The symbolic approach lacks flexibility (a strength of empiricism), and the learning approach lacks clear causal structure (a strength of nativism). </p><p>It&#8217;s for this reason that people get very excited about &#8216;neurosymbolic AI&#8217;, which deals with building composite systems that pair neural networks with symbolic components layered on top. 
I wrote about this idea <a href="https://www.learningfromexamples.com/p/academics-need-to-take-ai-seriously">a few weeks ago</a> in the context of the &#8216;systemisation&#8217; of large models, which aims to sand down some of their rough edges: </p><blockquote><p>Systemisation is about making the core model a node within a bigger apparatus. We keep the language model in place, but surround it with specialist gadgets. Web search look-up, a code sandbox, a vision encoder, and a knowledge base. The model doesn&#8217;t need to have all the answers, it just needs to decide when and how to invoke the right tool. </p></blockquote><p>In practice that looks less like grafting logic inside the network and more like giving it a bunch of rule-bound helpers. The neural network core supplies fluency, while the plug-ins give it the things pattern-matchers are bad at (e.g. exact recall, arithmetic certainty, and pulling out up-to-date facts).</p><p>In other words, we patch the brittleness of symbolic AI with learning, and the exotic failure modes of pure learning with explicit rules. It&#8217;s not the neat marriage of neurons and symbols that some theorists once imagined, but it is a workable settlement that is already yielding systems much more powerful than the sum of their parts.</p><p>Aristotle helped lay the foundations for both traditions. He formulated the syllogism that would eventually help define rule-based reasoning, while reminding us that knowledge begins with experience. </p><p>Nilsson opens his book with the Greek philosopher because he was a symbolic AI researcher who saw the syllogism in everything he did. He was writing in the mid 2000s, before deep learning emerged to blow past one wall after another. </p><p>If an AI researcher wrote the same type of book today, they might also be tempted to start with Aristotle. But the syllogism might have to take a back seat. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[How chess became AI's model organism]]></title><description><![CDATA[AI Histories #8: The royal game is a testbed for machine intelligence. Why is that?]]></description><link>https://www.learningfromexamples.com/p/ais-model-organism</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/ais-model-organism</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 19 Jun 2025 10:07:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sBhA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sBhA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sBhA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!sBhA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 848w, https://substackcdn.com/image/fetch/$s_!sBhA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sBhA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sBhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg" width="1456" height="1127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Chess Players (Daumier) - Wikipedia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Chess Players (Daumier) - Wikipedia" title="The Chess Players (Daumier) - Wikipedia" srcset="https://substackcdn.com/image/fetch/$s_!sBhA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!sBhA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 848w, https://substackcdn.com/image/fetch/$s_!sBhA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sBhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2d541f-31d8-4892-a0a9-92b9c8e7a330_4612x3571.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Chess Players by Honor&#233; Daumier from 1863</figcaption></figure></div><p><em>2001: A Space Odyssey</em> is a film about astronauts getting put through the wringer. Our protagonists are forced to deal with zero-gravity plumbing and made to jog on a giant hamster wheel to stave off muscle loss. They&#8217;re trapped outside their ship, bundled into exploration pods, and even eke out a lifetime in bed with a cuboid whose bedside manner leaves something to be desired.   </p><p>But there&#8217;s another point where one of our heroes finds himself in a bind, one less visually arresting but in some ways just as unnerving. </p><p>That moment is <a href="https://www.youtube.com/watch?v=MzIQUDQO-ag">when</a> the HAL 9000 computer handily beats Frank Poole in a game of chess. Modern viewers are unlikely to bat an eye, but we ought to remember the film was released almost 30 years before the 1996 six-game chess series between world chess champion Garry Kasparov and IBM&#8217;s Deep Blue. </p><p>The board positions and moves depicted are <a href="https://www.chessgames.com/perl/chessgame?gid=1254321">identical</a> to those in a real game played in Hamburg in 1910, which was reported in a 1955 <a href="https://www.abebooks.co.uk/first-edition/1000-Best-Short-Games-Chess-Chernev/31177021511/bd">collection</a> by Ukrainian-born American chess player and author Irving Chernev.</p><p>Aside from the fact he loved chess, director Stanley Kubrick included the scene to show the machine could out-think the crew should they find themselves at odds. It comes across as foreboding enough, but only because chess had long been used as a proxy for the &#8216;I&#8217; in &#8216;AI&#8217;.  </p><p>One of the great figures in the history of thinking machines agreed. 
Marvin Minsky, a key player in <a href="https://www.learningfromexamples.com/p/uncle-sams-electronic-brain">AI Histories #7</a>, served as Kubrick&#8217;s principal scientific consultant. Like many of his colleagues, he saw the game as part proving ground and part experimental medium for efforts to build intelligent machines. </p><p>Today, we see a chess-playing AI and shrug. It&#8217;s a sight so familiar we&#8217;d probably roll our eyes if we saw it in a modern flick. But it wasn&#8217;t always this way.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><p>Our story begins in the 18th century with the Hungarian engineer Wolfgang von Kempelen, who constructed the Mechanical Turk for the Empress Maria Theresa. As we saw in <a href="https://www.learningfromexamples.com/p/the-man-behind-the-curtain">AI Histories #4</a>, Kempelen&#8217;s machine used a series of levers, gears, and magnets, which allowed a human operator concealed within a cabinet to discreetly control the movements of the chess pieces on the board above.</p><p>For eight decades, the Mechanical Turk toured Europe and the United States. The machine (or rather, the chess grandmaster hiding inside the device) allegedly bested Benjamin Franklin and Napoleon Bonaparte. By 1818, when the Turk was under the ownership of the budding mechanist and musician Johann M&#228;lzel, a young Charles Babbage (whom we discussed in the <a href="https://www.learningfromexamples.com/p/an-introduction-to-ai-history">introduction</a> to AI history) saw it in London. 
</p><p>Long before digital computers, the Spanish engineer Leonardo Torres Quevedo&#8217;s El Ajedrecista electrically sensed each piece and delivered a forced mate every time, making it the first genuine chess-playing machine when it was built in 1912. </p><p>These moments represent a conceptual starting point, but they don&#8217;t really tell us anything about how chess shaped AI development. For that, we need to jump ahead to Alan Turing. In the late 1940s, the British mathematician designed a chess-playing programme called Turochamp (though no computer could run it), and he included the royal game in the thought experiment <a href="https://www.learningfromexamples.com/p/the-cruelty-of-the-turing-test">used to introduce</a> what would become known as the Turing test. </p><p>Not long after, Claude Shannon wrote a <a href="https://www.pi.infn.it/%7Ecarosi/chess/shannon.txt">paper</a> making the case that chess was the perfect testbed for AI. Not only did it have clearly defined moves and an ultimate objective (checkmate!), but it struck a balance between being neither overly simple nor insurmountably challenging.</p><p>But Shannon was dreaming bigger than sandboxes. As he <a href="https://www.pi.infn.it/%7Ecarosi/chess/shannon.txt">explained</a>, &#8216;chess is generally considered to require &#8220;thinking&#8221; for skilful play; a solution of this problem will force us either to admit the possibility of a mechanized thinking or to further restrict our concept of thinking&#8217;. </p><p>Unfortunately, building a formidable chess-playing machine was easier said than done. In the early 1950s, Dietrich Prinz&#8217;s chess system was the first to run on a stored-program computer &#8212; but it could only solve mate-in-two problems rather than playing full games. 
</p><p>A machine couldn&#8217;t yet play a full game because the numbers involved in constructing even a partially complete decision tree quickly became astronomical.</p><p>So, what to do with a chess programme that was unable to compute all possible moves in a game? The answer was simple. Instead of calculating the scope of all eventualities, the machine would be programmed to evaluate a limited number of promising turns.</p><p>To put this idea into practice, Shannon introduced two solutions. First, the 'Type-A' method: inspect every legal move out to a fixed depth. Second, the 'Type-B' approach, which used heuristics to prioritise certain moves that its makers thought looked good. </p><p>While Shannon favoured the human-like Type-B method, his work focused primarily on the Type-A strategy. Both approaches centred on the minimax algorithm, whose goal was to minimise the worst-case potential loss, which we can think of as the disadvantage a player might face in a game due to a particular move. This was the approach that <a href="https://www.cs.ox.ac.uk/activities/ieg/e-library/sources/t_article.pdf">became</a> dominant in computerised chess in the 20th century. </p><p>Around this time, Allen Newell, Herbert Simon, and Clifford Shaw <a href="https://www.computerhistory.org/chess/the-minimax-algorithm-and-alphabeta-pruning/">rediscovered</a> and bolted on alpha&#8211;beta pruning, a technique concurrently developed by others (including Dartmouth workshop organiser John McCarthy). </p><p>The collision of the minimax algorithm with the alpha&#8211;beta pruning technique significantly reduced the total number of branches of the decision tree that the system needed to consider. Together, the techniques dramatically increased efficiency and made it possible to play chess on practically any computer. 
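To make the idea concrete, here is a minimal sketch of minimax with alpha-beta cut-offs in Python. It is an illustration only, assuming a toy tree whose leaves are static evaluations and whose internal nodes are lists of children (a stand-in for a real move generator and evaluation function), not a reconstruction of any historical program:

```python
def alphabeta(node, alpha, beta, maximising):
    # Leaves hold static evaluations; internal nodes are lists of child positions.
    if not isinstance(node, list):
        return node
    if maximising:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off: the minimiser will never allow this branch
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if beta <= alpha:
            break  # alpha cut-off: the maximiser already has something better
    return value

# Two-ply toy game: the maximiser picks the branch whose worst case is best.
tree = [[3, 5], [2, 9]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # prints 3
```

Plain minimax would visit every leaf; the cut-offs let the search skip branches that cannot change the result, which is the saving that made full games playable on the hardware of the day.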
</p><h2>Elo Elo</h2><p>In 1965 the Russian mathematician Alexander Kronrod, when quizzed about expending precious compute cycles on chess at the Soviet Institute of Theoretical and Experimental Physics, gave an <a href="https://homes.luddy.indiana.edu/nensmeng/files/Ensmenger2012-Chess.pdf">explanation</a> that sheds light on the relationship between chess and AI. </p><p>It was essential that Kronrod, as an influential researcher at the top of his game, be allowed to devote computer time to the game because &#8216;chess was the drosophila of artificial intelligence&#8217;. </p><p>What Kronrod meant was that chess was well suited to its role as an experimental medium: <em>Drosophila melanogaster </em>(the common fruit fly) is used as a &#8216;model organism&#8217; by researchers in various programmes of genetic analysis. For Kronrod, it was the internal characteristics of chess that made it ideal. It was, after all, a simple game with a well-defined problem domain and unambiguous rules, clear objectives, and straightforward measures of success. </p><p>This is the reason that games have been used throughout the history of AI, even those that are more complex like Dota 2, StarCraft or Pok&#233;mon. But therein lies the rub. If all games share these fundamental qualities, why was it that computer scientists used chess specifically? </p><p>One obvious answer is that chess &#8212; unlike say, Go &#8212; was popular with American and European researchers. The ability to play chess well was also traditionally <a href="https://www.sciencedirect.com/science/article/abs/pii/S0160289616301593">considered</a> to be an indicator of intelligence, and the game has long been associated with intellectuals, artists and other high-status types. 
</p><p>AI grandees Allen Newell and Herbert Simon, who were among the participants in the influential AI conference in Dartmouth in 1956, famously <a href="https://research.gold.ac.uk/id/eprint/22940/1/Human_Machine_Learning_2018_CE.pdf">said</a>: &#8216;chess is the intellectual game par excellence.&#8230;If one could devise a successful chess machine, one would seem to have penetrated to the core of human intellectual endeavor.&#8217; </p><p>For these reasons, lots of 20th century AI researchers played chess. As the historian Nathan Ensmenger <a href="https://homes.luddy.indiana.edu/nensmeng/files/Ensmenger2012-Chess.pdf">puts it</a>, &#8216;many of the mathematicians who worked on computer chess, including Turing and Shannon, were avid amateur players.&#8217; </p><p>Chess also came with the Elo system, a ranking approach named after the Hungarian-born physicist <a href="https://worldchesshof.org/hof-inductee/arpad-emrick-elo">Arpad Elo</a>. While I won&#8217;t spend time on the details, the point is that the Elo system provided clear numerical benchmarks for measuring performance and improvement &#8212; a very helpful quality when designing computer systems that get better over time. </p><h2>Endgame</h2><p>Chess proved to be an ideal (or at the very least, idealised) testbed for AI research in the 20th century. Its balance of complexity and simplicity, widespread popularity, codified rules, and quantitative performance metrics like Elo ratings made it AI&#8217;s model organism. </p><p>The game was popular with the researchers designing AI systems, and the rich documentation of games, openings, and scenarios provided the information needed to design and develop early chess engines. 
The Elo system offered a stable means of assessing performance, and chess matches provided public spectacle at the height of Cold War competition.</p><p>While AI research has moved on from chess, its history tells us that the artefacts through which technical practice takes place aren&#8217;t chosen by accident. That is not to say that chess is an inappropriate medium for building AI systems or that today&#8217;s testbeds are troublesome, but rather that the use of chess in AI reminds us that science doesn&#8217;t happen in a vacuum.</p><p>Just as the choices we make about which mediums to use shape the direction of research, so too are these decisions influenced by sources we don&#8217;t always recognise. The history of science is replete with examples in this tradition, from the Lenna image on which the early standards for computerised image processing depended to the international prototype kilogram that enabled the measurement of mass for well over a century.</p><p>The historian Dylan Mulvin <a href="https://mitpress.mit.edu/9780262045148/proxies/">calls</a> these objects &#8216;proxies&#8217;. He argues that they mediate between the practicality of getting work done and the representation of the world. Perhaps chess does something similar. The board might be small, but you can pack a lot into sixty-four squares.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Histories. Subscribe for free as we make it to 50 short stories about AI&#8217;s past lives.  
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Uncle Sam's electronic brain ]]></title><description><![CDATA[AI Histories #7: The perceptron and its critics]]></description><link>https://www.learningfromexamples.com/p/uncle-sams-electronic-brain</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/uncle-sams-electronic-brain</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 12 Jun 2025 10:23:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kJaF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kJaF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kJaF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 424w, https://substackcdn.com/image/fetch/$s_!kJaF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 848w, 
https://substackcdn.com/image/fetch/$s_!kJaF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kJaF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kJaF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png" width="1122" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1122,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1541884,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/164417586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kJaF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!kJaF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 848w, https://substackcdn.com/image/fetch/$s_!kJaF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kJaF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd422ddfb-7edd-4e02-b5c4-90cbde39ff7e_1122x816.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Frank Rosenblatt and his computer &#8216;embryo&#8217; </figcaption></figure></div><p>In the summer of 1958, Frank Rosenblatt was putting on a show. The Cornell University psychologist made the trip down the east coast to the Weather Bureau in Washington D.C.</p><p>Rosenblatt wasn&#8217;t meeting meteorologists. His partners were at the Office of Naval Research, a group of clever G-Men who needed access to the bureau&#8217;s IBM 704 computer. </p><p>In the computer bay on Independence Avenue, Rosenblatt and his backers invited reporters to watch as he fed the machine a deck of fifty punch-cards. The 704 whirred and buzzed, and began to guess the location of a square symbol on each. </p><p>After 50 practice runs, the machine could correctly puzzle out where each mark was by using what would become known as the &#8216;perceptron&#8217; algorithm. Much like a young child, the room-sized computer learned to tell left from right. </p><p>Reporters were shocked. A wire story ran that afternoon &#8212; and the <em>New York Times</em> the following day &#8212; carrying the <a href="https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html">news</a> that the &#8216;embryo [would one day] walk, talk, see, write, reproduce itself and be conscious of its existence.&#8217; </p><p>The perceptron&#8217;s operation was straightforward. At its core lay a set of weighted inputs that could be adjusted depending on whether the machine made correct or incorrect guesses. Just as we might learn language through correction and repetition, Rosenblatt believed his machine might classify objects and &#8216;perceive&#8217; (hence the name) its surroundings.</p><p>Funded generously by the U.S. Navy, the algorithm was intended to loosely mimic the basic operations of biological neurons. 
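</line>
The learning rule itself is short enough to sketch. What follows is a generic single-unit illustration of the adjust-weights-on-error idea, not a reconstruction of the Mark I's actual wiring:

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum clears the threshold, else stay quiet (0)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def perceptron_step(weights, bias, inputs, target, lr=0.1):
    """One round of Rosenblatt-style correction: nudge each weight towards
    the right answer, in proportion to the input that fed it."""
    error = target - predict(weights, bias, inputs)  # 0 if the guess was right
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + lr * error
    return new_weights, new_bias
```

Looped over a deck of labelled examples, this rule is guaranteed to converge whenever a straight line can separate the two classes, a proviso that would soon matter a great deal.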
Rosenblatt saw in his work the mechanical embodiment of Turing&#8217;s <a href="https://courses.cs.umbc.edu/471/papers/turing.pdf">proposed</a> &#8216;child machine&#8217;, a device that could acquire knowledge without explicit programming. </p><p>The spectacle worked. The press coverage unlocked a six-figure grant for a purpose-built physical computer, the Mark I Perceptron, and catapulted Rosenblatt from obscure psychologist to Cold-War wunderkind. </p><p>Rosenblatt&#8217;s team built the Mark I in the years that followed at Cornell Aeronautical Laboratory. Technicians soldered 400 photocells into a 20 &#215; 20 &#8216;retina&#8217; and connected them to 512 association units according to a table of random numbers. </p><p>The resulting tangle of wire looked like a copper bird nest, but it seemed to be able to perform some basic intelligence tasks (like identifying simplified silhouettes in target-recognition studies). It might not have been pretty, but it showed promise. </p><p>That was good news, because the military needed a win. The Soviet Union's launch of the Sputnik satellite in 1957 had begun to cast a long shadow over American scientific confidence. </p><p>In laboratories and government offices, anxiety gave way to urgency. Money flowed into universities and research institutes, each hoping to uncover technologies that could secure American supremacy. </p><p>An intelligent machine could put Uncle Sam back on the front foot. America might have been second to orbit, but it was going to win the race for thinking machines. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h2>Boundary problems </h2><p>The Navy&#8217;s &#8216;electronic brain&#8217; was exactly what a nervous nation wanted. Within months, magazines speculated that computers like the Mark I would propel the United States closer to its pulp sci-fi dreams. </p><p>Grainy, high-contrast photos of the Mark I&#8217;s panels and flickering bulbs promised rationality, cleanliness, and progress. The machine provided an irresistible visual shorthand for a better future.</p><p>But not everyone bought it. </p><p>Marvin Minsky and his long-time friend and collaborator Seymour Papert were amongst a growing group of researchers who felt the expectations for the perceptron project drastically overshot reality. </p><p>In 1969, the pair famously put the boot in by writing <em>Perceptrons: An Introduction to Computational Geometry. </em>Underneath the innocuous title lay a clear but troubling idea: single-layer perceptrons could not solve certain basic logical problems like XOR (exclusive-OR).</p><p>The XOR problem says that, given two inputs, the answer is &#8216;true&#8217; if exactly one input is true &#8212; but false if both are true or both are false. The rub for Rosenblatt&#8217;s perceptron was that it relied on drawing a straight line through input data, neatly separating it into categories. </p><p>But XOR was impossible to neatly divide this way, like trying to separate diagonally opposed black and white squares with a single straight slice. </p><p>Yet while their argument was mathematically sound, Minsky and Papert also <em>needed</em> it to be right. 
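Minsky and Papert's objection is easy to check numerically. In the illustrative training run below, a single linear threshold unit masters AND but can never classify all four XOR cases, because no single straight line separates them; three out of four is its ceiling:

```python
def train_perceptron(examples, epochs=100, lr=0.1):
    """Train a single linear threshold unit and return its final accuracy."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in examples:
            out = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            err = target - out  # zero when the guess was right
            w1, w2, b = w1 + lr * err * x1, w2 + lr * err * x2, b + lr * err
    hits = sum(1 for (x1, x2), t in examples
               if (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == t)
    return hits / len(examples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

Here `train_perceptron(AND)` reaches an accuracy of 1.0, while `train_perceptron(XOR)` never exceeds 0.75 no matter how many epochs it is given.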
At the time, the duo were embedded at MIT&#8217;s Project MAC where they worked on symbolic, rule-based systems (a very different way of building AI compared to Rosenblatt&#8217;s proto-connectionism).</p><p>After Minsky and Papert&#8217;s critique, the decline in enthusiasm for the algorithm was unforgiving. State funding in the US (and in the UK a few years later) shifted toward symbolic AI, which was seen as a safer investment given the perceived limitations of neural approaches. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="64" height="20.869565217391305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:64,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I am wary of buying into the &#8216;AI winter&#8217; meme given that a huge number of important contributions to the AI project happened within this period, but I will grant that it is a useful concept if we&#8217;re primarily interested in the number of researchers in the field, the level of funding it sustains, or how many column inches it grabs.  </p><p>Assuming for the purposes of this post that we&#8217;re content to use these yardsticks, what followed was a decade of cold as machine learning approaches like the perceptron fell out of fashion. </p><p>Not that they stayed there for long. </p><p>By the 1980s, it was Connectionism Summer&#8482; <a href="https://www.learningfromexamples.com/p/the-great-hopfield-network-debate">when</a> John Hopfield &#8216;brought neural nets back from the dead.&#8217; A few years later, David Rumelhart, Geoffrey Hinton, and Ronald Williams finished the great resurrection when they <a href="https://www.learningfromexamples.com/p/backpropagation-is-older-than-you">popularised</a> Paul Werbos&#8217; backpropagation method. </p><p>The latter breakthrough showed that multilayer networks could overcome the limitations pointed out by Minsky and Papert. Unlike Rosenblatt&#8217;s single-layer perceptrons, the models of the 1980s contained multiple hidden layers whose weights were tuned by error signals passed back from the layers above them. </p><p>As we discussed in the <a href="https://www.learningfromexamples.com/p/the-neuron-doctrine">first entry</a> in this series, Rosenblatt&#8217;s work was significant because it showed how to put Warren McCulloch and Walter Pitts&#8217; mathematical abstraction of a biological neuron into practice. 
Even early, limited successes showed that connectionism had potential.  </p><p>When perceptrons got stuck in the mud, researchers were forced to find a solution. That solution led to backpropagating neural networks. When backpropagating neural networks proved scalable, we got deep learning and eventually ChatGPT. </p><p>But the through-line from 1958 to today's language models is also ideological. Every major breakthrough in AI has been underwritten by the same awe and anxiety that sent reporters scrambling during the Cold War. </p><p>Today's moment is functionally different, but the underlying narrative is remarkably familiar. Venture capital cash stands in for Navy grants, driven by the conviction that whoever builds the smartest machines wins the future. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Backpropagation is older than you think]]></title><description><![CDATA[AI Histories #6: Backprop begins in the Second World War, not with Werbos or Rumelhart]]></description><link>https://www.learningfromexamples.com/p/backpropagation-is-older-than-you</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/backpropagation-is-older-than-you</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 05 Jun 2025 10:03:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Nrc2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nrc2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nrc2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 424w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 848w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nrc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png" width="764" height="614" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:764,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:881937,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162897684?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nrc2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 424w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 848w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Nrc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c9e931-9fd4-498a-b332-7d4179e1ede2_764x614.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"> William Thomas Rawlinson painting of a Radar Station on the East Coast (1946)</figcaption></figure></div><p>Backpropagation is the stuff that makes neural networks tick. Without it, it&#8217;s possible there&#8217;s no AI project as we know it today. No deep learning. No computer vision. No ChatGPT. </p><p>That&#8217;s because frontier AI systems are learning machines. They adjust to mimic patterns, improve by minimising error, and evolve to get better. Backprop helps them do that. </p><p>Using backprop, a network can make a prediction and calculate how far the prediction is from the true value. The technique takes that error and moves it backwards through the network, layer by layer, using a rule to figure out how much each connection contributed to the mistake. 
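In miniature, that forward-then-backward pass looks like the following. The two-weight network here (one tanh hidden unit feeding one linear output) is a deliberately tiny illustration, not any particular historical system:

```python
import math

def forward(x, w1, w2):
    h = math.tanh(w1 * x)      # hidden activation
    return h, w2 * h           # hidden value and the network's prediction

def loss_and_grads(x, target, w1, w2):
    """One backprop pass: run forward, measure the error, then push it
    back through the network with the chain rule, layer by layer."""
    h, y = forward(x, w1, w2)
    loss = 0.5 * (y - target) ** 2
    d_y = y - target               # dLoss/dy
    d_w2 = d_y * h                 # the output weight's share of the blame
    d_h = d_y * w2                 # error flowing back to the hidden unit
    d_w1 = d_h * (1 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    return loss, d_w1, d_w2
```

A gradient-descent step then subtracts a small multiple of each gradient from its weight. Comparing the results against finite differences is the usual sanity test for a backward pass like this one.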
</p><p>Each weight is then adjusted via the magic of gradient descent so the network gets closer to the right answer next time. </p><p>For many years, backprop was thought of as the brain-child of David Rumelhart and Geoffrey Hinton. Their 1986 paper (with Ronald Williams) showed neural networks could learn internal features. It felt like a new day for machine learning researchers still reeling from Minsky and Papert&#8217;s takedown of the perceptron. </p><p>A little later, received wisdom shifted as credit for backprop went to a Harvard graduate student called Paul Werbos. In his 1974 thesis, Werbos framed the idea as &#8216;reverse&#8208;mode&#8217; optimisation for dynamic systems. Alas, symbolic AI was in vogue and his thesis gathered dust.       </p><p>But as Werbos himself acknowledges, backprop&#8217;s lineage goes much further back. </p><p>Backpropagation is born of optimal control theory, the science of how to steer complex systems toward a goal in the most efficient way possible. Whether it&#8217;s a jet adjusting its thrust or a robot arm learning to move, the challenge is the same: to figure out how to act now in order to do better later. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><h2>Where did backprop come from? </h2><p>The genesis of optimal control theory can be traced to 1940s-era thinking about optimisation, which <a href="https://archive.org/details/operationalresea0000kirb">involved</a> a constellation of techniques known as &#8216;operational research&#8217; in Britain and &#8216;operations research&#8217; in the United States. 
</p><p>During the Second World War, &#8216;operations research&#8217; referred to work that puzzled out the most effective way of achieving a given military objective. The goal of operations research was, as British radio direction-finding pioneer Robert Watson-Watt <a href="http://doi.org/10.1038/160660a0">put it</a>, &#8216;to examine quantitatively whether the user organisation is getting from the operation of its equipment the best attainable contribution to its overall objective.&#8217; </p><p>With roots in the analysis of radar telemetry in late 1930s Britain, the field &#8216;diffused extraordinarily rapidly&#8217; <a href="https://www.jstor.org/stable/223021">through</a> British and American commands during the Second World War. </p><p>Building on research conducted by the US military in the aftermath of the Second World War, American mathematical scientist George Bernard Dantzig worked on the military&#8217;s mechanisation efforts for the Pentagon. </p><p>In 1947, he <a href="http://www.jstor.org/stable/25146928">responded</a> to his assignment by conceiving the simplex algorithm for linear programming, which seeks to achieve the best outcome in a mathematical model where the requirements are represented by linear relationships. </p><p>Dantzig left the Pentagon in 1952 to take up a position in the Mathematics Department of the RAND Corporation in Santa Monica, California. At RAND, Dantzig gave a series of talks, including a lecture attended by Richard Bellman, who was searching for approaches to multistage decision problems. </p><p>One oft-told story is that this lecture gave Bellman his eureka moment, though it was likely that the Soviet mathematician Lev Pontryagin&#8217;s work on the &#8216;maximum principle&#8217; for taking a system from one state to another was just as important. 
</p><p>The techniques Bellman proceeded to develop over the 1950s came to be known as dynamic programming, an algorithmic approach for solving an optimisation problem by breaking the problem down into simpler subproblems. Reflecting on the development of dynamic programming, Bellman <a href="http://www.jstor.org/stable/3088448">noted</a>:</p><blockquote><p>A number of mathematical models of dynamic programming type were analyzed using the calculus of variation. The treatment was not routine since we suffered either from the presence of constraints or from an excess of linearity. An interesting fact that emerged from this detailed scrutiny was that the way one utilized resources depended critically upon the level of these resources, and the time remaining in the process.</p></blockquote><p>Bellman&#8217;s perception marked what some historians see as the transition between &#8216;classical control theory&#8217; and what became known as optimal control theory. Where classical control was concerned with the stabilisation of a given system through continuous monitoring and modification, Bellman&#8217;s realisation involved thinking about a system as a temporally evolving sequence of states.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="64" height="20.869565217391305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:64,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In the 1950s, the US military establishment was fretting about how to steer missiles that had minutes to correct their flight-paths. The problem was unforgiving. Researchers needed to know how to compute the precise adjustments needed to hit a moving target in real time, while accounting for changing wind, velocity, altitude, and fuel constraints.</p><p>The challenge led Richard Bellman to his big idea, or at least lots of smaller ones. Bellman decided to start at the final goal, then work backward to figure out the best decision at each step given the remaining time and resources. </p><p>This idea became a cornerstone of optimal control theory, formalised by Bellman under the name dynamic programming, which sought to find optimal strategies in complex, time-dependent systems. </p><p>By the 1960s, control engineers were combining this approach with what they called &#8216;adjoint equations&#8217; thanks to Pontryagin and his collaborators in the USSR. 
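Bellman's start-at-the-goal move can be sketched as a few lines of backward induction. The staged costs below are invented purely for illustration; they stand in for the thrust-and-fuel trade-offs of a real guidance problem:

```python
def backward_induction(stages, terminal_cost):
    """stages[t] maps each state to a list of (cost, next_state) choices.
    Work from the final stage backwards, recording the cheapest
    cost-to-go from every state at every stage."""
    values = [dict() for _ in range(len(stages))] + [dict(terminal_cost)]
    for t in range(len(stages) - 1, -1, -1):
        for state, choices in stages[t].items():
            values[t][state] = min(cost + values[t + 1][nxt]
                                   for cost, nxt in choices)
    return values

# A two-stage toy problem: from 'A' we pick a route towards the goal states.
stages = [
    {'A': [(1, 'B'), (4, 'C')]},
    {'B': [(5, 'D'), (2, 'E')], 'C': [(1, 'D'), (3, 'E')]},
]
terminal = {'D': 0, 'E': 1}
```

Here `backward_induction(stages, terminal)[0]['A']` comes out at 4, reached via B and then E: as Bellman observed, the best decision at each stage depends on the resources and time remaining.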
The method involved nudging the final outcome and tracing how that change would have flowed backward through the system. By doing so, they could figure out how small adjustments earlier on (e.g. thrust, angle, or speed) would affect the final result. </p><p>In the closing years of the decade, Bryson and Ho published <em>Applied Optimal Control</em>. Page after page shows the same move. Pose a goal (say, minimise fuel while hitting Mach 3). Derive Euler&#8211;Lagrange equations. Run the gradient in reverse through the system. Update parameters and repeat. </p><p>In the 1970s, Finnish mathematician Seppo Linnainmaa was working on numerical stability. He wanted a way to get exact derivatives from a computer program by following the logic of the code itself. His solution was to record every step a computer took when calculating a function, then replay those steps in reverse to figure out how each input affected the output. </p><p>The work of Bryson, Ho, and Linnainmaa became a mainstay for a new generation of machine learning researchers in the 1980s, with the trio cited in what felt like every other paper during the height of the neural network revival. </p><p>As Yann LeCun put it in a <a href="https://new.math.uiuc.edu/MathMLseminar/seminarPapers/LeCunBackprop1988.pdf">paper</a> edited by Geoffrey Hinton in 1988: </p><blockquote><p>From a historical point of view, back-propagation had been used in the field of optimal control long before its application to connectionist systems. Nevertheless, the interpretation of back-propagation in the context of connectionist systems, as well as most related concepts are recent, and the historical and scientific importance of [Rumelhart et al., 1986] should not be overlooked. The concepts are new, if not the algorithm.</p></blockquote><p>The story of backpropagation reminds us that scientific practice is often less about bolt-from-the-blue genius than about clever recycling. 
If we&#8217;re careful, we can trace the idea from Dantzig&#8217;s wartime military modelling through Bellman&#8217;s dynamic-programming rockets, Bryson and Ho&#8217;s calculus, Linnainmaa&#8217;s derivatives program, and Werbos&#8217; graduate-school thesis to Rumelhart and Hinton&#8217;s psychological wrapping. </p><p>What modern AI calls a theory of learning is something like a travelling mathematical trick, one that has migrated across radar rooms, missile labs, and psychology departments. With each life it accumulates new metaphors and new meanings, but the basic idea is still the same. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Please ignore the man behind the curtain ]]></title><description><![CDATA[AI Histories #5: The Mechanical Turk]]></description><link>https://www.learningfromexamples.com/p/the-man-behind-the-curtain</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/the-man-behind-the-curtain</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 29 May 2025 10:14:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wDMg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The pledge drive to make <em>Learning From Examples</em> a full-time thing <strong>concludes this week.</strong> A $5 pledge doesn&#8217;t cost anything today, but does tell me I have your support when I eventually flip the switch on paid subscriptions (probably later this year). 
To everyone who has pledged so far: thank you. I&#8217;ve felt incredibly fortunate to be on the receiving end of your generosity.   </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wDMg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wDMg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wDMg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg" width="672" height="519.5654952076677" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:626,&quot;resizeWidth&quot;:672,&quot;bytes&quot;:152860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wDMg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wDMg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e639d67-3616-43b9-8f69-91aea4dd8171_626x484.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">The Mechanical Turk via Wikipedia Commons</figcaption></figure></div><p>Napoleon liked chess. He played against courtiers and companions, and enjoyed a quick game with the generals, diplomats, and intellectuals who moved in his orbit. By all accounts he was pretty good, which is probably not all that surprising. </p><p>His most famous opponent is known only to die-hard chess enthusiasts. That man was likely a German chess master called Johann Baptist Allgaier. Unfortunately, we don&#8217;t know with absolute certainty who the player was. That&#8217;s because neither Napoleon nor the audience could see them. 
</p><p>Instead, in the packed-out palace of Sch&#246;nbrunn in Vienna, one of history&#8217;s best-known figures thought he was playing an automaton. Dressed in the style of an Ottoman (or rather what Austrian high society thought an Ottoman looked like), that machine was the Mechanical Turk. </p><p>Behind its mahogany frame and sliding panels was the chess master. Whether Allgaier or a mystery player, our man no doubt felt uncomfortable crammed into a small wooden box. But whatever the impracticalities, they weren&#8217;t enough to prevent Napoleon&#8217;s opponent from guiding the automaton&#8217;s actions from underneath the table. </p><p>Reports of the game &#8212; which admittedly are on the <a href="https://www.chess.com/blog/ThePawnSlayer/the-chess-player-who-defeated-an-emperor">hazy</a> side &#8212; tell us that the emperor made an illegal move to test the Turk&#8217;s reaction. His opponent reset the piece to its original position. Napoleon tried again with another illegal move. The Turk responded by removing the offending piece from the board. </p><p>On the third attempt, the story goes, the Turk dramatically swept all the pieces off the board. Satisfied he had sufficiently needled the automaton, the Frenchman tried his hand at a legitimate game. He was soundly beaten. </p><h2>Vienna waits for you  </h2><p>The Turk was constructed in 1770 by Wolfgang von Kempelen, a Hungarian nobleman with a knack for building things. Kempelen was inspired to construct the Turk following a visit to the Vienna court of Maria Theresa of Austria, where the great magician Fran&#231;ois Pelletier was performing a show. </p><p>Kempelen liked what he saw, but thought he could do one better. Resolving to upstage the Frenchman, he set about creating a show that blended the spectacle of performance and the precision of engineering. </p><p>The result was the Turk. On the outside, his machine looked like a life-sized figure clad in robes and a turban seated behind a wooden cabinet. 
Its left arm held a long pipe at rest, while its right lay on top of a large cabinet. </p><p>On the inside, the machine was a tangle of clockwork and mirrors. Its board was magnetised, which allowed the operator to track and manoeuvre pieces from inside the device. </p><p>Opening doors on one side revealed clockwork-like gears, but the section was constructed to allow would-be inspectors to see through the device under certain conditions. Hidden doors under the model showed cogs to maintain the illusion that the cabinet was filled with mechanisms. </p><p>In reality, neither the visible clockwork nor the drawer extended fully to the back of the table. A sliding seat on the interior allowed our hidden chess player to shift position as the dummy doors were opened to dazzle onlookers. </p><p>The act began by unlatching the doors, much as a magician shows the audience there&#8217;s nothing up his sleeve. On its first exhibition in the Vienna palace, Kempelen made an elaborate show of allowing the crowd to inspect the device. </p><p>Once the audience was suitably convinced nothing untoward was going on, the games began. One of the first to play was Count Ludwig von Cobenzl, an Austrian courtier at the palace. He was quickly defeated by the Turk&#8217;s aggressive style of play, while a host of others who fancied their chances met a similar fate. </p><p>Despite the fanfare caused by the Turk, its designer quickly lost interest in the project. The machine played only a handful of opponents in the ten years following its debut. </p><p>Kempelen was by all accounts bored by his creation. He wanted to work on steam engines rather than spend time assembling and disassembling the Turk, which he famously decried as a &#8216;bagatelle&#8217; (that is, a trifle). </p><p>But the Habsburg crown had other ideas. In 1781, Kempelen was ordered by Emperor Joseph II to reconstruct the Turk and deliver it to Vienna for a state visit from Grand Duke Paul of Russia and his wife. 
After another successful appearance, the court suggested a tour of Europe. </p><p>Wary of upsetting the powers that be, Kempelen agreed. </p><p>The first stop was France, where the Turk faced off against opponents in Versailles and Paris. Against more practised players, the machine &#8212; or rather the man under the table &#8212; won some matches but lost others. </p><p>The Turk&#8217;s final game in the City of Light was against Benjamin Franklin, who was serving as ambassador to France from the United States. Franklin reportedly enjoyed the game with the Turk, but was ultimately beaten. </p><h2>Second Act</h2><p>After Kempelen&#8217;s death, the Bavarian musician and inventor Johann Nepomuk M&#228;lzel purchased the Turk from his son in 1805. Best known for perfecting the metronome, M&#228;lzel once again took the Turk to the courts of Europe (including for its famous bout with Napoleon). </p><p>Under M&#228;lzel, the Turk became a symbol of modernity. It returned to Paris and London, eventually crossing the Atlantic to show American audiences what they had been missing. </p><p>In Richmond, Virginia, a young Edgar Allan Poe watched it play. In his essay <em>Maelzel&#8217;s Chess-Player</em>, Poe tried to deduce the mechanics of the system. He mostly failed, but his broader point was that the influence of the Turk lay in its ability to evoke mystery.</p><p>By this point, the Turk&#8217;s fame wasn&#8217;t really about chess. The illusion endured because it touched a nerve during an age in which politics, technology, and economics began a great reconfiguration. </p><p>In 1838, Johann M&#228;lzel died aboard a ship off the American coast, and the automaton &#8212; now worn from decades of travel &#8212; was left to gather dust in a Philadelphia museum. </p><p>Then, in 1854, the Turk went up in flames when a blaze engulfed the museum. With neither Turk nor master still standing, former operators and witnesses came forward to expose its secrets. 
</p><p>The world now knew how the figure inside used levers to manipulate the arm, how the cabinet&#8217;s compartments were rigged for misdirection, and how the illusion was kept alive through carefully orchestrated performance.</p><p>The story of the Turk is about intelligence as theatre. Its secrets remained hidden for so long because people wanted to believe that the mind could be made mechanical. For those of us interested in where AI came from and where it is going, the Turk reminds us that artificial intelligence is part reality and part projection. </p><p>That isn&#8217;t to say modern AI is a hoax or anything like it, but rather that conceptions of intelligence are fluid. We read into intelligent machines whatever we need to. Some people see the entire AI project as an exercise in smoke and mirrors. Others see god-like machines just around the corner. </p><p>The point is that AI is a container into which we pour our beliefs and biases. That was true when Napoleon faced off against the man in the box, and it&#8217;s still true over 200 years later. </p>]]></content:encoded></item><item><title><![CDATA[Self-sustaining systems]]></title><description><![CDATA[AI Histories #4: Automata theory]]></description><link>https://www.learningfromexamples.com/p/self-sustaining-systems</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/self-sustaining-systems</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Thu, 22 May 2025 09:25:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pUDa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The May pledge campaign is still in full swing. We're on track, but if you haven&#8217;t yet, I&#8217;d love your help to get there. 
<strong>Pledging $5 lets me know you&#8217;re behind the project</strong> as I work toward turning on paid subscriptions (and hopefully making this my full-time work). Huge thanks to everyone who&#8217;s pledged already. It really does mean a lot.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pUDa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pUDa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pUDa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pUDa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pUDa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pUDa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg" width="1200" height="916" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Gothic Arch, from \&quot;Carceri d'invenzione\&quot; (Imaginary Prisons), Giovanni Battista Piranesi (Italian, Mogliano Veneto 1720&#8211;1778 Rome), Etching, engraving, sulphur tint or open bite, burnishing; first state of six Robison) &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Gothic Arch, from &quot;Carceri d'invenzione&quot; (Imaginary Prisons), Giovanni Battista Piranesi (Italian, Mogliano Veneto 1720&#8211;1778 Rome), Etching, engraving, sulphur tint or open bite, burnishing; first state of six Robison) " title="The Gothic Arch, from &quot;Carceri d'invenzione&quot; (Imaginary Prisons), Giovanni Battista Piranesi (Italian, Mogliano Veneto 1720&#8211;1778 Rome), Etching, engraving, sulphur tint or open bite, burnishing; first state of six Robison) " srcset="https://substackcdn.com/image/fetch/$s_!pUDa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!pUDa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pUDa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pUDa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a950e7-99ea-4427-8f77-656f2d55a79c_1200x916.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">The Gothic Arch, from "Carceri d'invenzione" (Imaginary Prisons) by Giovanni Battista Piranesi ca. 1749&#8211;50</figcaption></figure></div><p>Automata theory is difficult to explain. </p><p>For many, it gets lumped in with the self-duplicating machine. My mind wanders to the <a href="https://www.centauri-dreams.org/2022/03/08/probing-von-neumann-expansion/">von Neumann probe</a>: a hypothetical spacecraft that could fly to a nearby star, gobble up materials to produce copies of itself, and continue on to the next system until the entire galaxy had been checked out. </p><p>It&#8217;s a fun image, but one only partially connected to the topic at hand. At its core, automata theory is about abstract structures that evolve through finite internal states according to fixed rules. They are conceptual models, mathematical frameworks that capture precisely how systems transform, sustain, or collapse over time. </p><p>It&#8217;s an idea that shape-shifts, one that likes to be all things to all people. Or it is in AI, anyway.</p><p>That&#8217;s because automata theory has shaped the two foundational schools of the field. Both the symbolic approach in which rules are hard-coded into a system and the connectionist branch where systems learn from examples (cue klaxon) have been influenced by its ideas. </p><p>For the former, that means designing systems that follow clear, rule-based transitions; for the latter, it involves reimagining those transitions as fluid patterns rather than fixed instructions. 
</p><p>For the purposes of this post, I&#8217;m defining an automaton as &#8216;a self-contained system that responds to inputs by changing state, whose behaviour is determined by its design.&#8217; In practice, that means anything from a light switch to a language model can be seen as an automaton (so long as its next move depends on its current state and some external input). </p><h2>Letters and numbers </h2><p>Automata theory begins as an attempt to pin down the limits of reason. In the 1930s, a handful of logicians&#8212;Alonzo Church at Princeton, Alan Turing in Cambridge, and Emil Post in New York&#8212;found themselves asking what it means to compute something. </p><p>At stake was whether all of mathematics could, in principle, be reduced to symbolic procedures carried out by rule-following agents. To answer the question, these thinkers built abstract machines. </p><p>Church used &#955;-calculus, Post proposed rewriting systems, and Turing devised a model so evocative it would take his name. Each was a kind of automaton in that it described a self-contained system that processes inputs and moves through internal states according to fixed rules.</p><p>It&#8217;s from this moment that automata theory begins to take form as a toolkit for describing procedural reasoning with mathematics. What started as a way to solve problems in logic would eventually lay the conceptual groundwork for the artificial intelligence project.</p><p>As symbolic automata were being marshalled to model thought, von Neumann wondered whether the principles of life itself could be described in the language of mathematics. Working with Stanislaw Ulam at Los Alamos, he devised the idea of a self-replicating automaton. 
It was a system that, given a set of instructions, could construct a copy of itself within a defined grid-like universe.</p><p>By the 1950s, the question was no longer &#8216;can something be computed?&#8217; Theorists had settled what could be computed in principle, so they turned instead to structure. They wanted to know how different types of machines process different kinds of inputs.</p><p>Stephen Kleene, working with the mighty RAND Corporation, <a href="https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf">established</a> the idea of regular expressions (patterns recognised by simple machines that move deterministically between states). Then Michael Rabin and Dana Scott <a href="https://www.youtube.com/watch?v=HRKR4NJ44T8">showed</a> how automata could be adapted to include non-determinism by exploring branching paths with multiple possible futures. </p><p>And from linguistics, a young Noam Chomsky imported a powerful organising framework. He saw a hierarchy of formal languages, each defined by the type of automaton that could recognise it. Chomsky showed that you can rank languages based on how complicated a machine you&#8217;d need to recognise them:</p><ul><li><p>At the bottom, you have regular languages, recognised by simple machines (finite automata).</p></li><li><p>Above that are context-free languages, which need a machine with a memory stack.</p></li><li><p>Then come context-sensitive languages, needing a still more powerful device known as a linear-bounded automaton. </p></li><li><p>At the top, you have recursively enumerable languages, which require a full Turing machine (the most powerful kind of abstract computer).</p></li></ul><p>In Chomsky&#8217;s hands, automata became models of cognition. For if language could be parsed by machines, perhaps the mind could be formalised as a generative system wired to produce and recognise structure. 
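To make the hierarchy's bottom rung concrete, here is a minimal sketch of a finite automaton in Python. The state names and the particular regular language (binary strings with an even number of 1s) are my illustrative choices rather than anything from the sources above; the point is only that the machine responds to each input symbol by changing state, exactly as in the working definition earlier.

```python
# A toy deterministic finite automaton (DFA): the simplest machine in
# Chomsky's hierarchy. It recognises the regular language of binary
# strings containing an even number of 1s.

# The fixed rules: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def accepts(string):
    """Start in 'even', follow one transition per symbol, and accept
    iff the machine halts in the accepting state 'even'."""
    state = "even"
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return state == "even"

print(accepts("1100"))  # True  (two 1s)
print(accepts("100"))   # False (one 1)
```

A pushdown or Turing machine differs only in what extra memory the transition step is allowed to consult.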
</p><h2>Transition and constraint</h2><p>By the second half of the 20th century, automata theory was something like a working metaphor for the mind. In symbolic AI and cognitive science, automata provided a language for representing internal mental states, transitions, and procedural rules. </p><p>But in parallel, a different tradition was emerging. Inspired less by formal logic than biology, researchers began to design networks that learned patterns over time. These connectionist models were a different kind of automata. They were systems defined by internal states that evolved through transitions, only here the transitions were learned rather than programmed.</p><p>Whether hand-coded or trained, both symbolic and connectionist systems relied on the idea that intelligence unfolds through structured change. Rules or learning, thinking became a process of moving from one configuration of the system to the next. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nsas!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png" width="106" height="34.56521739130435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:460,&quot;resizeWidth&quot;:106,&quot;bytes&quot;:12198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162870944?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a011107-4790-4b64-9f4c-4b8fcace22de_460x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Nsas!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 424w, 
https://substackcdn.com/image/fetch/$s_!Nsas!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 848w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1272w, https://substackcdn.com/image/fetch/$s_!Nsas!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e86f8a1-ceb0-4b5e-97c4-6b4f740ff3dc_460x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To study automata is to ask how systems maintain identity through transformation. Whether parsing a sentence or navigating a decision tree, they provide a way of thinking about structure in motion.</p><p>At their most abstract, they embody a vision of intelligence as patterned change within bounds. They help us model how rules unfold and how memory shapes behaviour. </p><p>Every time we train a model, we rely on assumptions about transition and constraint. Even neural networks, for all their complexity, are still systems that evolve through internal states.</p><p>I like automata theory because it suggests that intelligence, artificial or otherwise, is both stable and dynamic. 
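The state-and-transition picture sketched here can be made concrete with a toy finite automaton. The machine, its alphabet, and the accepting condition below are invented purely for illustration:

```python
# Toy deterministic finite automaton (illustrative only): accepts binary
# strings containing an even number of 1s. A fixed set of states, plus
# rules for moving between them — patterned change within bounds.

TRANSITIONS = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}

def run(machine, start, accepting, tape):
    state = start
    for symbol in tape:
        state = machine[(state, symbol)]  # structured change: state -> state
    return state in accepting

result = run(TRANSITIONS, "even", {"even"}, "1101")  # three 1s: rejected
```

Feed it a string and it moves deterministically from state to state; "identity through transformation" is just the fixed transition table.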
It&#8217;s a timely reminder that underneath probabilistic outputs lies the idea that thought is a process, and that process can be mapped.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[The great Hopfield network debate]]></title><description><![CDATA[AI Histories #3: A bitter clash about machine learning's finest hour]]></description><link>https://www.learningfromexamples.com/p/the-great-hopfield-network-debate</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/the-great-hopfield-network-debate</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Fri, 16 May 2025 09:26:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!L1uz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve recently opened reader pledges with a goal of reaching<strong> 100 by May 31</strong>. So far we&#8217;re on track, but I need your help to get there. All you have to do is pledge $5, so that I know I have your support for the moment I turn on paid subscriptions (and eventually try to make this work full time). Thank you to everyone who has kindly pledged so far. It means a lot. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L1uz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L1uz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 424w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 848w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L1uz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg" width="800" height="534" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:534,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Steve Grossberg, Boston University Wang Professor of Cognitive and Neural Systems&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Steve Grossberg, Boston University Wang Professor of Cognitive and Neural Systems" title="Steve Grossberg, Boston University Wang Professor of Cognitive and Neural Systems" srcset="https://substackcdn.com/image/fetch/$s_!L1uz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 424w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 848w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!L1uz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49bb59d8-7c80-41cc-86ad-50350d8cbe1a_800x534.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Stephen Grossberg in 2015 via Boston University</figcaption></figure></div><p>John Hopfield won the Nobel Prize for physics last year. Alongside Geoffrey Hinton, Hopfield was recognised for foundational contributions to machine learning that would ultimately make the likes of ChatGPT and Gemini possible. His great invention was the network that bears his name, whose development is generally considered to be an essential moment in the machine learning canon. Speaking to the awarding committee, he <a href="https://www.nobelprize.org/prizes/physics/2024/hopfield/interview/">discussed</a> how he first began to approach work on the systems:</p><blockquote><p>You don&#8217;t leap into a problem overall saying, I want to understand how mind works. 
You have to build up from the bottom. If you were doing weather, you would say, well, I want to understand what storms are without going back to interacting air nitrogen molecules.</p></blockquote><p>Analogies, in other words, are key. Hopfield&#8217;s 1982 <a href="https://www.pnas.org/doi/epdf/10.1073/pnas.79.8.2554">paper</a> is commonly credited with the invention of the &#8216;Hopfield network&#8217;, a type of recurrent neural network where all units are connected to each other. Typically used for pattern recognition and memory storage, the systems seemed to show that there was still life in the &#8216;connectionist&#8217; branch of AI (the ancestor of deep learning). As researcher Tom Schwartz <a href="https://books.google.co.uk/books?id=29O9D0kQi70C&amp;printsec=frontcover&amp;source=gbs_ge_summary_r&amp;cad=0#v=onepage&amp;q=tom&amp;f=false">put it</a>: &#8216;Hopfield should be known as the fellow who brought neural nets back from the dead.&#8217;</p><p>Hopfield is no doubt a hugely important figure, but he wasn&#8217;t the first to design and implement a fully connected network. Stanford researcher William Little had previously introduced versions of the networks in <a href="https://link.springer.com/chapter/10.1007/978-1-4613-0411-1_12">1976</a>, and so had Japanese neuroscientist Shun'ichi Amari in <a href="https://ieeexplore.ieee.org/document/1672070">1972</a>. </p><p>The cognitive scientist Stephen Grossberg went further, arguing that he built the specific architecture that Hopfield made famous. As Grossberg, who first wrote about the ideas described in the Hopfield model in 1957, <a href="https://books.google.co.uk/books?id=-l-yim2lNRUC&amp;printsec=copyright&amp;redir_esc=y#v=onepage&amp;q&amp;f=false">put it</a>: &#8216;I don&#8217;t believe that this model should be named after Hopfield. He simply didn&#8217;t invent it. 
I did it when it was really a radical thing to do.&#8217;</p><p>But as every scientist knows, research needs rhetoric and papers need presentation. Hopfield removed chunks of mathematical descriptions in favour of compelling prose written for cognitive scientists. He published his paper in the influential <em>Proceedings of the National Academy of Science</em> and travelled extensively to talk about &#8216;his&#8217; networks. These are the primary reasons that today we know the systems not as Grossberg networks, but as Hopfield networks.</p><p>One of the paper&#8217;s most influential ideas involved conceptualising the networks as &#8216;spin glasses&#8217;, a term derived from the magnetic state of matter. &#8216;Spin&#8217; is a quantum property of particles that allows them to behave as miniscule magnets that can point in different directions. &#8216;Glass&#8217; draws from an analogy with conventional glass, which is known for its irregular, amorphous structure at the atomic level. Atoms are arranged in a disordered state in a typical glassy material, unlike in crystal forms where atoms have a more regular arrangement.</p><p>In the Hopfield network, the system&#8217;s dynamics&#8212;how these states settle into patterns&#8212;are inspired by the way magnetic spins interact in physical systems. Hopfield&#8217;s work connected this idea, the potential for systems to transition from a disordered state to a stable one, to the concept of &#8216;associative memory&#8217;. Based loosely on the brain, the associative memory concept holds that if you give a system a piece of the desired output it can &#8216;remember&#8217; the rest.</p><p>The connection was straightforward: just as a spin glass system transitions from a high energy state to a low energy state, a Hopfield network minimises its energy to represent and retrieve stored patterns. 
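That settling process can be sketched in a few lines of code. This is an illustrative toy, not Hopfield's original formulation: the pattern, network size, and corruption step are all invented, and the weights use the Hebbian outer-product rule described later in this piece.

```python
import numpy as np

# Minimal Hopfield network sketch (illustrative only).
# States are +/-1 vectors; weights come from the Hebbian outer-product rule.

def train(patterns):
    """Store patterns Hebbian-style: units that fire together wire together."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)  # no self-connections
    return W / len(patterns)

def energy(W, s):
    """Hopfield's energy function; asynchronous updates never increase it."""
    return -0.5 * s @ W @ s

def recall(W, s, steps=100):
    """Flip one randomly chosen unit at a time toward lower energy."""
    s = s.copy()
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Store one pattern, then recover it from a corrupted cue.
stored = np.array([[1, -1, 1, -1, 1, -1, 1, -1]])
W = train(stored)
cue = stored[0].copy()
cue[:2] *= -1  # corrupt two units
out = recall(W, cue)
```

The corrupted state sits higher on the energy surface than the stored memory; each update rolls it downhill until it lands in the stored pattern's basin.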
It was this analogy, which explained how a sea of small parts could settle into stable states representing stored memories, that encouraged the next generation of researchers to couple the operation of Hopfield networks with principles similar to those observed in physical systems.</p><p>Jack Cowan, an influential researcher who worked with Hopfield, <a href="https://books.google.co.uk/books?id=-l-yim2lNRUC&amp;printsec=copyright&amp;redir_esc=y#v=onepage&amp;q&amp;f=false">understood</a> the symbolic currency of the idea. He said: &#8216;I think that's neat stuff, but I still think it's an artificial system, as is the antisymmetric one. It may have nothing to do with the way things really work in the nervous system, but it's a very interesting idea.&#8217; </p><p>By drawing a parallel between associative memory and the behaviour of physical systems like spin glasses, Hopfield reinforced the idea that complex cognitive processes could emerge from the collective behaviour of simple interacting units. This perspective supported the well-established notion that the brain&#8217;s higher-order cognitive functions could be explained through the interactions of simpler components.</p><p>As Hopfield <a href="https://www.pnas.org/doi/epdf/10.1073/pnas.79.8.2554">explained</a>: &#8216;Computational properties of use to biological organisms or to the construction of computers can emerge as collective properties of systems having a large number of simple equivalent components (or neurons).&#8217; The result, according to this model, is that intelligence is an emergent property that may be produced through the interaction of smaller, simpler units. 
Faced with such a conclusion, it is perhaps unsurprising that the paper caught fire.</p><h3><strong>Toy models</strong></h3><p>To explain the functioning of the systems, Hopfield leaned heavily on the famous law coined by Canadian psychologist Donald Hebb: &#8216;When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A&#8217;s efficiency, as one of the cells firing B, is increased.&#8217; You might know it as &#8216;neurons that fire together, wire together.&#8217; </p><p>Of course, Hopfield networks are not brains. And neither are they spin glasses. These types of comparisons are better <a href="https://philpapers.org/rec/REUUTM">thought</a> of as &#8216;toy models&#8217; that seek to help scientists understand the world. Frank Rosenblatt, on whose famous perceptron algorithm Hopfield aimed to build, <a href="https://safari.ethz.ch/digitaltechnik/spring2018/lib/exe/fetch.php?media=neurodynamics1962rosenblatt.pdf">stressed</a> the importance of toy models to AI research in 1961:</p><blockquote><p>&#8216;The model is a simplified theoretical system, which purports to represent the laws and relationships which hold in the real physical universe&#8230;the model deliberately neglects certain complicating features of the natural phenomena under consideration, in order to obtain a more readily analyzed system, which will suggest basic principles that might be missed among the complexities of a more accurate representation.&#8217;</p></blockquote><p>These models tell researchers about the problem they are trying to solve, but they also widen the epistemological field to accommodate new perspectives. 
Philosophers of science Knuuttila and Loettgers, for example, have <a href="https://www.sciencedirect.com/science/article/abs/pii/S0039368114000612">shown</a> that mental models &#8216;provide modelers with a powerful cognitive strategy to transfer concepts, formal structures, and methods from one discipline to another&#8217;. </p><p>Through this process of analogical reasoning, the twin abstractions of spin glass modelling and neurophysiology offered a way of describing the functioning of the systems that was both persuasive <em>and</em> epistemically valuable. It may have won Hopfield a Nobel Prize, but that would have been small comfort to Stephen Grossberg. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Emergence machines]]></title><description><![CDATA[AI Histories #2: The birth of genetic algorithms]]></description><link>https://www.learningfromexamples.com/p/emergence-machines</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/emergence-machines</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Fri, 09 May 2025 08:39:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8ca4eb6-7366-4711-97b5-06bdc81f9ede_1120x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!_jac!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_jac!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_jac!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_jac!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_jac!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_jac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg" width="1120" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125247,&quot;alt&quot;:&quot;IBM 700 Series&quot;,&quot;title&quot;:&quot;IBM 700 
Series&quot;,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="IBM 700 Series" title="IBM 700 Series" srcset="https://substackcdn.com/image/fetch/$s_!_jac!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_jac!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_jac!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_jac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00629aea-414d-402f-97f3-d72f68a48b46_1120x630.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The IBM 700 series. </figcaption></figure></div><p>Friday nights were game night for John Holland. Once every fortnight, Holland and a small group of young IBM researchers met to play Kriegspiel, poker, and Go. It was the summer of 1951 in the city of Poughkeepsie in New York State, and IBM was racing to build its first commercial programmable computer.</p><p>By day, the researchers worked to figure out how circuits, memory, and instruction sets worked together to support machine-language programming in the IBM 701. By night, when they weren&#8217;t trading cards, they tested it. Some were showing the machine how to play checkers. Holland, fresh from his undergraduate degree at MIT, was teaching it to learn. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Baeo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Baeo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 424w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 848w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1272w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png" width="148" height="60.87547169811321" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:530,&quot;resizeWidth&quot;:148,&quot;bytes&quot;:11942,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162676340?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Baeo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 424w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 848w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1272w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>John Holland grew up amongst the quiet plains and soy&#8209;bean factories of rural Indiana. 
Born in 1929 to a businessman father and an adventurous mother who learned to fly in her forties, Holland was encouraged to try anything that took his fancy. </p><p>A chemistry set sparked his early love of science, and the young Holland eventually found his way to MIT. There, as an undergraduate physics major, he undertook his bachelor&#8217;s thesis on Whirlwind, the university&#8217;s digital computer built for missile detection. </p><p>After MIT, Holland took a job at IBM&#8217;s Poughkeepsie lab in 1950. There he joined the small group building the IBM 701, the company&#8217;s first commercial electronic computer. Nicknamed the Defense Calculator, the 701 was a room-sized behemoth made up of tubes and magnetic drums.  </p><p>By 1959 Holland had earned what was essentially Michigan&#8217;s first computer-science PhD, with a dissertation studying feedback loops in early neural networks. Spurred by his encounters with the IBM 701 and buoyed by his academic work, Holland said he began to think &#8216;about genetics and adaptive systems&#8217;. </p><p>A key moment came in 1955 when Holland stumbled on a book called <em>The Genetical Theory of Natural Selection</em> written by Ronald Fisher. Fisher, generally regarded as one of the great figures in the history of statistics, set out to show how laws of inheritance could underpin natural selection. He introduced the idea that evolution is about changes in gene frequencies within populations, and formulated the famous &#8216;<a href="https://academic.oup.com/evolut/article/78/5/803/7624408">fundamental theorem of natural selection</a>,&#8217; which roughly states that a population can only get fitter as fast as there&#8217;s useful genetic variety to work with.</p><p>In Fisher&#8217;s equations Holland saw a template for an algorithm that could encode potential solutions as &#8216;genes&#8217; to let the fitter ones reproduce and occasionally mutate. 
The combination of mutation and reward meant he just needed to define a target and watch the algorithm adapt its way to an optimal solution. </p><p>In practice, that meant treating possible solutions like digital organisms. The system evaluated each string of code according to how well it performed, then selected the stronger ones to combine and reproduce. Their code was mixed, mutated, and eventually used to create a new generation of candidate solutions.</p><p>The process has an advantage over simpler search methods. Hill climbing algorithms improve a single solution piece by piece but get stuck on local peaks. It&#8217;s easy to produce a good answer but hard to produce a great one. Genetic algorithms avoid this fate by maintaining a diverse population of solutions. Through mutation and recombination, they can occasionally make bold jumps into less promising territory that turn out to lead somewhere better.</p><p>Holland published the first comprehensive account of genetic algorithms in his 1975 book <em>Adaptation in Natural and Artificial Systems</em>. The work aimed to lay a general &#8216;mathematical theory of adaptation&#8217; that used genetic algorithms as a central tool to solve complex optimisation problems. </p><p>Genetic algorithms are good at things like shaping an aircraft wing, designing an electronic circuit, or routing a network. They work well for tasks for which you can&#8217;t easily design a solution from first principles, but you can test how well any given attempt performs. </p><p>In machine learning, genetic algorithms are used when there&#8217;s no clear way to calculate the best setup (like deciding which features to include in a model or tuning parameters that control how it learns). If you can&#8217;t derive the answer but you can run it and see how well it works, it&#8217;s a problem that genetic algorithms might be good at.   
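The loop described here — evaluate, select, recombine, mutate — can be sketched in a few lines. This is a toy version under invented assumptions (bit-string genomes, a count-the-ones fitness function, tournament selection, and arbitrary parameter values), not Holland's own formulation:

```python
import random

# Toy genetic algorithm in the spirit of Holland's scheme (illustrative only).
# Candidate solutions are bit strings; fitness counts the 1s ("OneMax").

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 32, 40, 60, 0.02

def fitness(genome):
    return sum(genome)

def select(population):
    """Tournament selection: the fitter of two random candidates reproduces."""
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point recombination mixes two parents' 'genes'."""
    point = random.randrange(1, GENOME_LEN)
    return p1[:point] + p2[point:]

def mutate(genome):
    """Occasionally flip a bit — the source of bold jumps into new territory."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
```

Because the population stays diverse, crossover can combine partial solutions from different parents — the kind of move a hill climber, improving one candidate piece by piece, cannot make.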
</p><h2>Emergence </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rsFy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rsFy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rsFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg" width="1200" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What makes Romanesco broccoli so mathematically perfect? | Salon.com&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What makes Romanesco broccoli so mathematically perfect? | Salon.com" title="What makes Romanesco broccoli so mathematically perfect? | Salon.com" srcset="https://substackcdn.com/image/fetch/$s_!rsFy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rsFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062c4447-2d61-4c99-a9eb-85681f7333b8_1200x810.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Romanesco broccoli displays fractal geometry with a logarithmic spiral arrangement that follows the Fibonacci sequence. Its complex structure emerges from simple cellular growth rules. </figcaption></figure></div><p>Holland extended his framework to what he called Learning Classifier Systems (LCS) in the 1980s. These were evolving collections of simple if-then rules that compete, cooperate, and adapt based on feedback. Instead of solving a fixed problem, they learn how to behave in changing environments. One system learned to play checkers. Another optimised industrial pipeline flow in real time. The idea is that they weren&#8217;t being pre-programmed, but they could learn which rules worked and reinforce them over time.</p><p>It was a curious marriage. 
The systems used if-then rules, the bread and butter of symbolic AI, but fused them with learning &#8212; the hallmark of connectionist models like neural networks. In most of the AI industry at the time, these approaches were seen as opposites (and for the most part still are today). Rules are explicit, logical, and human-readable. Learning is messy, statistical, and ambiguous. </p><p>But Holland didn&#8217;t see a contradiction. He saw a system where rules could be treated like genes. They were hypotheses to be tested, combined, and evolved. Each rule was part of a population and scored according to how well it performed. Useful ones were strengthened and reused, while weaker ones were discarded and replaced by new variations. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Baeo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Baeo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 424w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 848w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png" width="148" height="60.87547169811321" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:530,&quot;resizeWidth&quot;:148,&quot;bytes&quot;:11942,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162676340?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Baeo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 424w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 848w, 
https://substackcdn.com/image/fetch/$s_!Baeo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1272w, https://substackcdn.com/image/fetch/$s_!Baeo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d261a7-1efa-4c62-9cbd-474155c7c8dd_530x218.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In later years Holland became a leading voice in the science of complex adaptive systems. He co-founded the Santa Fe Institute, an interdisciplinary hub for studying complexity. His philosophy was consistent: intelligence emerges from the interaction of many simple parts, not from a single clever algorithm. </p><p>Holland viewed any group of interacting, adaptive entities&#8212;be it neurons in a brain, ants in a colony, or humans in a city&#8212;as essentially the same phenomenon. Each a form of computation with emergent collective behaviour. In his book <em>Emergence</em>, for example, he described how brain functions or market behaviour could not be understood by simply summing up individual units; rather, the nonlinear interactions made the aggregate far more complex than the parts&#8203;. </p><p>Genetic algorithms represent a philosophy of intelligence. Holland believed that intelligence was an emergent property that bubbled up from competition. That idea is now at the core of the most successful approaches to AI. 
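</p><p>Holland&#8217;s rule-as-gene idea can be sketched directly. The toy classifier system below is an illustration, not his implementation &#8212; the <code>#</code> wildcard in rule conditions is borrowed from his ternary alphabet &#8212; but it shows the core mechanism: if-then rules carry strengths that rise with reward and fall without it:</p>

```python
import random

class Rule:
    """An if-then classifier: propose `action` when the input matches
    `condition`. '#' is a wildcard that matches either bit."""
    def __init__(self, condition, action):
        self.condition = condition
        self.action = action
        self.strength = 10.0

    def matches(self, state):
        return all(c in ('#', s) for c, s in zip(self.condition, state))

def step(rules, state, correct_action, reward=1.0, penalty=0.5):
    # Matching rules compete; the strongest acts and is scored on the result.
    matching = [r for r in rules if r.matches(state)]
    winner = max(matching, key=lambda r: r.strength)
    if winner.action == correct_action:
        winner.strength += reward   # useful rules are strengthened...
    else:
        winner.strength -= penalty  # ...weak ones fade and get replaced

# Toy environment: the right action is simply the first bit of the state.
rules = [Rule('1#', 1), Rule('1#', 0), Rule('0#', 0), Rule('0#', 1)]
random.seed(0)
for _ in range(200):
    state = random.choice(['00', '01', '10', '11'])
    step(rules, state, correct_action=int(state[0]))
```

<p>After a couple of hundred steps the two correct rules dominate; in a full classifier system the weakest rules would also be periodically replaced by genetic recombination of the strongest. 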
While many may not use genetic algorithms, they do embody Holland&#8217;s deeper lesson: intelligence emerges from systems that change.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Neuron Doctrine ]]></title><description><![CDATA[AI Histories #1: Ram&#243;n y Cajal]]></description><link>https://www.learningfromexamples.com/p/the-neuron-doctrine</link><guid isPermaLink="false">https://www.learningfromexamples.com/p/the-neuron-doctrine</guid><dc:creator><![CDATA[Harry Law]]></dc:creator><pubDate>Sat, 03 May 2025 11:26:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5zQR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today&#8217;s post is the first entry in a new series looking at important moments in the history of AI and the fields that influenced it. These pieces will be shorter than the weekly essays and will land in your inbox every Friday.</p><p>This week also marks the start of reader pledges. I&#8217;m testing the waters to see whether there&#8217;s a path towards working on this newsletter full time, with a goal of<strong> 100 pledges by May 31</strong>. The response has been even better than I had hoped, so thank you to everyone who has kindly pledged so far. For anyone considering supporting this project, this is the best moment to help it grow.  
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5zQR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5zQR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 424w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 848w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 1272w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5zQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png" width="1456" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1815202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162544372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5zQR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 424w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 848w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 1272w, https://substackcdn.com/image/fetch/$s_!5zQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9439248d-c41d-42c7-b861-944c2f37e171_1478x846.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Portrait Of Santiago Ram&#243;n y Cajal</em> by Joaquin Sorolla y Bastida (detail). </figcaption></figure></div><p>Ram&#243;n y Cajal loved to sketch. </p><p>As a boy, he was gripped by a <a href="https://thereader.mitpress.mit.edu/santiago-ramon-y-cajal-reflections-art-idealism-autobiography/">self-described</a> &#8216;irresistible mania&#8217; to draw. Whenever he saw a white wall, Cajal had no choice but to scribble on it &#8212; much to the chagrin of his parents who considered young Ramon&#8217;s interests in the arts to be something of an inconvenience. </p><p>The Spaniard used to cobble together enough pennies to buy art supplies and sneak out of the house. 
Sitting on a bank at the side of the road, he drew carts, horses, villagers, and anything else that took his fancy.</p><p>Born in 1852 in Petilla de Arag&#243;n in northeastern Spain, he grew up in a country still grappling with the aftershocks of the Napoleonic Wars that dominated the opening decades of the century.</p><p>Napoleon famously called the Iberian campaign the &#8216;Spanish ulcer&#8217;. It was a wound that festered long after the French withdrawal, one that only showed signs of healing in the wake of the &#8216;Glorious Revolution&#8217; of 1868, which ended the tumultuous reign of Queen Isabel II.</p><p>Spanish science picked up just as Cajal began to get a taste for medical practice thanks to his father. The senior Ram&#243;n was a barber-surgeon, a profession that, since medieval times, had handled minor surgery and bloodletting.</p><p>Over his father&#8217;s shoulder, Cajal learned anatomy through dissecting and drawing cadavers. It was a grisly experience, but one that put him on the path to a medical career.</p><p>Following a stint as an army doctor that included a tour during Cuba&#8217;s first war of independence, the surgeon returned to Spain in 1877 to pick up a doctorate from the Complutense University of Madrid.</p><p>After a decade that included a bout with a nasty strain of tuberculosis and entry into the <a href="https://es.wikipedia.org/wiki/Caballeros_de_la_Noche">Caballeros de la Noche</a> Masonic lodge, Cajal moved to the Faculty of Medicine at the University of Barcelona. </p><p>In Catalonia, the Spaniard first saw early plates of neurons treated with colouring compounds that looked like &#8216;Chinese ink on transparent Japanese paper.&#8217; What had been an &#8216;inextricable network&#8217; when stained with carmine and hematoxylin became &#8216;simple, clear, and unconfused.&#8217;</p><p>Gripped by what he saw, Cajal did what he liked most of all: he drew. 
</p><p>In May 1888 he used the staining process and his sketch pad to argue that brain tissue was not composed of continuous connections. This idea, which rejected the view favoured by Italian physician Camillo Golgi, relied on Cajal&#8217;s ability to observe discrete neurons rather than a continuous web of nerves and brain tissue.</p><p>When Golgi used the cell staining technique, he interpreted the black-reaction silhouettes as evidence that neuronal processes fuse into a single protoplasmic web&#8212;a &#8216;reticulum&#8217; of continuity&#8212;rather than remaining distinct cells that happen to touch. But Cajal, who tweaked the recipe to use younger brain tissue, noticed tiny breaks between the cells.</p><p>The disagreement was bad-tempered. When Cajal and Golgi shared the Nobel Prize in 1906, Golgi used the spectacle of the moment to drum up support for his position.</p><p>History sided with Cajal. Later confirmed by electron microscopy, the neuron doctrine eventually became the foundation of modern neuroscience.</p><h3>The neuron doctrine in AI</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I5rk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I5rk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!I5rk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!I5rk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!I5rk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I5rk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg" width="642" height="521.9512195121952" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:738,&quot;resizeWidth&quot;:642,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Santiago Ram&#243;n y Cajal - Linda Hall Library&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Santiago Ram&#243;n y Cajal - Linda Hall Library" title="Santiago Ram&#243;n y Cajal - Linda Hall Library" srcset="https://substackcdn.com/image/fetch/$s_!I5rk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!I5rk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!I5rk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!I5rk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95167015-541f-4754-aa90-ea6b6286ae2d_738x600.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Drawing by Santiago Ram&#243;n y Cajal.</figcaption></figure></div><p>AI is not neuroscience, and artificial neural 
networks are not brains. But that doesn&#8217;t mean that, throughout history, AI researchers haven&#8217;t looked to neuroscience for inspiration to make sense of their creations. </p><p>Cajal&#8217;s work allowed Nicolas Rashevsky, a Russian-American biologist who was instrumental in creating the field of mathematical biology, to establish a mathematical foundation for understanding neurons. </p><p>Throughout the 1930s at the University of Chicago, he <a href="https://onlinelibrary-wiley-com.ezp.lib.cam.ac.uk/doi/pdfdirect/10.1002/jhbs.1094">developed</a> equations to articulate the processes by which neurons interact with each other, describing the functioning of nerve cells in the language of mathematics. </p><p>Putting neurons into mathematical terms was a watershed moment: if real neurons could be described through equations, it followed that versions of their biological processes might be represented in artificial structures.</p><p>That moment came in 1943 with the introduction of the McCulloch-Pitts neuron. In their blockbuster paper, &#8216;<a href="https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf">A Logical Calculus of the Ideas Immanent in Nervous Activity</a>,&#8217; Warren McCulloch and Walter Pitts (a student of Rashevsky) built directly on Rashevsky&#8217;s work to create a mathematical model that acted as a simplified abstraction of a biological neuron.</p><p>These ideas eventually led to the development of Frank Rosenblatt&#8217;s famous perceptron algorithm, the rise of parallel distributed processing techniques in the 1980s, and ultimately the giant neural nets that make ChatGPT, Claude, and Gemini tick. </p><p>But Cajal&#8217;s influence on the development of artificial neural networks wasn&#8217;t only indirect. Even in the 1980s, prominent research groups were publishing AI papers complete with Cajal&#8217;s drawings. 
Writing in 1986, researchers from Bell Labs <a href="https://opg.optica.org/viewmedia.cfm?uri=OAM-1988-THX2&amp;seq=0">explained</a> that:</p><blockquote><p> &#8216;circuit considerations will probably limit digital computer cycle times to a few tenths of a nanosecond, speeds that are within an order of magnitude of today's fastest machines. <strong>The way to further increase computing power is through parallel computation</strong>.&#8217; </p></blockquote><p>To make their case, the authors reproduced a rendering of a neuron drawn by Cajal. They reasoned that neurons receive input signals from other neurons through their dendrites and sum those signals together, weighing each according to the strength of its synaptic connection, before firing an output whose strength depends on that weighted sum. </p><p>The neuron&#8217;s firing rate rises roughly in proportion to moderate input sums but saturates at a maximum or minimum level for extremely high or extremely low ones. When bundled into very large networks, the researchers argued, it is this computational structure that provides the complex information processing capabilities of the brain. 
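</p><p>That sum-then-saturate description maps onto a one-line model neuron. A minimal sketch &#8212; using <code>tanh</code> as the saturating function, an illustrative choice rather than the Bell Labs circuit:</p>

```python
import math

def neuron(inputs, weights, bias=0.0):
    # Dendritic inputs are summed, each weighted by its synaptic strength.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # tanh is roughly linear for moderate sums and flat at the extremes,
    # mimicking a firing rate that saturates at its maximum/minimum level.
    return math.tanh(total)
```

<p>For small weighted sums the output tracks the input almost linearly; push the sum far in either direction and the response pins near its limits. 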
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c4Sm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c4Sm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 424w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 848w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 1272w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c4Sm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png" width="904" height="440" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:904,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromexamples.com/i/162544372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c4Sm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 424w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 848w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 1272w, https://substackcdn.com/image/fetch/$s_!c4Sm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc0935-6d48-4743-bf08-d1a4c6d14793_904x440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Diagram of a neuron, originally drawn by Ram&#243;n y Cajal, reprinted in the Bell Labs paper &#8216;<a href="https://opg.optica.org/viewmedia.cfm?uri=OAM-1988-THX2&amp;seq=0">Electronic Neural Computing</a>&#8217; in 1986.</em></figcaption></figure></div><p>The weight of evidence eventually proved Cajal right, but he also told the best story by painting the best picture. The ink lines in his notebooks were persuading colleagues long before electron beams could. </p><p>Cajal reminds us that discovery is not just a matter of seeing but of showing. All things being equal, the scientist who can make plain truths pretty tends to be the one that wins. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromexamples.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.learningfromexamples.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>