In the early 1920s, Ronald Fisher put eight porcelain cups on a garden table at Rothamsted agricultural research station in Hertfordshire. Four had the milk poured first and four had the tea poured first. Muriel Bristol, a biologist who insisted she could taste the difference, sipped and sorted while Fisher looked on. She called all eight correctly.
Fisher knew that chance alone would produce a perfect score about once in every seventy tries, but he also knew that her claim could only be tested under conditions that removed any hidden patterns in the set-up. So long as the order of the cups was randomised and the observations recorded accurately, you could in principle formalise the whole procedure as a series of steps to follow. We might call it an algorithm.
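To see where that one-in-seventy figure comes from, here is a minimal sketch in Python (my illustration, not Fisher’s own working): the taster knows that exactly four cups are milk-first, so her answer amounts to choosing four of the eight, and only one of the seventy possible choices matches the true arrangement.

```python
from math import comb

# The taster knows four of the eight cups had milk poured first,
# so her answer is a choice of 4 cups out of 8.
n_cups, n_milk_first = 8, 4

# Number of equally likely ways to pick which four cups were milk-first.
possible_answers = comb(n_cups, n_milk_first)  # 70

# Exactly one of those choices matches the true arrangement.
p_all_correct_by_chance = 1 / possible_answers
print(f"{possible_answers} possible answers, "
      f"P(all correct by guessing) = {p_all_correct_by_chance:.4f}")  # ~0.0143
```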
In the 1950s, AI grandee Arthur Samuel borrowed the same idea for his checkers program. He let the computer occasionally play random moves in the opening, giving it clean, unbiased samples of board positions before it started learning from them. It’s a core idea behind even the biggest and best machine learning systems, one that lets them see enough of the world to hoover up the right kinds of patterns.
Randomisation, formalised
In The Design of Experiments, published in 1935, Fisher described the rule: if you’re going to compare two treatments, you must assign them to plots at random. Not roughly evenly and not by rotation. Randomly. Because if you don’t assign things at random, you can’t tell whether the result is due to the treatment or something else you didn’t control.
Maybe one side of the field gets more sun. Maybe the soil is drier in one patch than another. Maybe the experimenter gives a bit more attention to the first group, or unconsciously expects it to do better. Randomisation ensures that, on average, any such differences are spread evenly between the groups. That way, if you do see a difference in outcome, you can be more confident it came from the treatment rather than from something else you didn’t account for.
Fisher’s method for testing whether a treatment made a difference — what we now call a significance test — depends on knowing how likely each outcome was, assuming the treatment had no effect. But you can only know that if the treatments were assigned by chance. Without that, there’s no fixed set of possibilities to compare your result against.
In this sense, randomisation is the element that makes the test possible. When engineers built systems that experimented on themselves, they copied the structure Fisher had laid down. Randomise the action, observe the outcome, and ask if the difference was larger than chance. Even today, that is the basic logic that lets machines learn by trial and error without fooling themselves.
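A minimal sketch of that logic, using made-up plot yields for two randomly assigned treatments (the numbers and group sizes are mine, purely for illustration): shuffle the labels many times and ask how often chance alone produces a gap as large as the one observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yields for plots assigned at random to treatment or control.
treated = np.array([4.8, 5.1, 5.6, 4.9, 5.3])
control = np.array([4.2, 4.6, 4.4, 5.0, 4.5])
observed_diff = treated.mean() - control.mean()

# Randomisation test: if the treatment did nothing, the labels are arbitrary,
# so reshuffling them shows how large a difference chance alone produces.
pooled = np.concatenate([treated, control])
n_shuffles = 10_000
at_least_as_large = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = pooled[:len(treated)].mean() - pooled[len(treated):].mean()
    if diff >= observed_diff:
        at_least_as_large += 1

print(f"observed difference: {observed_diff:.2f}")
print(f"p-value under the null of no effect: {at_least_as_large / n_shuffles:.4f}")
```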
In 1922, Fisher published a paper that reshaped how statistics was done. Up to that point, most estimates came from algebraic convenience or common sense. Fisher replaced both with another rule: if you want to estimate an unknown quantity in your model, choose the value that makes the observed data most likely. That rule became known as maximum likelihood.
Maximum likelihood defined a way of thinking where you take a model, plug in the data, and read off which version of the model fits best. That principle now sits under almost every statistical model in AI. Classifiers, regressors, and language models are all trained by adjusting parameters to maximise likelihood, or minimise its negative log. That’s what people mean when they talk about minimising a loss function, whose roots we discussed in more detail in AI Histories #10.
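As a toy illustration of that link between likelihood and loss (my example, not drawn from Fisher’s paper): estimating a coin’s bias by minimising the negative log-likelihood of a Bernoulli model lands on the familiar answer, the observed frequency of heads.

```python
import numpy as np

# Ten hypothetical coin flips (1 = heads); the unknown parameter is P(heads).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(p, data):
    # Negative log-likelihood of a Bernoulli model: the "loss" being minimised.
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Scan candidate values of p and keep the one that makes the data most likely.
candidates = np.linspace(0.01, 0.99, 99)
losses = [neg_log_likelihood(p, flips) for p in candidates]
best = candidates[int(np.argmin(losses))]

print(f"maximum-likelihood estimate: {best:.2f}")   # ~0.70
print(f"sample frequency of heads:   {flips.mean():.2f}")
```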
The same paper introduced something he called the information of a parameter, which measured how sharply the likelihood function peaked around the best guess. A steep peak meant high confidence while a flat one meant you weren’t learning much. I won’t say much about this point, but it turned out to be an important mathematical object in machine learning that we now refer to as the Fisher information matrix.
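Continuing the coin-flip toy from above (again my own illustration, not Fisher’s): the curvature of the negative log-likelihood at its minimum is the observed version of that quantity. A sharply curved dip means the data pin the parameter down tightly; a shallow one means they barely constrain it.

```python
import numpy as np

def bernoulli_nll(p, heads, n):
    # Negative log-likelihood of observing `heads` heads in `n` flips.
    return -(heads * np.log(p) + (n - heads) * np.log(1 - p))

heads, n = 7, 10
p_hat = heads / n  # maximum-likelihood estimate

# Second derivative of the NLL at the MLE, estimated by finite differences:
# steeper curvature = a more sharply peaked likelihood = a more precise estimate.
eps = 1e-4
curvature = (bernoulli_nll(p_hat + eps, heads, n)
             - 2 * bernoulli_nll(p_hat, heads, n)
             + bernoulli_nll(p_hat - eps, heads, n)) / eps**2

# For the Bernoulli model, the exact observed information at the MLE is n / (p(1-p)).
print(f"numerical curvature: {curvature:.1f}")
print(f"exact n/(p(1-p)):    {n / (p_hat * (1 - p_hat)):.1f}")
```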
A few years later in 1930, Fisher published The Genetical Theory of Natural Selection. It was a dense, mathematical book whose key idea was that the rate at which a population’s average fitness improves is equal to the amount of genetic variance in fitness it holds.
He built models to show what that looked like over time. Around the same time, the American geneticist Sewall Wright was developing a parallel description of genetic drift. The model now named after both of them, the Wright–Fisher model, captures how allele frequencies change across generations under selection, mutation, and random drift. The model was meant for biology, but it also became the blueprint for the genetic algorithms we looked at in AI Histories #2.
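A minimal sketch of that model in its simplest, neutral form (no selection or mutation, an assumption of mine to keep the example short): each generation is a binomial draw from the previous one, and the allele frequency drifts until variation is lost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Neutral Wright-Fisher model: N individuals carry 2N gene copies, and each
# new generation is formed by sampling copies at random from the current one.
N, generations = 50, 200
freq = 0.5  # starting frequency of one allele

trajectory = [freq]
for _ in range(generations):
    copies = rng.binomial(2 * N, freq)  # random drift as binomial sampling
    freq = copies / (2 * N)
    trajectory.append(freq)
    if freq in (0.0, 1.0):  # one allele has fixed; the variation is gone
        break

print(f"final frequency: {trajectory[-1]:.2f} "
      f"after {len(trajectory) - 1} generations")
```

Run it a few times with different seeds and the frequency tends to fix at 0 or 1 surprisingly quickly, which is exactly the loss of variance the next paragraph turns to.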
Fisher’s theorem said that progress depends on maintaining variance, but the Wright–Fisher model showed how quickly variance disappears. That’s still a core challenge in evolutionary computation: how to keep exploring long enough to find something new, without getting stuck on the same hill forever.
In 1936, Fisher analysed measurements from three species of iris (petal length, sepal width, and so on) and asked whether the species could be separated based on those numbers alone. The method he used became known as ‘linear discriminant analysis’ or LDA.
The idea was to find a single direction through the data (one is enough when there are two classes) that kept each species tightly grouped while pushing the groups as far apart as possible. To classify a new flower, you project its measurements onto that line and check which side it falls on.
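A minimal two-class sketch of that idea (using made-up Gaussian clusters in place of the real iris measurements, so the numbers are illustrative only): the discriminant direction comes from the pooled within-class scatter and the difference of the class means, and classification is a projection plus a threshold.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two made-up classes of 2-D measurements standing in for two iris species.
class_a = rng.normal(loc=[5.0, 3.4], scale=0.3, size=(50, 2))
class_b = rng.normal(loc=[5.9, 2.8], scale=0.3, size=(50, 2))

# Fisher's linear discriminant: w = S_w^{-1} (m_a - m_b), where S_w is the
# pooled within-class scatter. Projecting onto w pushes the class means apart
# relative to the spread within each class.
m_a, m_b = class_a.mean(axis=0), class_b.mean(axis=0)
s_w = (np.cov(class_a, rowvar=False) * (len(class_a) - 1)
       + np.cov(class_b, rowvar=False) * (len(class_b) - 1))
w = np.linalg.solve(s_w, m_a - m_b)

# Classify a new point by which side of the midpoint its projection lands on.
threshold = w @ (m_a + m_b) / 2
new_point = np.array([5.1, 3.3])
label = "a" if w @ new_point > threshold else "b"
print(f"projection {w @ new_point:.3f}, threshold {threshold:.3f} -> class {label}")
```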
By the 1950s and 1960s, LDA was well known to many of the new pattern recognition groups at Bell Labs, MIT, and Lincoln Laboratory. Researchers used it to classify phonemes, radar blips, and handwriting. In Duda and Hart’s 1973 textbook, which was something like a holy text for connectionist researchers well into the 1980s, it’s the first real classifier discussed.
Drawing a line
In 1933, Ronald Fisher was appointed to the Galton Chair of Eugenics at University College London. He had already spent a decade arguing that Britain’s falling birth rates were a threat to ‘national fitness,’ and that differential reproduction across social classes would lead to civilisational decline. As late as the 1950s, he was still writing letters defending sterilisation policies and publishing essays warning of social degeneration.
Fisher thought statistics was relevant to politics, and the models he built in genetics — about selection, fitness, and variance — fed into the arguments he made about society. He believed that mathematical structures could uncover the natural order of things, and that once uncovered, they ought to be preserved.
As head of the Galton Laboratory, he helped steer British research into human heredity through the middle of the 20th century. Some of the datasets, measurement protocols, and study designs he left behind were later used to support claims about intelligence and class.
But his work has been enormously influential in many other less controversial areas. When researchers study algorithmic bias today, for example, they draw on the same theoretical foundations Fisher developed. Fairness audits use his work to measure whether an outcome is evenly distributed across groups, and significance thresholds still rest on the logic of his null-hypothesis framework.
Some of Fisher’s ideas are deeply disagreeable, but others are foundational to scientific practice. They live on in ways he never could have imagined, often in pursuit of goals he might have opposed. The lesson, if there is one to be had, is not that technology is neutral or that it is hopelessly corruptible. In fact, it is technology’s value-laden nature that lets us scrutinise it, shape it, and put it to work in a way that is commensurate with our own belief systems.
Perhaps the most important part of understanding any statistical method, then, is understanding the philosophy of the person who built it: why they believed that particular method was useful, and how they thought the data ought to be analysed if it was to tell the truth.