Lies, damn lies, and statistics
AI Histories #10: Abraham Wald and the origins of machine learning
If you’ve spent enough time on X, you’ve probably seen a picture of a plane riddled with red dots. Usually, it gets wheeled out to poke fun at someone for slipping on one of the internet’s favourite banana skins: paying attention to something that made it through a process while forgetting to ask what happened to the things that didn’t.
This ‘survivorship bias’ meme has its roots in the Second World War, when Allied statisticians studied aircraft returning from combat. Most had bullet holes in the wings and fuselage, with the engines conspicuously unscathed. The obvious solution was to reinforce the areas where returning aircraft had been hit, so future planes would be better protected.
Not everyone agreed with the proposed approach. We are told that the Hungarian mathematician Abraham Wald thought it better to armour the parts without bullet holes, reasoning that planes hit in those places were the ones that never made it back.
It’s a good story but the truth is messier.
Wald did work on aircraft survivability at Columbia’s Statistical Research Group, and he did help correct for missing data in the military’s analysis. But his contribution was a research project rather than a eureka moment. Over the space of a few weeks, he drafted a memo that corrected for missing data from planes that never returned and balanced statistical inference against the practical limits of aircraft drag and weight.
The effort was the product of the whole group at Columbia, where mathematicians, economists, and engineers went ten rounds with the results until they were happy with their conclusion. It’s an important distinction to make because it reminds us that the work is connected to a deeper intellectual legacy that we are still wrestling with today.
The punchline everyone remembers is used to illustrate the error of reasoning from what’s visible while forgetting what’s missing. But that wasn’t really Wald’s point. His work was more concerned with showing how to make good decisions when you don’t have all the data, and with choosing actions that minimise the cost of being wrong.
Profit and loss
Born in 1902 in what was then Austria-Hungary, Wald trained as a mathematician in Vienna with a generation of thinkers who were trying to formalise logic. In his 30s, he was forced to flee to the United States in the aftermath of the Nazi annexation of Austria.
Wald soon joined the Statistical Research Group at Columbia, a classified wartime think tank set up in 1942 where academics including Milton Friedman and George Stigler turned probability theory into military advantage.
The work was important but foggy. How should the Navy test the quality of munitions without wasting shells? How many samples were enough to catch defects in equipment? And, of course, how could you predict which parts of a bomber ought to be reinforced?
Wald’s response to these questions was to treat every problem as a matter of risk, cost, and incomplete knowledge. His big idea was simple enough: if you can’t eliminate uncertainty, optimise your decision by minimising your expected loss. In other words, weigh the possible mistakes you could make and choose the option that costs you the least on average. This became the core of what he eventually called statistical decision theory, which we can think of as betting wisely when we don’t know the odds.
In 1945, with the war winding down, Wald published a technical report about how to make decisions under uncertainty when classifying something into groups (say, whether a signal is from a friend or foe).
Crucially, it accounted for cases where the available evidence was uncertain and the cost of misclassification differed depending on the mistake. His solution was to choose the option that minimises the expected loss. You do that by considering all the possible ways you could be wrong, figuring out how likely each one is, and asking how much each would cost you.
Once you’ve done that, you pick the option with the lowest overall risk.
Wald’s move was to treat classification as a decision problem under uncertainty. He showed that if you knew the approximate distributions of the two groups and the cost of each type of error, then you could calculate the best way to label a new observation.
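To make that concrete, here is a minimal sketch in Python. The friend-or-foe probabilities and the cost figures are invented for illustration; the point is only the shape of the rule: average each decision’s cost over the ways it could be wrong, then pick the cheapest.

```python
# A minimal sketch of Wald-style expected-loss classification.
# The probabilities and costs below are illustrative, not from Wald's report.

# P(class | evidence): how likely the signal is from a friend vs a foe.
probs = {"friend": 0.7, "foe": 0.3}

# cost[decision][truth]: the price of each kind of mistake.
# Firing on a friend is assumed far costlier than letting a foe approach.
cost = {
    "treat_as_friend": {"friend": 0.0, "foe": 10.0},
    "treat_as_foe":    {"friend": 50.0, "foe": 0.0},
}

def expected_loss(decision):
    """Average the cost of a decision over every way it could be wrong."""
    return sum(probs[truth] * cost[decision][truth] for truth in probs)

# Pick the decision with the lowest expected loss, as described above.
best = min(cost, key=expected_loss)
print(best, {d: expected_loss(d) for d in cost})
```

With these made-up numbers, treating the signal as friendly carries the smaller expected loss; change the costs or the probabilities and the best decision can flip, which is exactly the trade-off Wald was formalising.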
After the war, Wald’s decision theory approach to classification was taken up by researchers across statistics and engineering. One direct successor was a 1951 paper by Evelyn Fix and Joseph Hodges at Berkeley, which framed pattern classification as a statistical task. They wondered how best to assign a new observation to one of several known classes given only a sample of labelled data.
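Their question can be made concrete with a toy rule: give a new observation the label of the labelled sample it most resembles. The data below are invented and the rule is a bare nearest-neighbour sketch, an illustration of the task rather than a reconstruction of their paper.

```python
# A toy sketch of classifying a new observation from labelled samples alone.
# The data points are invented; the rule labels the new point after its
# nearest labelled neighbour.

samples = [
    ((1.0, 1.2), "friend"),
    ((0.8, 0.9), "friend"),
    ((4.1, 3.9), "foe"),
    ((3.8, 4.2), "foe"),
]

def classify(point):
    """Return the label of the closest labelled sample (squared distance)."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(point, p))
    nearest = min(samples, key=lambda s: dist2(s[0]))
    return nearest[1]

print(classify((1.1, 1.0)))  # -> "friend"
```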
The field they were building would eventually be known as pattern recognition, and it grew into a small but serious research community in the decades that followed. As the area matured, attention shifted from hand-crafted rules to models that could learn those rules from data. That question, how to let the data determine its own decision boundary, sat at the heart of the nascent discipline and, ultimately, of machine learning.
By the 1970s, the pattern recognition crowd had formalised Wald’s insight into what they called ‘empirical risk minimisation’: you pick the rule that makes the smallest average mistake on the data you have. The Soviet theorists Vapnik and Chervonenkis famously used this idea to bound how well any classifier trained on finite data can be expected to generalise.
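A minimal sketch of the idea, with invented data: the candidate rules are simple thresholds on a single number, and we keep whichever one makes the fewest mistakes on the labelled examples we already have.

```python
# A minimal sketch of empirical-risk minimisation: among a family of simple
# rules (here, thresholds on one number), pick the one with the smallest
# average error on the data we actually have. The data are invented.

data = [(0.2, 0), (0.7, 0), (1.4, 1), (1.1, 0), (2.3, 1), (1.9, 1)]

def empirical_risk(threshold):
    """Fraction of examples the rule 'predict 1 if x > threshold' gets wrong."""
    mistakes = sum((x > threshold) != bool(y) for x, y in data)
    return mistakes / len(data)

# Candidate rules: one threshold halfway between each pair of neighbouring points.
xs = sorted(x for x, _ in data)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

best = min(candidates, key=empirical_risk)
print(best, empirical_risk(best))
```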
Around the same time, the doctrine of loss minimisation found its engine. In 1974, Paul Werbos described backpropagation, the calculus trick we discussed in AI Histories #6 for computing how every weight in a layered network should change to reduce a chosen loss.
When Rumelhart, Hinton and Williams reintroduced it in 1986, they gave neural networks a practical way to compute the loss gradient for every weight. In doing so, they turned Wald’s ‘minimise expected loss’ idea into an optimisation procedure for models with thousands (and eventually trillions) of parameters.
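A toy loop shows the connection. This is not backpropagation through a deep network, just a one-weight-and-bias linear model with a squared-error loss, trained by gradient descent on invented data; backpropagation is what lets the same downhill step be computed for every weight in a layered model.

```python
# A toy sketch of 'minimise expected loss' as an optimisation loop: compute
# how the loss changes with each weight, then nudge the weights the other way.
# Data and learning rate are invented.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, target) pairs
w, b = 0.0, 0.0           # model: prediction = w * x + b
lr = 0.05                 # learning rate

for step in range(500):
    # Average squared-error loss gradients with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w      # move each weight downhill on the loss surface
    b -= lr * grad_b

print(round(w, 2), round(b, 2))
```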
Wald is a curious figure in the history of thinking machines. With a handful of exceptions, his work is rarely discussed in the same context as the AI project we know today. He didn’t write code, didn’t talk about consciousness, and didn’t speculate about living beside machines smarter than we are.
What he did do was lay down the logic that modern AI still follows, the stuff that deals with how to make decisions when the data is noisy and the outcome matters. So when the plane meme next appears on your timeline, remember that the same logic that kept B-17 crews alive now guides the systems that millions of people use every single day.