In March 1995 in Holmdel, New Jersey, three men put pen to paper on a pair of wagers. Larry Jackel speculated that the inner workings of neural networks would be revealed by the year 2000. The Soviet-born mathematical theorist Vladimir Vapnik forecast that, by 2005, ‘no one in his right mind will use neural nets that are essentially like those used in 1995.’ Yann LeCun provided the signature on both bets as an ‘official’ observer.
All three worked at the Adaptive Systems Research Department, an influential machine learning outfit at Bell Labs. As for the bets, received wisdom holds that Jackel and Vapnik were both wrong. The millennium came and went while a comprehensive understanding of the internal processes of artificial neural networks continued to elude researchers, and networks little changed from their 1995 designs remained in use well past Vapnik’s deadline.
Today, we know that the outcome of neither bet mattered much. Neural networks are king, and we still don’t really understand everything about how they work. The networks had made huge strides since the days of the single-layer perceptron (AI Histories #7), the emergence of the Hopfield network (AI Histories #3), and the popularisation of backpropagation in the 1980s (AI Histories #6), but their stratospheric rise was no sure thing in the 1990s.
One promising alternative was the support vector machine (SVM). Developed by Isabelle Guyon, Bernhard Boser, and Vladimir Vapnik in the early 1990s, the system promised a way to identify a reliable boundary between categories that could generalise well to new data.
Where neural networks learn by trial and adjustment, SVMs solve for the single optimal boundary from the start. Where the former stresses flexibility and scale, the latter is focused on precision, stability, and mathematical guarantees. It was an attractive combination, one that offered a mixture of reliability and interpretability at a moment when most learning systems were highly opaque and unstable.
From Russia with love
Vladimir Vapnik was a Soviet statistician whose relocation to Bell Labs in 1991 brought statistical learning theory into contact with American engineering practice. Born in 1936 in Tashkent, then part of the Uzbek Soviet Socialist Republic, Vapnik was educated at Uzbek State University before studying under the cybernetician Aleksandr Lerner at the Moscow Institute of Control Sciences.
He entered Bell Labs as a mature scholar whose intellectual formation had taken place almost entirely within the Soviet academy. At the Moscow Institute of Control Sciences, his collaborations with Alexey Chervonenkis had produced Vapnik–Chervonenkis (VC) theory, a mathematical framework for analysing the conditions under which models generalise from sample data to unseen cases.
The Moscow Institute of Control Sciences, founded in 1939 under the auspices of the Soviet Academy of Sciences, was the country’s principal centre for research in cybernetics, automation, and systems theory. By the 1960s it had become a hub for work on “automatic recognition,” the effort to design algorithms that could classify signals and images without human supervision. As Emmanuil Braverman, one of the researchers in this group, put it in 1966, its staff were concerned with the “problem of teaching the machine image recognition without teacher”.
The capacity of a model refers to its ability to fit a variety of functions, with a high-capacity model capable of fitting complex patterns and a low-capacity model better suited to fitting simpler patterns. The VC dimension measures capacity by identifying the largest number of points that a model can ‘shatter’ or classify according to every possible way of labelling them. Despite the knotty mathematics involved, the central insight behind VC theory was a simple one: a model’s ability to generalise is related to its complexity, not just its performance on training data.
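To make the idea of shattering concrete, here is a minimal sketch (assuming Python with scikit-learn; the points are toy values chosen purely for illustration) that checks whether a linear classifier can realise every possible labelling of a small point set.

```python
# A rough check of 'shattering': can a linear classifier realise every
# possible labelling of a set of points? Assumes scikit-learn is installed.
from itertools import product

import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if a (nearly) hard-margin linear SVM fits every labelling."""
    for labels in product([-1, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # single-class labellings are trivially realisable
        clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False  # found a labelling no straight line can realise
    return True

three_points = np.array([[0, 0], [1, 0], [0, 1]])         # non-collinear triple
four_points = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # XOR-style square

print(can_shatter(three_points))  # True: lines shatter three points in the plane
print(can_shatter(four_points))   # False: the XOR labelling defeats any single line
```

Because three points can be shattered and four cannot, the VC dimension of straight-line classifiers in the plane is three; the same counting argument, carried into higher dimensions, is what VC theory turns into bounds on generalisation.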
In this framing, capacity control pointed towards a conception of machine learning as a principled act of discovering the right predicates, such as invariances, symmetries, or structural constraints. It was a philosophy that set Vapnik’s work apart from the empirical pragmatism of American engineering, a school of thought that emphasised the adjustment of algorithms and the accumulation of heuristics rather than a search for universal principles.
Vapnik distrusted the interpretive shortcuts so common in applied work, later likening them to Antonie van Leeuwenhoek’s descriptions of blood cells as ‘armies’ fighting under a microscope: ‘He [Leeuwenhoek] saw something. Yes. But he gave wrong interpretation.’ In Vapnik’s eyes, only mathematics could disclose reality without distortion. In this sense, he carried into Holmdel the last great geometric faith of the Soviet control sciences. It was austere, axiomatic, and estranged from American pragmatism, but waiting to be made real.
Towards support vector machines
At its core, an SVM is a type of machine learning model that works by drawing the best possible line (or ‘hyperplane’) between different groups of data points. Support vector machines do not consist of layers of interconnected nodes, and they do not rely on the ‘non-convex optimisation’ through which a neural network adjusts to fit the data, a process akin to finding the lowest point in a landscape with many hills and valleys. Instead, support vector machines find the boundary between groups of data in a manner analogous to finding the bottom of a single, smooth bowl.
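As a rough illustration (a sketch in Python with scikit-learn, using made-up data rather than anything from the original work), fitting a linear SVM amounts to recovering a single hyperplane and the handful of points that pin it in place:

```python
# Fit a maximum-margin line to two toy clusters and inspect the result.
# Illustrative only: the data and parameters are arbitrary choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),  # cluster for class 0
    rng.normal(loc=2.0, scale=0.5, size=(20, 2)),   # cluster for class 1
])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The boundary is the hyperplane w.x + b = 0. Because the underlying problem
# is convex, refitting on the same data recovers the same unique solution.
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))
```

Only the support vectors, the points lying closest to the boundary, determine the final answer; every other example could be removed without changing it, which is where the method gets its name.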
For Vapnik, this geometric purity was the SVM’s defining virtue. Where neural networks relied on non-convex optimisation, SVMs posed a single convex problem with a unique global solution. In his later reflections, Vapnik was blunt about the contrast. With “deep learning,” he remarked, “they invent something. And then they try to prove advantage of that through interpretations, which [are] mostly wrong.”
By the early 1990s Bell Labs had its own settled approach to learning problems, one that flowed from a Western statistical tradition developed independently of the Soviet school. Where the Western tradition valued empirical pragmatism, stressing benchmarks and engineering heuristics, the Soviet school favoured axiomatic formalism and mathematically principled argument.
The Western tradition begins with Ronald Fisher in the 1930s, who showed how to classify data by drawing a line — a linear discriminant — that best separates groups. That gave the field its basic geometry: separating surfaces as a way to reason about data, a theme we discussed in AI Histories #16. After the war, Abraham Wald (the star of AI Histories #10) reframed inference as decision-making under uncertainty. His “decision theory” treated statistics as a dynamic process of minimising risk.
Through the 1960s–70s, this optimisation-driven approach was absorbed by engineers tackling pattern recognition and signal processing. Researchers like Thomas Cover and O.L. Mangasarian cast classification as a solvable optimisation problem. Duda and Hart’s famous 1973 Pattern Classification and Scene Analysis codified the Western field’s pragmatic style by laying out a toolbox of methods.
By the early 1990s, most of the technical contingencies were in place for the development of support vector machines: statistical learning theory and the VC dimension to control model capacity, optimal margin algorithms for finding decision boundaries, and methods to estimate relationships between variables based on random samples of data points.
When Vapnik arrived from Moscow in 1991, he entered a culture shaped by risk, optimisation, and empirical testing rather than axioms. It was a tradition that prized what worked over what could be proved, one that would later fuse with Soviet-style formalism through kernel methods — a class of algorithms that use kernel functions to operate in high-dimensional feature spaces without explicitly computing coordinates.
A kernel is a similarity function between two data points that satisfies certain mathematical properties, allowing it to stand in for an inner product (a function that takes two vectors and returns a single number) in a high-dimensional space. In 1950 the Polish-American mathematician Nachman Aronszajn established fundamental properties of reproducing kernels that would ultimately allow this mapping process to take place.
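A small numerical sketch (plain Python with NumPy; the quadratic kernel here is just a convenient example) shows what standing in for an inner product means in practice:

```python
# The kernel k(x, z) = (x . z)^2 equals an ordinary inner product taken after
# mapping each point into a three-dimensional feature space.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map matching the kernel (x . z)^2 in 2-D."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = np.dot(x, z) ** 2        # computed without leaving input space
mapped_value = np.dot(phi(x), phi(z))   # inner product in the mapped space

print(kernel_value, mapped_value)       # both print 16.0
```

The kernel reaches the same number without ever constructing the mapped vectors, which is the whole point once the implicit feature space becomes very large or even infinite-dimensional.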
The kernel trick
By 1992, Vapnik was working with Isabelle Guyon to translate the abstractions of statistical learning theory into a classifier with measurable generalisation performance. The challenge was how to build a system that could learn from examples (clue klaxon) without either overfitting — memorising the training set so well that it failed on new cases — or underfitting, failing to learn enough to solve the task at all.
In essence, the problem was how to regulate the “capacity” of a model so that it captured just enough structure to generalise beyond its training examples. The kernel trick — rediscovered in Holmdel through the collaboration of Guyon, Vapnik, and Bernhard Boser — supplied the missing piece. It allowed the abstract guarantees of statistical learning theory to be embodied in a classifier that engineers could use, providing a bridge between the theoretical space of capacity control and the empirical world of pattern recognition.
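Capacity control is easy to see on toy data. The sketch below (Python with scikit-learn; the dataset and parameter values are arbitrary choices) sweeps the width of an RBF kernel and watches training and test accuracy pull apart:

```python
# Underfitting vs. overfitting as kernel capacity increases. Illustrative only.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):  # low, moderate, and high capacity
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma:<6} "
          f"train={clf.score(X_train, y_train):.2f} "
          f"test={clf.score(X_test, y_test):.2f}")

# Typically the low-gamma model underfits (both scores are middling), the
# high-gamma model overfits (training accuracy near 1.0, test accuracy drops),
# and the middle setting generalises best: capacity matched to the data.
```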
A central idea behind the kernel trick is the concept of duality, which shows how different types of classifiers can be viewed as ‘dual representations’ of the same decision function. The same classification problem can be written in ‘primal space’, where the classifier is described by a weight on each feature, or in ‘dual space’, where it is described by a weight on each training example and the data enter only through inner products. This matters because the algorithm can switch between the two forms depending on which is more computationally efficient for a given problem, and because the dual form is exactly where a kernel can be substituted for the inner product.
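To ground the dual view, here is a sketch (Python with scikit-learn; the data and kernel settings are arbitrary) that rebuilds a fitted SVM’s decision function by hand from nothing but its support vectors, their dual weights, and the kernel:

```python
# Rebuild an SVM's decision function from its dual representation:
# f(x) = sum_i (alpha_i * y_i) * k(x_i, x) + b, summed over support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

GAMMA = 0.5  # RBF kernel width, fixed so we can reuse it below

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=GAMMA).fit(X, y)

def dual_decision(x):
    """Evaluate the classifier using only support vectors and dual weights."""
    k = np.exp(-GAMMA * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return float(np.dot(clf.dual_coef_[0], k) + clf.intercept_[0])

x_new = X[0]
print(dual_decision(x_new))               # hand-rolled dual evaluation
print(clf.decision_function([x_new])[0])  # the library's answer: same value
```

Nothing about weights on individual features appears anywhere; the classifier is expressed entirely as weights on training examples, which is what allows a kernel to take the place of the inner product.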
The moment at which support vector machines were developed looms large in ML mythology. The story begins with Bernhard Boser’s decision to leave Bell Labs in 1991 for a position at UC Berkeley. Boser, a hardware designer, was unable to start a new project in the months between concluding his work at Bell Labs and taking up the post in California. Instead, he chose to implement an algorithm of Vapnik’s, developed in the 1960s, which sought to find the best boundary separating different groups of data points.
Once the implementation was complete, Vapnik proposed making the algorithm ‘nonlinear’ so that the model could handle data that cannot be separated well by a straight line. But where Vapnik advocated solving this problem with an explicit ‘polynomial’ approach, Guyon had a different idea. Instead of explicitly creating new polynomial features, she proposed using the ‘kernel trick’ based on work by Duda and Hart (and described independently by the trio of Aizerman, Braverman, and Rozonoer in Russia).
It was this approach that led to the emergence of the support vector machine as it is commonly understood today. Guyon, Boser, and Vapnik published details of the kernelised algorithm at the Fifth Annual Workshop on Computational Learning Theory (COLT ’92). Reflecting in 2016 on the development of the support vector machine, Guyon described an initial hesitance on Vapnik’s part, which stemmed from the potential function algorithm’s origins with a rival group at the Moscow Institute of Control Sciences:
“After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the ‘potential function’ algorithm. Vladimir initially resisted the idea because the inventors of the ‘potential functions’ algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960’s in Russia! But Bernhard tried it anyways, and the SVMs were born!”.
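The contrast in that account can be sketched in a few lines of Python with scikit-learn (parameter choices are illustrative, and the XOR data are the classic example of points no straight line can separate): build the product features explicitly and run a linear machine, or keep the raw data and let a polynomial kernel do the expansion implicitly.

```python
# Explicit polynomial features versus the kernel trick on XOR-style data.
# Both routes learn a degree-2 boundary; only the first materialises the
# expanded feature vectors. Parameters here are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR: no single straight line separates the classes

# Route one: construct products of features by hand, then fit a linear SVM.
explicit = make_pipeline(PolynomialFeatures(degree=2), SVC(kernel="linear", C=1e3))
print(explicit.fit(X, y).score(X, y))    # 1.0

# Route two: leave the data alone and use a degree-2 polynomial kernel.
kernelised = SVC(kernel="poly", degree=2, coef0=1.0, C=1e3)
print(kernelised.fit(X, y).score(X, y))  # 1.0
```

On four points the difference is cosmetic, but with high-degree kernels or high-dimensional inputs the explicit expansion becomes enormous, while the kernelised version never has to build it at all.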
The emergence of the support vector machine marked the culmination of decades of theoretical and practical advancements in pattern recognition, statistical learning theory, and optimisation techniques. At Bell Labs, the collaboration between Boser, Vapnik, and Guyon brought these disparate threads together. Boser’s implementation of Vapnik’s optimal margin algorithm provided a starting point, while Vapnik’s proposal to add nonlinearity sought to address the challenge of complex data distributions.
Ways of learning
What Boser, Vapnik, and Guyon achieved in Holmdel was the blending of two intellectual cultures. From the Soviet side came the abstractions of VC theory and structural risk minimisation, with their insistence on general principles and theoretical bounds. From the Western side came a tradition of pattern recognition rooted in empirical performance, approximation methods, and the willingness to bend mathematics to fit messy data.
The result was a machine in which the theoretical guarantees of convex optimisation and margin maximisation coexisted with the practical imperatives of implementation and performance. Within Bell Labs’ institutional culture, this interaction demonstrated that ideas forged in the high formalism of the Soviet control sciences could be translated into efficient tools for American industry.
The support vector machine represented the point at which theory and practice, abstraction and application, converged in code in an office in New Jersey. Its development marked the closing chapter of a geometric conception of intelligence that had defined the twentieth century, one that imagined learning as the discovery of stable forms and separating surfaces in high-dimensional space.
This is why Vapnik bet Larry Jackel that artificial neural networks were a dead end. The wager, a bit of fun but entirely sincere, expressed divergent conceptions of what “learning” meant. For LeCun, who witnessed the bets, intelligence was a matter of distributed adaptation: systems that adjusted their weights through experience until useful representations emerged. For Vapnik, it was an exercise in geometry and proof; his approach, as he explained it, stressed finding and formalising axioms.
While deep learning eventually proved triumphant, the split made visible the moment when the field’s centre of gravity shifted from the geometric to the statistical, from global optima to local gradients, from the certainty of separability to the fluidity of representation. The bet is a hinge in the history of artificial intelligence, a moment that divided an older tradition of mathematical certainty from a new era defined by probabilistic depth and empirical abundance.