By the second half of the nineteenth century, physicists knew that energy tended to even out. Hot things liked to cool down and gases expanded to fill the space they were in. Formalised as the second law of thermodynamics, this idea holds that a closed system’s entropy (often described as ‘disorder’) tends to rise as its energy spreads out.
That sounds like a force of nature, but it’s better understood as a way of characterising how systems behave when left to their own devices. If the world looks orderly to us, that’s just because we happen to be watching unlikely but possible states bubble up before they disappear.
At the core of this observation is the Boltzmann distribution, which gives the probability of a system occupying a state as a function of that state’s energy. Described by the Austrian physicist Ludwig Boltzmann in the nineteenth century, it says that low-energy states are more likely, and that high-energy ones become rarer as a system cools. Because high-energy states become more probable as the temperature rises, systems become more dynamic as heat increases.
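To make that concrete, here is a minimal Python sketch of the distribution. The energy levels and temperatures are invented for illustration, and Boltzmann’s constant is folded into the temperature:

```python
import numpy as np

def boltzmann_probabilities(energies, temperature):
    """Probability of each state: proportional to exp(-energy / temperature)."""
    weights = np.exp(-np.asarray(energies) / temperature)
    return weights / weights.sum()

# Illustrative energy levels in arbitrary units.
energies = [0.0, 1.0, 2.0, 3.0]
print(boltzmann_probabilities(energies, temperature=0.5))  # cold: the lowest-energy state dominates
print(boltzmann_probabilities(energies, temperature=5.0))  # hot: high-energy states become far more common
```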
What matters here is the claim that randomness has structure: even if you can’t follow every molecule in a glass of water, you can still know what kinds of configurations are likely. Put another way, the Boltzmann distribution is a way of thinking about systems in terms of tendencies rather than rules.
Spin glasses
A spin glass is a material made of minuscule magnetic units, called spins, each of which acts like a tiny compass needle pointing either up or down. In most magnets, the spins tend to align with each other, which creates a strong overall magnetic field. In spin glasses, the spins are subject to conflicting forces: some pairs want to align while others want to point in opposite directions, so there’s no arrangement that satisfies all of them at once.
The result is a magnetic deadlock in which the spins get stuck in a disordered pattern with no clear overall direction. The system becomes stable but messy, trapped somewhere between maximally ordered and chaotic states. The specific arrangements the spins get stuck in are called ‘local energy minima’, a term familiar to anyone who knows how connectionist AI systems like neural networks operate.
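The landscape is easy to sketch in code. The toy below assumes the standard spin-glass energy (pairwise couplings with mixed signs, drawn at random) and uses greedy single-spin flips purely for illustration; wherever the flipping stops is a local energy minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random symmetric couplings with both signs: some pairs 'want' to align,
# others 'want' to point in opposite directions.
J = rng.normal(size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

def energy(spins):
    # Standard spin-glass energy: E = -1/2 * sum_ij J_ij * s_i * s_j
    return -0.5 * spins @ J @ spins

# Greedy descent: flip any spin that lowers the energy until no single flip helps.
spins = rng.choice([-1, 1], size=n)
improved = True
while improved:
    improved = False
    for i in range(n):
        flipped = spins.copy()
        flipped[i] *= -1
        if energy(flipped) < energy(spins):
            spins, improved = flipped, True

print(spins, energy(spins))  # a stable but 'messy' arrangement with no overall direction
```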
Spin glasses neither collapse into randomness nor configure themselves into symmetrical states. They get stuck, but in a way that we can predict. For many scientists, this made them rich research subjects in their own right; for others, the idea of a system that stabilises without fully resolving reminded them of other natural phenomena.
One particularly resonant comparison came in 1982, when John Hopfield proposed a simple network of binary units each connected symmetrically to the others. The idea was that the Hopfield network could store and retrieve patterns by settling into multiple stable states, each of which corresponded to a memory. Rather than being guided by an external controller, it would recall what it had ‘seen’ by letting its internal dynamics find a familiar configuration.
That’s the core of the ‘associative memory’ idea behind the system: the network gradually adjusts its units until it lands in the stored configuration that best matches the input. A partial or noisy signal activates the system, and the network completes the pattern automatically.
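A minimal sketch of that behaviour is below, using a Hebbian storage rule of the kind Hopfield described, with the sizes and noise levels invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_hopfield(patterns):
    # Hebbian storage: symmetric weights, no self-connections.
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W / n

def recall(W, state, sweeps=20):
    # Each unit flips to match the sign of its input. Updates like this never
    # raise the network's energy, so the state slides into a stored pattern.
    state = state.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

pattern = rng.choice([-1, 1], size=32)
W = train_hopfield(pattern[None, :])

noisy = pattern.copy()
noisy[:8] *= -1                                    # corrupt a quarter of the bits
print(np.array_equal(recall(W, noisy), pattern))   # True: the network completes the pattern
```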
Hopfield didn’t claim this was how the brain actually worked, but he did show that you could treat a pattern recall problem like a physical relaxation problem. What had been a question about cognition became a question about finding low points in a landscape. In doing so, it offered a different model of intelligence, one that brought the tools of statistical physics into the world of computation.
Hopfield’s networks were clever but static. The architecture could store patterns, but the rules for how to update the weights were limited and biologically implausible. You could tweak the weights to embed a few memories, but you couldn’t easily make the system learn from data.
In 1985, Geoffrey Hinton, David Ackley, and Terry Sejnowski added noise to the Hopfield network. Instead of flipping deterministically into a new state, each unit in the network would switch on or off with a probability that followed the Boltzmann distribution. High energy states were unlikely and low energy ones were preferred. But now, unlike in Hopfield’s model, the system could find its way out of a local minimum if the temperature was high enough.
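In outline, a unit’s switching probability looks something like the sketch below; the input and temperature values are invented for illustration:

```python
import numpy as np

def p_on(net_input, temperature):
    # Probability the unit switches on: a logistic function of its weighted input,
    # flattened by temperature. As the temperature rises the choice approaches a
    # coin flip, which is what lets the system climb out of a local energy minimum.
    return 1.0 / (1.0 + np.exp(-net_input / temperature))

net_input = 2.0                            # illustrative weighted input to one unit
print(p_on(net_input, temperature=0.5))    # ~0.98: nearly deterministic, like Hopfield's units
print(p_on(net_input, temperature=10.0))   # ~0.55: noisy enough to take 'uphill' steps
```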
They called it the ‘Boltzmann machine’, and it learned through a slow but elegant rule for updating the network’s weights. First, you clamped the visible units to the data and let the hidden units adjust. Then you unclamped the system and let it run freely. You compared the two phases, looking at how often different configurations showed up in each, and adjusted the weights to reduce the gap. The goal was to make the model’s internal world reflect the structure of the real one.
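The shape of that rule is simple, even if the sampling that feeds it is not. The function below is a simplified illustration rather than the authors’ code: the ‘samples’ are stand-ins for states that would really be drawn by running the network to equilibrium, and details like biases and self-connections are glossed over:

```python
import numpy as np

def boltzmann_weight_update(clamped_states, free_states, learning_rate=0.01):
    # Raise weights between units that co-fire when the data is clamped on,
    # lower them between units that co-fire when the network runs freely.
    corr_clamped = clamped_states.T @ clamped_states / len(clamped_states)
    corr_free = free_states.T @ free_states / len(free_states)
    return learning_rate * (corr_clamped - corr_free)

# Stand-in binary samples; in a real run these come from Gibbs sampling at equilibrium.
rng = np.random.default_rng(2)
clamped = rng.choice([0, 1], size=(100, 6))
free = rng.choice([0, 1], size=(100, 6))
print(boltzmann_weight_update(clamped, free).shape)  # one adjustment per pair of units
```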
For a moment, it looked like Boltzmann machines might turn the field on its head. They had the ring of generality: rather than just memorising examples, they were learning to understand the distribution those examples came from. In a discipline still recovering from the failure of expert systems, that was an intoxicating promise.
Alas, they had some problems. Training full Boltzmann machines was slow and sampling took forever. You needed to reach equilibrium just to take a gradient step, and each new data point meant starting the process again. It was an elegant theory that couldn’t scale in reality, at least until Hinton found a workaround in 2002.
In this version of the Boltzmann machine, units within the same layer were prevented from communicating. Only visible-to-hidden links remained, which meant that, with one layer fixed, the units in the other were independent of each other, and that made sampling far easier. Instead of running to full equilibrium, you took only enough sampling steps to approximate the gradient, in a process called ‘contrastive divergence’.
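A rough sketch of a single contrastive divergence step (CD-1) on such a restricted machine might look like this; biases and other practical details are left out, and the shapes and names are mine rather than anything from the original papers:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, learning_rate=0.1):
    # Up: sample hidden units from the data.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down and up again: a single reconstruction instead of running to equilibrium.
    v1_prob = sigmoid(h0 @ W.T)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W)
    # Approximate gradient: data correlations minus reconstruction correlations.
    return W + learning_rate * (v0.T @ h0_prob - v1.T @ h1_prob) / len(v0)

# Toy run: 4 visible units, 3 hidden units, a small batch of binary 'data'.
W = rng.normal(scale=0.1, size=(4, 3))
v0 = rng.choice([0.0, 1.0], size=(10, 4))
W = cd1_step(W, v0)
print(W.shape)
```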
Stack a few of these ‘restricted’ Boltzmann machines on top of each other and you get what researchers call a ‘deep belief network’, where each layer learns to represent the structure of the one below it. In 2006, in one of the first concrete demonstrations that deep learning could work, Hinton and his collaborators showed that such a stack could achieve decent results on tasks like digit recognition.
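The stacking step itself can be sketched with scikit-learn’s BernoulliRBM standing in for the original training procedure; the data and layer sizes here are invented, and the 2006 recipe also included a fine-tuning stage that this sketch skips:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(4)
X = rng.choice([0.0, 1.0], size=(200, 64))    # stand-in binary data, e.g. thresholded pixels

# Greedy layer-wise stacking: each RBM is trained on the hidden activations
# of the one below it.
layer_input, layers = X, []
for n_hidden in (32, 16):
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=10, random_state=0)
    rbm.fit(layer_input)
    layers.append(rbm)
    layer_input = rbm.transform(layer_input)  # hidden features become the next layer's 'data'

print(layer_input.shape)  # (200, 16): a representation learned layer by layer
```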
That result primed the field just a few years before convolutional neural networks were retooled for the era of large datasets and GPUs. So in 2012, when AlexNet famously showed just how powerful massive neural networks could be, researchers were quick to recognise it as the moment the deep learning era arrived in force.
Cooling off
Today, there’s a small but serious group of researchers working on modern energy-based models, many of whom see Boltzmann machines as part of their prehistory. They’re trying to build tools that evaluate configurations rather than generate sequences, that score entire states rather than predict the next token. There’s a kernel of something interesting there.
But it’s also a space full of goofy handwaving. You hear about cognition as entropy minimisation. You hear about ‘thermodynamic computing’ and you start to notice that the more abstract the claim, the less likely it is to come with a working demo. Boltzmann’s name helps because it carries weight; people know it vaguely means something to do with probability and physics and systems finding balance.
But despite their relative lack of popularity, Boltzmann machines still matter to the history of AI. They might not have directly led to today’s most popular and powerful architectures, but they offered a particularly sharp version of a much older idea about the emergent nature of intelligence.
That idea was what made machine learning attractive from the start. What Boltzmann machines did was push it further, drawing directly from physics to provide a theory of learning as a thermodynamic process. Seen another way, the contribution of Boltzmann machines was more rhetorical than practical. Important, yes, but not because thermodynamic computing is going to replace large language models any time soon.