Do most people really want to slow down AI?
Recent surveys show interesting results, but should be handled with caution
“Public opinion about AI can be summed up in two words: Slow. Down.” That was the eye-catching claim made by Vox in its Future Perfect newsletter earlier this month. The article carried other topline findings from a poll commissioned by the AI Policy Institute, reporting that “A whopping 72 percent of American voters want to slow down the development of AI” while a further 58 percent of voters want the technology to be “thoroughly” regulated.
The same article also reported that 76 percent of voters want AI-generated images to be required to contain proof they were created by a computer, and that 65 percent of voters support compelling developers to demonstrate that advanced AI models are safe before they are released.
The poll itself, the results of which are available here, was conducted by YouGov and surveyed 1,001 Americans in July of this year. Of those who responded to the dozens of questions administered online, 47 percent identified as Democrats and 40 percent as Republicans. There was a roughly equal split of respondents across five age groups from 18 to 65+, while the gender split was broadly 50:50. Some 73 percent of respondents identified as white, 12 percent as Black, and 7 percent as Hispanic (which is in the ballpark of the broader US national makeup according to 2023 census results).
Now, polls are rarely perfect reflections of reality, but you might think that this one is more or less representative of the country’s opinion writ large. Because 72 percent of American voters say they want to decelerate the pace of AI development, and because the sample seems large and maps well enough onto U.S. demographics, there must be an overwhelming public consensus that AI development ought to be slowed down. That’s it, case closed. Sorry folks.
Unfortunately, that isn’t how polls work.
Before we take a step back to understand why, though, I should say that I like Vox’s Future Perfect and my own position (unsurprisingly) is that governance is an essential part of making sure that we build AI in a way that lives up to its potential. I’m writing this piece because I think that we ought to scrutinise big claims and the methodology that they rely on (regardless of who is making them).
To do that, let’s start with the sample size. Generally speaking, with proper sampling techniques a sample of 1,000 should provide a fairly accurate representation of general public opinion. The key factor here is randomness, which ensures that every individual in the population has an equal chance of being selected. This minimises biases and ultimately makes the results more generalisable with respect to the entire population.
There are lots of ways to make sure that happens. Polls often employ random digit dialling, where telephone numbers are generated at random, ensuring both listed and unlisted numbers are included. Online surveys may use random sampling from large panels of participants in order to include a diverse range of respondents. There’s also the option of stratified sampling, where the population is divided into subgroups and individuals are randomly selected from each subgroup to achieve a more representative sample (more on that in a moment).
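To make the idea of stratified sampling concrete, here is a minimal sketch in Python. The age groups, population shares, and sampling frame are all invented for illustration; real pollsters work from far messier frames.

```python
import random

# Toy stratified sample: draw from each age group in proportion to its share
# of the population, rather than sampling the population as one big pool.
# All figures below are invented for illustration.
population_shares = {"18-29": 0.20, "30-44": 0.25, "45-64": 0.33, "65+": 0.22}

# A hypothetical sampling frame: 10,000 potential respondents per age group.
frame = {group: [f"{group}-person-{i}" for i in range(10_000)]
         for group in population_shares}

sample_size = 1000
sample = []
for group, share in population_shares.items():
    sample += random.sample(frame[group], round(share * sample_size))

print(len(sample))  # ~1,000 respondents, split across strata in population proportions
```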
The broader point, though, is that with a sample size of 1,000 respondents and perfect sampling, we typically get a margin of error of around 3% at a 95% confidence level. Here, the confidence level indicates the probability that the true population value falls within the range of values derived from the sample data: if we have a confidence level of 90%, we would expect the true value to sit within the confidence interval 90 out of 100 times were we to repeat the poll. The margin of error, meanwhile, tells us how far we'd expect the true figure to stray from what the poll reports. We can basically think of the confidence level as how sure we are, and the margin of error as the range around our poll's result within which the true answer is likely to lie.
According to the survey’s crosstabs (tables used to display the relationship between two or more categorical variables), the margin of error for this survey is 3.3% at a 95% confidence level, which is about what we would expect. This means the real opinion of the public could be up to 3.3% higher or lower than what the poll suggests, and the pollsters would expect a result within that range 95 times out of 100 if they ran the survey repeatedly. (As we shall see, however, a 3.3% margin of error on a survey question does not necessarily translate into understanding views about complex ideas like the ‘thorough regulation’ of AI.)
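For the curious, the back-of-the-envelope version of that calculation is easy to reproduce. The sketch below uses the textbook formula for a simple random sample; the survey's slightly higher 3.3% figure plausibly reflects an adjustment for the weighting described below, though that is my assumption rather than something stated in the crosstabs.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a simple random sample at a 95% confidence level.

    Uses p = 0.5, the 'worst case' proportion that maximises the margin.
    """
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(1001):.1%}")  # ~3.1% for 1,001 respondents
```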
The sample was weighted according to gender, age, race/ethnicity, education, and U.S. Census region based on voter registration lists, the U.S. Census American Community Survey, and the U.S. Census Current Population Survey, as well as 2020 Presidential vote. Respondents were selected from YouGov to be representative of registered voters.
— Details of the survey, available on page 84 of the crosstabs
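As a rough illustration of what weighting does, the sketch below applies simple cell-based weights by age group. The shares are invented, and YouGov’s actual procedure is considerably more sophisticated, so treat this as a toy model of the idea rather than a description of their method.

```python
# Toy post-stratification: weight each respondent by how under- or
# over-represented their group is in the sample relative to the population.
# All shares below are invented for illustration.
population_share = {"18-29": 0.20, "30-44": 0.25, "45-64": 0.33, "65+": 0.22}
sample_share = {"18-29": 0.15, "30-44": 0.22, "45-64": 0.35, "65+": 0.28}

weights = {group: population_share[group] / sample_share[group]
           for group in population_share}

for group, w in weights.items():
    print(f"{group}: weight {w:.2f}")  # e.g. each 18-29 respondent counts ~1.33 times
```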
But even with a strong confidence level and a low margin of error, there are still questions about how reliable the data remains as we focus on specific groups within the overall sample. The point here is that a 1,000-strong sample may not include enough participants from minority groups to provide statistically meaningful results for those subgroups. A larger sample not only reduces the margin of error; it also means that we can slice the data by particular subgroups and still have statistical confidence in the representativeness of the figures for different constituencies.
For instance, if we’re studying a topic that might affect younger people differently than older people, we might want enough participants in both age groups to make valid comparisons. A bigger sample ensures that even when the group is divided into subgroups, each has a sufficiently large sample size to analyse. Nuanced topics can also engender a wider range of opinions, which means that a larger sample captures this diversity more comprehensively and ensures that minority opinions within the topic aren’t overlooked or underrepresented.
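Reusing the margin-of-error formula from earlier makes the point starkly; the subgroup sizes below are hypothetical rather than taken from the survey’s crosstabs.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    # Textbook margin of error for a simple random sample at 95% confidence.
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical subgroup sizes within a 1,001-person sample.
for label, n in [("full sample", 1001), ("subgroup of 250", 250), ("subgroup of 70", 70)]:
    print(f"{label}: ±{margin_of_error(n):.1%}")
```

A 250-person subgroup already carries a margin of error of roughly ±6%, and a 70-person one closer to ±12%, which is why subgroup findings deserve extra scepticism.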
The lizardman in the room
So, if the sample size is big enough to give us a representation of the US population within the margin of error, we might say the poll is useful for gauging national opinion but perhaps not for understanding the perspectives of subgroups within the sample.
Not quite. There are a whole bunch of reasons that we can’t trust the responses to be representative of public opinion on complex issues. The four most important are related to question wording and order, how views change over time, oversimplification, and what we can describe as social desirability bias.
How we ask questions matters as much as what we ask. One question, for example, asked respondents whether they agreed or disagreed with the following statement: “Some policymakers are proposing that AI systems used in military applications should be internationally regulated, similar to nuclear weapons. Supporters say this could prevent AI warfare. Opponents say it may weaken national defence.”
There are a few things to say here, beginning with the construction of the question itself. It’s not all that clear to me whether ‘AI warfare’ means the use of AI by one nation against another, or the more totalising concept of a rogue superintelligence. You might scoff, but I would not underestimate the power of cultural iconography like The Terminator franchise to influence the way the public interprets such ambiguity.
Then there’s the complexity of the topic at hand. The governance of nuclear weapons relies on a web of bilateral and multilateral agreements, with different states and international organisations playing different roles in enforcement. But, crucially, while the spread of nuclear weapons is subject to the Non-Proliferation Treaty, the use of nuclear weapons (the framing of the question) is not. Treaties like New START govern the number or type of weapons certain countries hold, but individual countries set their own policies regarding first use.
Most troubling, though, is that regulation is presented as zero sum and is never defined within the bounds of the question, which means that people are likely to err on the side of caution when faced with a question that connects AI with nuclear weapons. The same criticism applies to a number of other questions throughout the survey, such as those that asked whether AI “could” cause a catastrophic event, “could” become more intelligent than humans, or “could” do a job similar to the respondent’s own. The reasonable answer in all of these cases is, of course, yes. But what does that really tell us about public opinion? What does it tell us about how to form policy?
While less relevant for our purposes, the order in which questions are asked also makes a difference to the responses we get. Issues here include context effects, like how a query on personal health can sway answers on healthcare policy; priming, as when a question about crime rates affects feelings on safety; question fatigue, which leads to hasty answers in long surveys; and consistency pressure, which pushes respondents to maintain stances taken early in the survey even if they change their mind later. I don’t think this survey is constructed in a way designed to harness these effects, but they are nonetheless factors that shape how the public tends to answer questions, and we ought to be aware of them.
Views also change over time. Opinions about a political leader, for example, might change after a significant policy announcement or a notable public event. In the context of this survey, we might hypothesise that recent media coverage about the potential dangers of AI will have shaped responses. Similarly, given the speed at which opinions can change, data from surveys can quickly become outdated. What might be a prevailing sentiment one week could be overshadowed the next. Take the COVID-19 pandemic, which saw public views on health measures like lockdowns shift after outbreaks of the virus. While this goes both ways, it’s important to recognise that opinion may shift depending on how AI is used in the future, and on how its impact is discussed within the cultural canon.
There is also the phenomenon known as ‘social desirability bias’, in which some people provide answers they believe are socially acceptable rather than their true opinion. (Related to this is the tendency of people to express an opinion in polls rather than admit that they don’t know.) In any case, social desirability bias represents the gap between what people genuinely think or do and what they report in surveys due to the influence of societal norms and the desire for approval.
Consider the question: “How likely do you think it is that an AI could accidentally cause a catastrophic event?” Respondents might be influenced by recent high-profile media stories about the dangers of unchecked AI, prominent figures warning about AI risks, or popular science fiction portrayals of AI going awry. Even if a respondent personally believes that with proper safeguards AI is unlikely to cause a catastrophe, they might lean towards a response of ‘very likely’. This could be because they feel that expressing caution is the socially responsible stance, or that doing otherwise might portray them as uninformed about potential technological risks.
Finally, before we wrap up, what article about polling would be complete without a mention of the Lizardman’s Constant? The term refers to the fraction of people in any given poll who give unexpected or bizarre answers. It is named after surveys that have asked whether respondents believe in "lizardmen" controlling the world, a claim to which a small but consistent percentage of people (often put at around 4 percent) respond affirmatively.
The Lizardman's Constant underscores the existence of noise inherent even in the most meticulously crafted surveys. Recognising that a certain fraction of respondents might provide bizarre answers means that we ought to approach results that fall within this bracket with caution. The Lizardman's Constant doesn't directly affect the calculated margin of error, but it is an additional source of potential inaccuracy for us to think about when analysing results. In essence, while the margin of error accounts for the inherent variability due to sampling, the Lizardman's Constant (and, after all, it is a constant) reminds us that there are other non-statistical factors that might affect the accuracy of our results.
Parting thoughts
Surveys are notoriously difficult to do. I don’t mean to be critical of those trying to understand public opinion of AI because, contrary to what the above might suggest, I think such work can play an important role in guiding governance decisions at the organisational, national, and international levels. What I am interested in, though, is the point at which these sorts of polling efforts are described as being representative of an “overwhelming bipartisan consensus”.
Hyperbole helps no-one. It risks oversimplifying highly complex questions in a manner that actually clouds, rather than clarifies, the debate. Such approaches might be successful in influencing policy in a way that leads to concerning outcomes over the long term (though for AI I suspect this is a problem that has yet to materialise).
Nonetheless, the survey, like many others recently released, such as an effort by the UK’s Ada Lovelace and Alan Turing Institutes, demonstrates a likely disconnect between what those building AI think and what the public believes (something Anthropic’s Jack Clark discussed in a recent edition of his newsletter). That all said, the scale of the disconnect, and the specific points of disagreement, remain unclear.
There’s much more work to be done to figure out what people really think about AI before we consider translating these perspectives into concrete governance measures. Questions about how to use public opinion, who we mean by the public, and what sorts of issues we ought to seek input on are yet to be answered.
Another consideration is how the firms building AI should use such data. The extent to which they ought to draw on survey results compared with other sources (such as information about the public’s actual use of AI or more in-depth consultations with minority groups) remains an open question. Some AI labs are accelerating efforts to incorporate public input via initiatives spanning alignment assemblies, community fora, and efforts to boost democratic deliberation.
Ultimately, surveys may play a role in helping different parties understand the broader landscape and challenge their own beliefs. Though difficult in practice, they may even have a role to play in informing the development of certain AI systems (possibly in combination with some of the other methods described above).
Whatever the use-case, though, survey data must always be approached with care. After all, polls are a mirror—not a crystal ball.