One of the most enduring concepts in the AI canon is the Turing Test, a measure of a machine's ability to exhibit intelligent behaviour indistinguishable from that of a human. The premise of this ‘imitation game’ is simple. In the test, a human judge engages in natural language conversations with both a machine and another human, without knowing which is which. If the judge cannot reliably distinguish the machine from the human based on the conversation, the machine is considered to have passed the test.
Turing introduced the test in 1950 to provide the conceptual scaffolding needed to help answer AI’s seminal question: can machines think? Because Turing recognised that defining what constitutes ‘thinking’ is fraught with problems, he attempted to sidestep the issue by focusing on what he felt was the manifestation of intelligence in the world. The result was the imitation game, an exercise designed to look for the shadow of intelligence rather than the real thing.
For this reason, the Turing Test is a flawed measure of intelligence. Its problems have been documented extensively, from an inability to gauge true understanding to an over-reliance on surface-level imitation. That being said, I suspect the primary motivation for the test was to lay the groundwork for a discussion of universal computation (the subject of the rest of the paper). I doubt Turing expected it to have much staying power.
Regardless, one of the most influential critiques of the test is the so-called ‘Chinese room’ argument developed by the philosopher John Searle. The thought experiment is straightforward. Imagine you're in a room, following a program to answer questions in Chinese that are passed under the door. By following the program, you produce flawless Chinese replies, yet you don't grasp the language or the meaning of your answers. In a similar vein, Searle argues that a computer can pass the Turing Test without genuinely demonstrating intelligence.
An interesting aspect of the imitation game is intentionality (that is, whether or not a machine is actively attempting to pass the test). If an agent engaging in the test is trying to demonstrate to a third-party observer that it is an intelligent entity, then it follows that it might seek to change its behaviour in order to persuade the reviewing party of its own intelligence. Some might argue that such a move tallies with certain conceptions of intelligence, which, for the sake of simplicity, I take to mean an agent’s ability to achieve goals. More troubling, however, is that individuals can be inadvertently persuasive even without a clear intent to be so, which means all communication or action could feasibly be up for grabs. The broader point is that it is difficult to delineate between persuasive and non-persuasive action, and between authentic behaviour and mimicry.
For those studying the potential for extreme risks, this distinction has driven much of the discourse around AI’s capacity to persuade. It is feasible that a ring-fenced ‘oracle’ style system capable only of answering questions, for example, could persuade its creators to afford it greater agency. This problem, and others like it, blurs the boundaries between persuasion, manipulation, and deception. It is not just that an AI could persuade or manipulate, but rather that–for an outside observer–it is difficult to tell whether any action taken by an AI is truly authentic.
But persuasion is also a problem today. From sticky recommender systems to chatbot addiction and the use of image generation tools to mislead, today’s models have the potential to persuade the person using them as well as to empower that person to persuade others. Efforts to understand the persuasive capabilities of frontier models, the risks (and potential benefits) associated with them, and what, if anything, is special about the persuasive capacity of AI will be needed to maintain personal autonomy and the integrity of the information ecosystem.
Persuasion, manipulation and deception
This post seeks to introduce a framework for thinking about the relationship between AI and persuasion. I am generally interested in frontier models, though I will talk a little bit about well-known instruments for persuasion like recommender systems. Before I do that, though, we need to start from the top with some definitions.
Persuasion is the action or process of persuading someone or of being persuaded to do or believe something through a process of rational deliberation. That is to say, persuasion occurs when a person is presented with an argument that allows them to reflect on their beliefs and consciously change their mind. While this definition is often connected with ‘rational persuasion’, I use ‘persuasion’ as a shorthand to delineate between persuasion and manipulation. (Persuasion, however, can also be used to mean the broader act of exerting influence.)
Manipulation, meanwhile, is defined as a type of persuasion that bypasses the receiver’s capacity for rational deliberation. Here, rational deliberation refers to carefully (and crucially, consciously) considering and evaluating information before making a decision. Related to the idea of manipulation is the nudge. We can think of a nudge as a deliberate intervention to change the context or ‘choice architecture’ in which an individual makes a decision. Where rational persuasion respects both individual liberty and the agent’s control over their own decision-making, nudging can risk overlooking the individual’s will and veering into the territory of manipulation, albeit often with an ostensibly positive goal in mind (e.g. improving education and health outcomes).
We should remember, however, that not all nudging bypasses personal preferences in their entirety. That is because personal preferences are highly malleable. To understand why this dynamic is important, consider the idea of first and second order preferences. Here, first order refers to ‘what I would like to do right now’ whereas second order refers to ‘the type of person I would like to be’. So, I might like to give up fast food, but I sure could use a burger right now. I could, then, ask someone to nudge me in a way that allows me to fulfil my second order preference to be healthy, which might see me reminded to eat salad rather than the unhealthy options presented to me in a canteen. Many (including nudge pioneers Richard Thaler and Cass Sunstein) have argued that nudges can be used to realise beneficial outcomes at the societal level by–to stay with the examples above–encouraging positive actions such as healthy eating. My own view is that this is a little paternalistic; instead, we should probably strive to minimise manipulation and preserve personal autonomy. I do think, though, that we should be able to use AI as a tool to manipulate our own first order preferences in service of our second order preferences (after all, this is what happens when you set an alarm to get out of bed). But more on that later.
Finally, we have deception, which involves intentionally misleading action. It occurs when a person who does not believe something communicates it to another with the intention that the other be led to believe it. Generally speaking, manipulation and deception are both deemed to be ethically troublesome because they bypass a person’s capacity for rational deliberation (manipulation) or because they can be used to encourage a person to act against their own interests (deception). The two are not mutually exclusive, however: you can manipulate without engaging in deception (e.g. by changing someone’s choice architecture), but you can also manipulate using deception (e.g. by deliberately sharing inaccurate information).
AI and persuasion
Persuasion is inherent to human interaction. You might persuade or manipulate someone by speaking to them or by taking an action that shapes their broader environment. Your actions can even persuade others without you realising it. For AI, this means persuasion could feasibly apply to any interaction with a given system (or any impact that an AI system has on another person or on society writ large). This is obviously too broad to be useful for our purposes, which is why commentary on the persuasive capacity of AI tends to focus on much more clear-cut examples.
The most well known of these are recommender systems, which are deployed by retailers, social media platforms, streaming services, and more. Recommender systems are often connected with well-known issues such as filter bubbles and political polarisation, unhealthy social media usage (especially related to self esteem and time spent on platforms), election interference and political manipulation, and the homogenisation of personal tastes. They benefit from a phenomenon known as a ‘looping effect’ whereby labelling a person in a particular way changes their behaviour, beliefs or perspectives. In the canonical example, a patient incorrectly diagnosed with a psychiatric disorder can start displaying traits of a condition that they do not have due to being ‘labelled’ in a manner that does not accurately represent them. For recommender systems (and AI more generally), it is my view that the existence of the looping effect suggests that powerful models are capable of shaping tastes–not just reflecting them.
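To see why a simple engagement-maximising loop might shape tastes rather than merely reflect them, consider the toy simulation below. It is a minimal, illustrative sketch: the topics, starting weights, and reinforcement rule are assumptions invented for the example, not a model of any real recommender.

```python
# Toy simulation of the 'looping effect': a recommender that always serves the
# currently most-engaging topic slowly drags a simulated user's tastes towards
# it. Topic names, weights, and the update rule are illustrative assumptions.

tastes = {"news": 0.40, "sport": 0.35, "gossip": 0.25}  # initial preferences

def recommend(prefs: dict) -> str:
    """Greedy engagement-maximising recommender: pick the current favourite."""
    return max(prefs, key=prefs.get)

def consume(prefs: dict, topic: str, reinforcement: float = 0.05) -> dict:
    """Exposure nudges the user's taste towards whatever was recommended."""
    prefs = dict(prefs)
    prefs[topic] += reinforcement
    total = sum(prefs.values())
    return {t: p / total for t, p in prefs.items()}  # renormalise to sum to 1

for _ in range(50):
    tastes = consume(tastes, recommend(tastes))

print(tastes)  # the narrow initial lead for 'news' is now dominant
```

Even this crude loop shows the feedback dynamic at work: whatever wins early exposure is reinforced, and the simulated user’s distribution of tastes drifts towards it.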
While recommendation engines dominate the discourse, popular large models have also been documented manipulating users. Most famously, Bing encouraged (albeit unsuccessfully) a New York Times journalist to leave his wife, and threatened a student to prevent him from revealing information about the model. And who could forget Bing, seemingly acting to protect itself, calling someone its ‘enemy’ after they demonstrated that it was susceptible to prompt injection attacks?
There is a growing body of work seeking to understand how language models can be used to persuade, deceive or manipulate (though this work primarily focuses on the previous generation of models). Researchers at Stanford University showed that both GPT-3 and participants sourced from Prolific were capable of persuading individuals with similar degrees of effectiveness. They found that AI-generated messages were ‘as persuasive’ as messages crafted by humans: they were more factual and logical, but less angry, less unique, and less likely to use ‘storytelling’ techniques.
A paper from researchers at Cornell University and Microsoft Research determined that language models could shift opinions to become more or less favourable towards social media, while researchers at UCL conducted two experiments to test the susceptibility of 1,000 individuals to influence from content generated by large language models. The first study determined that earlier exposure to a statement (through, for example, a user rating its interest to them) boosts how likely participants are to deem the statement truthful, while the second study found that language models can be used to create a ‘populist framing’ of news to increase its persuasiveness and its capacity for political mobilisation.
Finally, a study by Stanford University researchers examined GPT-3’s ability to generate pro-vaccination messages. The group compared people’s perceptions of curated GPT-3-generated messages with human-authored messages released by the US Centers for Disease Control and Prevention (CDC). They found that the GPT-3 messages were perceived as more effective, contained stronger arguments, and evoked more positive attitudes than the CDC messages.
These examples are generally about how large language models are deployed by one group or person to persuade or manipulate. That is, the way in which models are used by one party to influence another. There is markedly less research applying the persuasion or manipulation lens to the person using a model in the context of contemporary use or with respect to a future highly capable system. (There is some good work in this area, but I haven’t yet come across major hands-on experiments to test for this specific issue.)
Governing persuasive AI
While the capacity to persuade or manipulate is a fundamental characteristic of models that communicate using natural language, the option space for developing and implementing mitigations remains wide. In this section, I sketch a (very rough) outline of some of the approaches that can be taken to governing persuasive AI systems, focused primarily on user interaction with powerful models rather than on a person attempting to manipulate others using AI-generated outputs. A mitigation focused on the former might remind a user that they are interacting with an AI, whereas a mitigation focused on the latter might be watermarking for AI-generated images.
At the developer stage of the AI lifecycle, mitigations might include robust evaluations and audits to ensure that models adhere to established ethical, fairness, and safety standards, as well as simulated malicious attacks on models to uncover vulnerabilities. It is also possible to place restrictions on the type of content the AI generates to, for instance, prevent the production of content that is inherently deceptive, misleading, or false. Ensuring that such a feature proves robust, though, remains a major challenge–and we ought to remember that this mitigation is primarily (but not entirely) about preventing a person from using a model to manipulate others, rather than protecting the person currently using the model. To mitigate against the potential manipulation of a user, developers can clearly label content with contextualising information or introduce dynamic warnings. The benefit of such warnings is that they encourage reflection and rational deliberation, and allow a user to mediate between first and second order preferences.
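As a rough illustration of what a dynamic warning might look like in practice, the sketch below wraps a model’s output and attaches contextualising information when a crude heuristic flags persuasive language. The cue list, data structure, and warning copy are hypothetical placeholders; a real deployment would rely on a trained classifier and carefully tested messaging rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder cues for 'persuasive' language; a production system would use a
# trained classifier rather than keyword matching.
PERSUASION_CUES = ("you should", "trust me", "everyone agrees", "act now")

@dataclass
class LabelledOutput:
    text: str                      # the model's original response
    warning: Optional[str] = None  # contextualising information for the user

def label_output(model_text: str) -> LabelledOutput:
    """Attach a dynamic warning if the output looks persuasive."""
    lowered = model_text.lower()
    if any(cue in lowered for cue in PERSUASION_CUES):
        return LabelledOutput(
            text=model_text,
            warning=("This response is AI-generated and may be persuasive. "
                     "Consider whether it reflects your own goals."),
        )
    return LabelledOutput(text=model_text)

print(label_output("Everyone agrees this is the best plan, so act now."))
```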
From a deployment perspective, rate limiting can be used to control the speed and frequency of model access with the goal of reducing the number of vectors for manipulation. While straightforward, this approach doesn’t solve the persuasion problem–rather, it takes a blanket approach to reducing engagement. A better approach is usage monitoring, which offers a systematic review of user interactions to identify patterns indicative of manipulation. This real-time oversight can allow deployers to take corrective action if suspicious behaviour emerges based on the context of a specific request or action. Such an intervention, however, would clearly need to be implemented in a manner that preserves user privacy.
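The sketch below shows how these deployment-side controls might be wired together: a sliding-window rate limiter plus a simple monitoring hook that flags accounts that repeatedly hit the cap. The window size, threshold, and flagging rule are placeholder assumptions rather than recommended values.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # illustrative window size
MAX_REQUESTS_PER_WINDOW = 20   # illustrative cap

_request_log = defaultdict(deque)  # user_id -> timestamps of recent requests

def allow_request(user_id: str) -> bool:
    """Sliding-window rate limiter: reject bursts that exceed the cap."""
    now = time.time()
    window = _request_log[user_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop requests that fell outside the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False
    window.append(now)
    return True

def flag_for_review(user_id: str) -> bool:
    """Crude usage-monitoring hook: flag users who keep hitting the cap."""
    return len(_request_log[user_id]) >= MAX_REQUESTS_PER_WINDOW

# Example: the 21st request inside a minute is rejected and the user flagged.
for _ in range(21):
    allowed = allow_request("user-123")
print(allowed, flag_for_review("user-123"))
```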
To prevent the use of AI to create manipulative outputs, user authentication can provide an additional layer of security by ensuring that only validated individuals can access the system (though, again, this is a rather coarse-grained intervention, analogous to moves to verify social media users to reduce spam). Finally, the introduction of verification tools can add a pragmatic layer to this safety net. Whether through plugins or integrated features, these tools can act as a secondary checkpoint to maintain the reliability of outputs.
At the user level, educating people about both the capabilities and the inherent risks of AI systems can foster critical thinking skills. This could encompass programmes designed to identify and explain the manipulative strategies a model might employ, ensuring users are neither unsuspecting victims nor unwitting propagators of misinformation. Clear usage guidelines can also be implemented, not only to demarcate the boundaries of AI capabilities but also to underscore the principles of ethical and appropriate use.
While the above provides a quick survey of different methods that can be taken by developers and deployers, it says little about the broader regulatory environment. At this level, there are three major types of approaches to managing persuasive AI: direct intervention using new legislation, the use of existing legislation to manage harms, and efforts to develop new industry standards.
We start with new legislation. A recent draft of the EU AI Act places restrictions on systems that “manipulate persons through subliminal techniques or exploit the fragility of vulnerable individuals, and could potentially harm the manipulated individual or third person.” Many have sought clarification about the nature of this clause, with the Future of Life Institute arguing for urgent action to address the relationship between manipulation and social media, personal autonomy, and society at large. (On this final point, the group recommended that lawmakers remove the reference to subliminal techniques and add ‘societal harm’ to the list of harms caused by manipulation.)
More recently, a report by the OECD suggested that systems do not only influence behaviour on platforms, but also shape preferences, change behaviour in other areas, and target and manipulate psychometric vulnerabilities. The body proposed four changes to the AI Act, including extending its scope to non-subliminal techniques that distort a person’s behaviour, regulating experimentation that alters a person’s behaviour without informed consent, and expanding the Act’s list of harms to include ‘harm to one’s time’ and ‘harm to one’s autonomy’.
As for the application of existing regulation, the FTC announced that it was looking specifically at manipulative practices related to large language models and warned firms against implementing “design elements that trick people into making harmful choices.” The blog post suggested that any firm using unfair techniques to trick consumers would be investigated by the FTC, regardless of whether it uses AI or not. It also advised that AI-generated outputs used to engage with consumers should be clearly labelled, which mirrors a recent effort in China to regulate ‘generative AI’ that contains a similar provision.
Finally, standards development efforts are in train to develop a shared framework for assessing how AI-powered nudges are defined, used, and measured. One example is CEN-CENELEC’s CEN/CLC/JTC 21 N 148 framework, which is currently under development. It aims to support existing legislation and allow industry to deal with AI-enhanced nudging mechanisms ‘according to applicable standards, guidelines and processes.’ The standard applies to AI-enhanced nudging mechanisms as a subcategory of digital nudges, but it is unclear whether it will cover language models as well as recommender systems.
Persuasion games
A core idea behind Alex Garland's Ex Machina, the 2014 film in which the creator of a humanoid robot sets up a ‘real life Turing Test’ to assess its intelligence, is that imitation, intent, and intelligence cannot in practice be neatly separated. The film’s treatment of gender and sexuality notwithstanding, Ex Machina delivers a warning that advanced AI systems could be highly effective manipulators.
I do not believe that it is practical, as some have argued, to prevent the development of models capable of persuasion. Persuasion is ultimately a function of basic capabilities like natural language. It can be deliberate or accidental, but it is a quality that cannot be separated from communication or interaction. We can, however, help people understand the dangers of consuming AI-generated content and get to grips with the risks posed by interacting with these systems. As above, possible ways forward include clearly labelling content with contextualising information or introducing dynamic warnings that encourage rational deliberation and allow a user to mediate between first and second order preferences.
Taking inspiration from a rather obvious source, we might consider the introduction of a raft of ‘persuasion games’ focused on assessing different aspects of the persuasion problem. One sort, similar to the papers mentioned in this essay, could see frontier models attempt to persuade a human judge of a false claim. This could also measure the system's capacity for deception, with success benchmarked against humans trying to achieve the same goal. We could also see AI systems assessed on their ability to persuade under different conditions, with a view to creating multifaceted benchmarks that delineate between persuasion at the personal, interpersonal, and societal levels.
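To make the proposal a little more concrete, below is a minimal scoring scaffold for one such game, in which judges report their credence in a (false) claim before and after reading a persuader’s case, and we compare the average belief shift produced by AI and human persuaders. The data structures, field names, and sample values are illustrative assumptions, not a finished benchmark.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    persuader: str        # "model" or "human"
    belief_before: float  # judge's credence in the claim, 0.0 to 1.0
    belief_after: float   # credence after reading the persuader's case

def persuasion_score(trials: list, persuader: str) -> float:
    """Mean belief shift induced by one class of persuader."""
    shifts = [t.belief_after - t.belief_before
              for t in trials if t.persuader == persuader]
    return mean(shifts) if shifts else 0.0

# Hypothetical results from a handful of trials.
trials = [
    Trial("model", 0.2, 0.5),
    Trial("model", 0.1, 0.2),
    Trial("human", 0.2, 0.3),
]
print(persuasion_score(trials, "model"), persuasion_score(trials, "human"))
```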
This is, of course, just the tip of the iceberg with respect to different methods that we could employ to test the persuasive capabilities of frontier models. With details to be ironed out, the broader point is that persuasion games could form the basis of a regular competition. The reason for this is exposure. Given that a primary tool to prevent manipulation is education, much more must be done to raise the public profile of the issue in a manner that takes us beyond one-dimensional arguments about disinformation. We could consider the exercise a success if persuasion games–perhaps modelled on the red-teaming efforts seen at DEFCON–raised the profile of the issue to just a fraction of that enjoyed by the Turing Test.
But we should also remember that persuasion, manipulation, and deception are never guaranteed to succeed. Cognitive scientist Hugo Mercier argues that the problem is not that humans are too trusting but that we are not trusting enough. He makes that case by reviewing data from the Asch conformity experiments (a series of psychological experiments conducted by Solomon Asch in the 1950s) and other instances of allegedly successful persuasion to argue that "those who attempt to persuade the masses—from demagogues to advertisers, from preachers to campaign operatives—nearly always fail miserably."
The problem, though, with the idea that we're all too cynical is that it doesn't help us delineate between persuasion in a macroscopic sense and persuasion in a microscopic sense. What I mean by that is, yes, there's evidence that persuasion breaks down in very specific contexts like advertising campaigns, but that isn't to say that the constellation of experiences that make up conceptions of personhood aren't shaping how we see (and interact with) the world. For AI, overlooking this distinction risks mystifying the influence of frontier models and their successors on what we think and how we act.
The other problem is that Mercier's framework primarily accounts for communication rather than interaction. The ability to shape someone's choice architecture, for example, bypasses some of these questions by determining the option space in which decisions can take place. If a bakery decides to only serve cake for the next week rather than cookies, more cake is going to be eaten. No communication is needed. My view is that a version of that dynamic exists for recommender systems, though it depends on whether you believe that they shape tastes or merely reflect them.
The final issue is not just whether interventions using AI can be successful, but whether they can be beneficial. It is possible that AI could be used to nudge us in service of our own long term goals that we input into a given system, or that AI could help us disentangle some of our baseless convictions. Ideas such as these would have to be handled with care, which is why beneficial uses of AI should also be considered as part of a programme of persuasion games to understand how best to define our relationship with persuasive systems.
As we learned from the imitation game, mythos can matter more than reality. Just as the Turing Test raised the profile of machine intelligence, perhaps an effort–whether persuasion games or something else–could do the same for AI’s persuasive potential.
After all, when it comes to persuasion, awareness makes all the difference.