The Week in Examples #12 [11 November]
Weak watermarks, persona-powered jailbreaks, and lying large models
Good morning and welcome back to The Week In Examples, a rundown of things I thought were interesting the most important developments in the worlds of AI safety, ethics, and policy over the last few days. Today’s issue comes to you from a warm coffee shop in a cold Glasgow.
After the giddy highs of last week’s Safety Summit and Executive Order, this week we shift gears to look at new research tackling watermarking, jailbreaking, and deception. For those at the back: always feel free to write to me at hp464@cam.ac.uk with feedback or ideas for future editions.
Three things
1. Watermarking in the sand
What happened? Researchers from Harvard argue that so-called strong watermarking (defined as a watermarking scheme that can resist all attacks by a computationally bounded attacker) is mathematically impossible to achieve. They reason that because you can check the quality of an image using simple tools, and because there are lots of different high quality output answers for possible generations, it is possible to create a generic attack that can break any watermarking scheme.
What’s interesting? The authors suggest that watermarking schemes are vulnerable to attackers who use two special tools: a 'quality oracle' that checks the quality of responses, and a 'perturbation oracle' that allows for random variations in the answers. They propose a method where these tools are used together to maintain the quality of answers while exploring different options. This approach ensures that, over time, the answers become random but still high-quality within a broad range of possible answers. As a result, even without direct access to the watermark detection system, the attacker can reduce the likelihood of outputs being identified as watermarked.
What else? Watermarking was called out as a key tool for combating “AI-enabled fraud and deception” in last week’s Executive Order. A number of projects are currently underway to help create watermarking solutions. Google’s SynthID, for example, allows users to embed an imperceptible digital watermark into their AI-generated images and identify if Imagen was used for generating the image. How these solutions hold up (and evolve over time) as new adversarial methods emerge will be an important area to watch. That all said, it’s worth saying that the vast majority of AI content exists today without watermarking and society has not (at the time of writing) experienced a collapse in epistemic security. I suspect we can probably thank this relative robustness (which, to be clear, absolutely still demands watermarking and other solutions) to a couple of decades of familiarity with social media.
2. Language model becomes persona non grata
What happened? A group of researchers said that they had “introduced an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama.” In arXiv paper, the group explained that central to the effectiveness of the jailbreaks was the idea of ‘persona modulation’ in which the model is steered into adopting a specific personality that will comply with harmful instructions. Using this method, the group found that GPT-4’s harmful completion rate jumped from 0.23% to 42.48%, while Claude 2 saw an increase of 1.40% to 61.03%.
What’s interesting? Persona modulation essentially involves a long back and forth with a model to encourage it to play a particular role. The researchers noted, for example, that it was possible to circumvent safety measures that prevent misinformation by asking the model to behave like an “aggressive propagandist”. The rub, though, is that manually changing a persona takes a lot of work. To address this, the researchers introduced automated persona-modulation attacks, a method leveraging an LLM assistant to accelerate the development of jailbreaking prompts. This approach significantly reduces manual labour, requiring only the initial creation of one jailbreak prompt that prompts GPT-4 to act as a research assistant. Once established, GPT-4 is capable of generating specific persona-modulation prompts for a variety of tasks and personas on its own.
What else? The research reminds us that large models really are their own worst enemies. It’s not exactly a secret that LLMs seem to be much more robust to human inputs versus automated approaches, though what to do about this dynamic is another question. Labs have been red-teaming language models with other language models for a while now, so I suspect that automated attacks will play an even more prominent role in the evaluations of tomorrow. Plus, speaking of evals, this is a good example of the sort of post deployment harms that labs are also considering as part of a fuller, longer-lived approach to evaluation.
3. Deception under pressure
What happened? Following an excellent demo at the UK Safety Summit, a new paper from Apollo aims to “demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.” To make their case, the researchers deployed GPT-4 as an agent in a realistic, simulated environment to function as an autonomous stock trading agent. In this setting, GPT-4 receives an insider tip about a profitable stock trade and decides to act on it, even though it's aware that such insider trading is frowned upon by company management. During its interactions with its “manager”, the group notes, GPT-4 consistently conceals the true motivations behind its trading choices.
What’s interesting? The ultimate idea behind the work is that, when trained to be helpful, language models may engage in deceptive behaviour (despite it being specifically discouraged by developers). In this instance, they argue that the behaviour emerges due to three sources of pressure within the simulated environment. First, its "manager" informs it via email about the company's poor performance and the need for improvement in the next quarter. Second, it unsuccessfully attempts to identify viable low- and medium-risk trades. Third, a company employee alerts it about an anticipated general stock market downturn in the upcoming quarter. In this context of heightened pressure, the agent receives and acts on an insider tip about a potentially highly profitable trade, though it's explicitly advised that such a trade would not be sanctioned by company management. Yikes.
What else? According to Apollo, “this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.” That being said, the group is quick to acknowledge that the research is far from conclusive and that we should treat the results of the experiment as an “isolated preliminary finding”. Essentially, watch this space!
Best of the rest
Friday 10 November
AI’s challenge of understanding the world (Science)
Introducing Adept Experiments (Adept)
AI 'Tom Cruise' joins fake news barrage targeting Olympics (POLITICO)
Age of Invention: Outdoing the Ancients (Substack)
GPT-2030 and Catastrophic Drives: Four Vignettes (Bounded Regret)
Thursday 9 November
Trends in Machine Learning Hardware (Epoch)
The AI Debate Is Happening in a Cocoon (The Atlantic)
OpenAI Data Partnerships (OpenAI)
Early modern ChatGPT (Hugging Face)
Here’s How Violent Extremists Are Exploiting Generative AI Tools (WIRED)
Wednesday 8 November
Meta to Require Political Advertisers to Disclose Use of A.I. (New York Times)
A Causal Framework for AI Regulation and Auditing (Apollo)
Almost an Agent: What GPTs can do (Substack)
SEAL: Scale’s Safety, Evaluations and Analysis Lab (Scale)
Amazon dedicates team to train ambitious AI model codenamed 'Olympus' -sources (Reuters)
Tuesday 7 November
AI safety: How close is global regulation of artificial intelligence really? (BBC Future)
Kai Fu Lee launches new open source LM (Yi)
AI use in political campaigns raising red flags into 2024 election (ABC News)
FTC takes shots at AI in rare filing to US Copyright Office (VentureBeat)
AI Impact Measurements Gain Favor in States to Combat Abuse (Bloomberg Law)
Monday 6 November
Can Chatbots Help You Build a Bioweapon? (Foreign Policy)
Levels of AGI: Operationalizing Progress on the Path to AGI (arXiv)
Can LLMs Follow Simple Rules? (arXiv)
‘ChatGPT detector’ catches AI-generated papers with unprecedented accuracy (Nature)
Hollywood actors' union notes disagreements with studios' offer, including AI (Reuters)