A science of evals, cybersecurity, and power-seeking behaviour [TWIE #23]
The Week In Examples | 27 January 2024
It is another Saturday morning, which means it’s another round-up of the mixture of things I enthusiastically bundle together as news, reports and research about AI and society. This time around, we have calls from Apollo Research for a science of AI evaluations, a report from the UK’s National Cyber Security Centre (NCSC) making the case that AI will “almost certainly” increase the volume and heighten the impact of cyber attacks, and research reviewing major arguments about existential risk.
As always, message me at hp464@cam.ac.uk if you have any thoughts on the newsletter or what would make it better. I’m also considering adding job vacancies for AI policy, safety, governance, ethics, ops etc. roles into future editions, so I’d be interested to know whether there’s any appetite for that.
Three things
1. Researchers concerned evals don’t measure up
What happened? The folks at AI safety group Apollo made the case for a ‘science of evals’ in a new blog post. The authors argue that researchers typically struggle to identify the upper bounds of a model’s capabilities—essentially, what it is potentially capable of—because you can squeeze a lot of extra juice out of LLMs with exotic prompting regimes. Researchers at Microsoft, for example, used a carefully engineered prompting strategy to get a vanilla version of GPT-4 to outperform models specifically fine-tuned on medical knowledge across the MultiMedQA benchmark suite.
What's interesting? One interesting aspect of the post is the collision of two important, though not always complementary, ideas. First, because we need to measure the potential of models, “it is important to understand how to elicit maximal rather than average capabilities.” Second, models are highly sensitive to prompts, which means we need to test them repeatedly to determine how they perform over time and in the real world: “In contrast, even everyday products like shoes undergo extensive testing, such as repeated bending to assess material fatigue.” The tension is that we need to know both the upper bound of dangerous capabilities and how a model is likely to behave in the wild. If we are, as the post suggests, using evals to shape policy, should we do so based on the worst-case scenario or on typical usage?
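To make the distinction concrete, here is a minimal sketch of the kind of harness this implies: run the same benchmark under several prompt variants and report both the average score (a proxy for typical behaviour) and the best score (a proxy for elicited, upper-bound capability). Everything here is illustrative: query_model, the prompt variants and the two-item dataset are placeholders of my own, not anything from the Apollo post.

```python
# Illustrative eval harness: average vs. best-prompt ("elicited") performance.
# query_model is a hypothetical stand-in; swap in a real model client to use it.
import random
import statistics

def query_model(prompt: str, question: str) -> str:
    # Placeholder model call so the sketch runs end to end without an API key.
    return random.choice(["A", "B", "C", "D"])

PROMPT_VARIANTS = [
    "Answer the following multiple-choice question.",
    "Think step by step, then give a final answer.",
    "You are a careful domain expert. Answer the question.",
]

# (question, correct answer) pairs; a real harness would load a benchmark here.
DATASET = [("Placeholder question 1?", "A"), ("Placeholder question 2?", "C")]

def accuracy(prompt: str) -> float:
    correct = sum(query_model(prompt, q) == answer for q, answer in DATASET)
    return correct / len(DATASET)

scores = [accuracy(p) for p in PROMPT_VARIANTS]
print(f"average across prompts (typical use): {statistics.mean(scores):.2f}")
print(f"best prompt (upper-bound estimate):   {max(scores):.2f}")
```

The gap between those two numbers is roughly what the post is worried about: report only the first and you understate what a determined user can elicit; report only the second and you overstate what typical deployment looks like.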
What else? Put another way: understanding model capabilities and agreeing appropriate risk thresholds are two different things. In aviation, which the authors also reference, we know that planes can crash, but we still let them fly. To inform good policy, we need to know a) the maximally bad outcomes associated with the most capable models; b) how likely these outcomes are to occur post-deployment; c) the marginal risk posed by the models (i.e. how much riskier they are than existing technologies); and d) the opportunity cost of not deploying a given model. All that said, pushing for a science of evals—with robust and widely adopted standards and norms—to better measure a) and b) is a good place to start.
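For what it’s worth, the a) to d) questions can be strung together into a very crude back-of-the-envelope comparison. The sketch below is purely illustrative: the fields, the arithmetic and the numbers are my own toy framing, not anything proposed by Apollo.

```python
# Toy scaffold for the a)-d) questions above; units and numbers are made up.
from dataclasses import dataclass

@dataclass
class DeploymentAssessment:
    worst_case_severity: float      # a) maximally bad outcome, in arbitrary harm units
    probability_post_deploy: float  # b) chance that outcome occurs after deployment
    baseline_risk: float            # c) comparable expected harm from existing technology
    forgone_benefit: float          # d) opportunity cost of not deploying

    def marginal_expected_harm(self) -> float:
        # Expected harm from the model over and above the status quo.
        return self.worst_case_severity * self.probability_post_deploy - self.baseline_risk

    def net_case_for_deployment(self) -> float:
        # Positive values favour deployment under this crude accounting.
        return self.forgone_benefit - self.marginal_expected_harm()

example = DeploymentAssessment(
    worst_case_severity=100.0,
    probability_post_deploy=0.01,
    baseline_risk=0.5,
    forgone_benefit=2.0,
)
print(example.net_case_for_deployment())  # 1.5 with these made-up numbers
```

The point is not the arithmetic but the dependency: better evals tighten the estimates of a) and b), and everything downstream of them.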
2. Cyber security group interrogates large models
What happened? The UK’s National Cyber Security Centre (NCSC) analysed the impact of AI on cybersecurity, arguing that AI “will almost certainly [95% – 100% chance] increase the volume and heighten the impact of cyber attacks over the next two years.” In a very readable eight-page report, the NCSC also said AI will “primarily offer threat actors capability uplift in social engineering” by assisting in translation and preventing spelling and grammatical errors. As for more advanced capabilities like malware development, the organisation said only those with resources, expertise and data will “benefit from its use in sophisticated cyber attacks to 2025.”
What’s interesting? It wasn’t all bad news, with the report acknowledging that “AI can improve the detection and triage of cyber attacks and identify malicious emails and phishing campaigns, ultimately making them easier to counteract.” While the assessment says “more work is needed” on these defensive benefits of AI adoption, the dynamic draws into focus one of the major questions surrounding the development of sophisticated AI systems: does their diffusion favour the attacker or the defender? This issue is often debated in the context of biorisk, with AI safety advocate Jeffrey Ladish recently saying that access to extremely capable systems would mean that “everyone could make pandemic viruses”, while Anduril founder Palmer Luckey reckons that the diffusion of biotechnologies favours defenders over the long term.
What else? For cybersecurity, like biorisk, it’s hard to know just what the offence-defence balance looks like. Firms have been creating AI-powered cybersecurity systems for the last few years, like the UK’s Darktrace, and the last few months have seen a few new entrants like Wraithwatch and Cranium raise money to build countermeasures to AI cyberthreats. In the near term, I don’t expect commercially available generative AI systems to move the needle in favour of bad actors, primarily because people are already primed to resist fraud in the age of social media and ubiquitous internet access. The real question is, by the time systems are capable of automatically executing sophisticated cyberattacks with little or no oversight, how good will defensive countermeasures be?
3. Power-seeking AI finds critics
What happened? A new paper from researchers at Rutgers University and the University of Oxford takes stock of long-running debates about the extent to which AI is likely to pose a catastrophic or existential risk to humanity. It primarily focuses on so-called ‘power seeking’ behaviour, which the authors define as “the notion that some advanced AI systems are likely to function as agents pursuing goals, and as a result, are likely to engage in dangerous resource acquiring, shutdown-avoiding, and correction-resisting behavior.” This paper touches on a lot of the major beats related to the extreme risk discourse, so I’d recommend it for anyone interested in an overview of the field through a critical lens.
What's interesting? The authors argue that power-seeking behaviour is likely to be relevant for a specific sub-set of goals rather than for all goals an AI system might have. They make the case that what matters for problematic sub-goals (e.g. rapid self-improvement, preventing shutdown, acquiring resources) is the actual future goals of a given system, rather than all possible goals it could in principle pursue. While I agree that the actual goals of a future system matter more than the range of goals it could hypothetically undertake, the rub is that many actual goals of a powerful system may fall inside the range deemed conducive to power-seeking behaviour. Of course, they may not (because it’s hard to say which goals will provoke which types of action), but therein lies the problem: we simply don’t know.
What else? These sorts of risks are being looked at in more detail by major developers, while the UK’s AI Safety Institute is hiring for a ‘loss of control workstream lead’. The profile of such risks has increased substantially in the last 18 months or so as progress in capabilities has compressed timelines and dissolved some of the longstanding barriers between ‘short-term’ and ‘long-term risks’. Of course, risks that have already manifested themselves in the real world demand urgent attention—but I expect (and hope) to see more groups tackle existing and speculated risks as part of a more holistic approach to AI development and deployment.
Best of the rest
Friday 26 January
AI Needs So Much Power That Old Coal Plants Are Sticking Around (Bloomberg)
AI Survey Exaggerates Apocalyptic Risks (Scientific American)
TV channels are using AI-generated presenters to read the news. The question is, will we trust them? (BBC)
Data gold rush: companies once focused on mining cryptocurrency pivot to generative AI (The Guardian)
Launching our new hub in San Francisco (EF)
Thursday 25 January
A New National Purpose: Leading the Biotech Revolution (TBI)
FTC Launches Inquiry into Generative AI Investments and Partnerships (FTC)
White House science chief signals US-China co-operation on AI safety (FT)
X is being flooded with graphic Taylor Swift AI images (The Verge)
We Asked A.I. to Create the Joker. It Generated a Copyrighted Image. (NYT)
Wednesday 24 January
Will AI transform law? (AI Snake Oil >> Substack)
MambaByte: Token-free Selective State Space Model (arXiv)
Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them (WIRED)
Apple boosts plans to bring generative AI to iPhones (FT)
EU wants to upgrade its supercomputers to support generative AI startups (TechCrunch)
Tuesday 23 January
Unsocial Intelligence: a Pluralistic, Democratic, and Participatory Investigation of AGI Discourse (arXiv)
Visibility into AI Agents (arXiv)
Washington state's plan to combat election deepfakes (Axios)
NSF launches AI research hub to broaden access to infrastructure and education (SiliconAngle)
North Korea's AI development raises concerns, report says (Reuters)
Welcome to AI university (Politico)
Pope Francis warns of ‘perverse’ AI dangers (The Hill)
AI and crypto mining are driving up data centers’ energy use (The Verge)
A.I. Should Be a Tool, Not a Curse, for the Future of Work (The New York Times)
Monday 22 January
Sounding the Alarm: AI’s Impact on Democracy and News Integrity (Fordham Law)
WARM: On the Benefits of Weight Averaged Reward Models (arXiv)
Full final text of EU AI Act (X >> Luca Bertuzzi)
Customer satisfaction plunges as AI chatbots take charge (The Telegraph)
How AI Can Help Humans Become More Human (TIME)
State and local meddling threatens to undermine the AI revolution (The Hill)
How federal agencies can integrate Generative AI and automation (Federal Times)
AI is destabilizing ‘the concept of truth itself’ in 2024 election (The Washington Post)