The Year in Examples 2023
Speedrunning 2023 and thinking about what comes next for AI and society
Instead of our regular The Week in Examples, the folks at the lab have conjured up The Year in Examples: a slaloming run through the things that I think say something about the relationship between AI and society as we look ahead to next year. There isn’t much rhyme or reason to the different areas I’ve picked, other than that they are the things I’m interested in and that, collectively, they tell us something about our current moment.
Before we get started, though, I want to say thanks very much to everyone who has read, commented, and emailed over the last five months. We move forward into 2024, where I hope to get back to writing more essays. As always, message me at hp464@cam.ac.uk for comments, ideas for next time, or to say hello.
Foundation models put down roots
GPT-4 was released in March. Despite concerns about whether it’s getting worse, whether it’s becoming lazy, or whether it's too woke, it remains a phenomenally capable model that I use pretty much every day. A successor will soon replace GPT-4, so in the grand scheme of things, its significance comes down to just how much better it is than the GPT-3.5 model that it followed. After March, no one was saying that capabilities weren’t zooming along (well, I guess some people will still be saying it’s all smoke and mirrors at the heat death of the universe) and more people than ever were asking whether scale is in fact all you need.
Then came the rest. We had two versions of Anthropic’s Claude in March and July, “the second best model in the world” from self-styled AI studio Inflection AI, and Grok from Elon Musk’s new xAI group. Not to mention the Gemini model from my colleagues at Google DeepMind. There was also a slew of smaller ‘open source’ releases (though often not entirely open source) like the Falcon family of models from the Technology Innovation Institute in Abu Dhabi, Mistral’s 7B and Mixtral models, and of course Llama 2 from Meta.
So, lots of models were released in 2023, but how much are they being used? Well, a lot as it turns out – especially by young people. Let's start with ChatGPT, which Reuters reckoned had amassed a total of 180 million unique monthly users as of September. Say it quietly, but I have also heard estimates (it was communicated to me in a dream if you must know the source) that put the figure closer to 400 million, or about 10% of global internet users. That would mean that somewhere between roughly 5% and 10% of global internet users have visited the ChatGPT platform since its release at the tail end of last year. And of course Bard, Character AI, Bing and others have all built substantial user bases, too.
As well as taking a top-down look at adoption from the point of view of a given platform, we can also look at usage in the real world through the imperfect lens of survey data. In the UK, the national communications regulator Ofcom reported in 2023 that over half of adults with internet access had used a generative AI tool in the past year. ChatGPT was the most widely used application at 23%, followed by Snapchat’s My AI (15%), Bing Chat (11%), Google Bard (9%), and Midjourney (9%). Young people are most likely to use Snapchat’s My AI, which became freely available to all Snap users in April 2023 and is used by about half (51%) of 7–17-year-olds in the UK. As for enterprise, the training provider O’Reilly said that two-thirds of businesses in North America and Europe were using generative AI by November 2023 (with 16% of those saying they were using open source models).
There are lots of different ways people use large models, which makes it challenging to understand the net impact on education, jobs, media consumption, arts and culture, and the information environment. At the risk of shooting from the hip, I generally think that commentators have overcooked the short-term societal impact of AI. We all hear about the impending deluge of misinformation, the collapse of educational integrity, and models helping people make bioweapons. As it turns out, though, Harvard thinks misinformation is overpriced, Stanford says cheating doesn’t seem to be enjoying a shot in the arm, and RAND concedes that today’s large models don’t seem to be any better at helping someone conduct biological terrorism than a run-of-the-mill search engine. Clearly, there are lots of individual problems that do exist (for example, students being wrongly accused of plagiarism) but this is exactly my point: there is a risk that the treatment is worse than the ailment.
While it’s clear that people are using large models in various professional and personal contexts, what the technology means for the broader economic environment is a bit more murky. In March, OpenAI released its ‘GPTs are GPTs’ paper, which estimated that around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs. Then there have been plenty of anecdotal reports about people in the creative industries (primarily artists and copywriters) who have struggled to find freelance work since generative AI platforms were released en masse. In general, I expect much more to come in 2024 as companies, countries, and people grapple with how these technologies reshape the economic environment. The most likely outcome that I see is the integration of these tools into existing jobs in the short term, but that dynamic is up for grabs depending on how the technology develops.
That takes me to my next point. Once upon a time we talked about ‘large language models’, then ‘foundation models’, and more recently ‘frontier models’ has become the preferred moniker. The core difference here (aside from what it means to be a platform versus what it means to be at the cutting edge of development) is that language models are generally confined to, and defined by, a single modality.
Not so for frontier models, which can respond to inputs and generate outputs that are aural, visual, or textual in nature. At the start of 2023 there were no major consumer multimodal models. Now, millions of people send ChatGPT photos of clothes to find a corresponding brand, half-empty fridges when they’re lacking culinary inspiration, and in-progress Scrabble boards when they feel like cheating. I like to practise my (extremely rough) Spanish with Inflection’s Pi, and send ChatGPT screenshots of Duolingo when I need someone to explain grammar that has gone over my head.
I rarely see people reflect on just how impressive multimodality is, or what it means for the next generation of models coming down the line once we significantly reduce latency (which would remove what I see as the primary bottleneck to widespread consumer use, assuming a degree of progress on the capabilities front). These two things, progress in raw multimodal capability and improvement in reaction times, are about to open up a world where agents, assistants, or whatever you want to call them get a lot better and a whole lot more common. That’s the core change that makes it difficult to make predictions about the impact of AI in the year ahead: the move from models as tools to models as agents capable of acting with a degree of autonomy, or, well, agency.
The likes of BabyAGI and AutoGPT proved that simple GPT-powered agents were possible using non-specialised architectures, while the GPT Store shows us what personalised models look like at scale. Proof of concept broadly in the bag, I expect to see assistants get much better, with OpenAI already thinking about how these types of systems ought to be governed. I also imagine that foundation models for robotics will drive very significant capability increases, so keep an eye out for agents and robotics (or a combination thereof) in the 2024 edition of The Year in Examples.
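For anyone who hasn’t peeked under the hood, the ‘simple’ in simple GPT-powered agents really is simple. The sketch below is a rough, hypothetical rendering of the BabyAGI/AutoGPT-style loop rather than either project’s actual code: a chat model, a task list, and nothing architecturally exotic. The call_llm stub is a stand-in for whichever provider’s chat API you happen to use.

```python
# A minimal sketch of an AutoGPT/BabyAGI-style agent loop (illustrative only).
# call_llm is a hypothetical placeholder; swap in your provider's chat API.

def call_llm(prompt: str) -> str:
    """Stand-in for a chat completion call."""
    raise NotImplementedError("plug in your model provider here")


def run_agent(objective: str, max_steps: int = 5) -> list[str]:
    completed = []                                    # record of finished tasks
    tasks = [f"Make a plan to achieve: {objective}"]  # seed the task list
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.pop(0)
        # Step 1: ask the model to carry out the task.
        result = call_llm(f"Objective: {objective}\nTask: {task}\nComplete the task.")
        completed.append(f"{task} -> {result}")
        # Step 2: ask the model what, if anything, remains to be done.
        follow_up = call_llm(
            f"Objective: {objective}\nJust completed: {task}\nResult: {result}\n"
            "List any remaining tasks, one per line, or reply DONE."
        )
        if follow_up.strip().upper() == "DONE":
            break
        tasks.extend(line.strip() for line in follow_up.splitlines() if line.strip())
    return completed
```

That the whole pattern fits into a couple of dozen lines is part of why I expect assistants to improve quickly: most of the heavy lifting sits inside the model, not in the scaffolding around it.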
Safety enters the mainstream
2023 was the year that everyone suddenly became very interested in AI safety, which, for the avoidance of doubt, I understand to mean practices concerned with mitigating catastrophic and existential risks associated with highly capable AI systems. For a long time it was just a bunch of effective altruism- or rationalist-adjacent folks, but now everyone has their very own P(doom) expressing just how likely they think catastrophe is.
Perhaps the most consequential event for raising the profile of AI safety was the UK’s AI Safety Summit hosted in November, which saw representatives from nation-states, industry, and civil society sit at the same table to take seriously the risks posed by frontier AI models. Amidst two days of keynotes, workshops, and demos, the summit culminated in the ‘Bletchley Declaration’, an agreement to work together on safety standards to maximise the upside and minimise the risks posed by frontier AI systems. While the US Secretary of Commerce Gina Raimondo used the Summit as an opportunity to highlight new policy interventions from the Biden administration (more on that below), Chinese Vice Minister Wu Zhaohui urged attendees to “ensure AI always remains under human control” and said that governments should work to “build trustworthy AI technologies that can be monitored and traced.” Wu’s comments came amidst heavy criticism of China’s inclusion at the summit on national security grounds. Worries aside, though, the organisers moved ahead believing that having China inside the tent boosted the credibility of the event and would be necessary at the point at which international governance gets going in earnest (more on that below, too).
Perhaps the most significant output of the Summit (other than, of course, the reality-bending spectacle of King Charles weighing in on AI safety) was an agreement to run regular summits. The first of these, which is set to take place in South Korea in six months’ time, will provide another opportunity for states to discuss the risks posed by AI and will see the release of a new ‘State of the Science’ report to identify emerging risks associated with frontier AI. It’s easy to overlook now, but lots of folks I spoke to in the run-up worried that the UK event represented a once-in-a-lifetime moment to secure an international agreement on AI governance. That it has become a series means those fears were unfounded.
While the summit was probably more successful than many predicted, it was not without its critics. Civil society groups, for example, argued for a broader definition of AI safety, which was something Vice President Harris alluded to in her speech. While I liked that the communiqué was signed by a fairly broad range of countries, it’s also worth saying that I suspect the tight focus on safety was directly responsible for enabling attendees to throw their weight behind the declaration and align behind shared initiatives.
While world leaders got up to speed with AI safety, labs made moves of their own. In July, Anthropic, Google, Microsoft, and OpenAI announced the launch of the Frontier Model Forum (FMF), a new body aiming to ensure the safe and responsible development of frontier AI models. The Forum, which has its own year in review, brings together leading labs to advance AI safety research, identify best practices for model development, and collaborate with stakeholders including policymakers, academics, and civil society.
In October, the FMF announced the appointment of a new executive director and the creation of an AI Safety Fund, a $10 million initiative to promote research in the field of AI safety. The fund's primary focus will be to support the development of new evaluations and techniques for red teaming models to mitigate potentially dangerous capabilities. Next year, the Frontier Model Forum will establish an Advisory Board to help guide its strategy and priorities, while the AI Safety Fund will issue its first call for proposals.
Finally, we have evals. Not all that long ago, evals meant efforts to assess models' performance on specific tasks, pre-launch, on single-moment-in-time quantitative benchmarks like ImageNet or MNIST. Today, AI practitioners are exploring ways to evaluate how safe or ethical model outputs are as well as their impact in the real world. Earlier this year, my colleagues at Google DeepMind released two excellent papers. The first outlined a process for conducting evaluations for dangerous capabilities, while the second identified three main types of sociotechnical evaluations of AI safety risks: (a) those that assess a model's capabilities; (b) those that assess risks stemming from how people interact with an AI model; and (c) those that evaluate longer-term societal effects, such as employment or environmental effects, as AI becomes more widely used across society.
But evals are hard. In October, Anthropic wrote an excellent review of the challenges associated with evaluating AI systems, including the implementation of benchmarks, the subjectivity of human-led evaluations, and the risks of relying too heavily on model-generated approaches. Meanwhile, others have made the case that large models pose their own unique evaluation difficulties due to issues like variance in prompting strategies, disconnects between benchmarks and the real world, and contamination between training and testing data.
Related to evals is the emergence of responsible scaling policies. In December, OpenAI announced a “preparedness framework” to track, evaluate, predict, and protect against catastrophic risks posed by powerful models. The policy, which follows the establishment of OpenAI’s preparedness team in October, initially seeks to define risk thresholds for 1) individualised persuasion, 2) cybersecurity, 3) chemical, biological, radiological, and nuclear (CBRN) threats, and 4) autonomous replication and adaptation (ARA). The basic idea is that developers ought to set thresholds for models to be evaluated against, which in turn trigger increasingly potent mitigation measures that would need to be implemented before development could continue or a model could be deployed.
This is essentially OpenAI’s version of Anthropic’s AI Safety Levels (ASLs) for addressing catastrophic risks, which are modelled loosely on the US government’s biosafety level (BSL) standards for the handling of dangerous biological materials. The core idea in both cases is that development and deployment ought to be contingent on the introduction of increasingly sophisticated mitigations once evaluations have revealed certain risks associated with a given model.
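To make that shared logic a little more concrete, here is a toy sketch of threshold-gated deployment in the abstract. The risk categories, levels, and mitigations below are illustrative stand-ins of my own, not OpenAI’s or Anthropic’s actual thresholds.

```python
# A toy illustration of evaluation-gated deployment (not either lab's real policy).
# The idea: the worst risk level revealed by evals determines which mitigations
# must be in place before development or deployment can continue.

RISK_LEVELS = ["low", "medium", "high", "critical"]

REQUIRED_MITIGATIONS = {
    "low": set(),
    "medium": {"usage monitoring"},
    "high": {"usage monitoring", "restricted deployment"},
    "critical": {"usage monitoring", "restricted deployment", "pause further scaling"},
}


def can_deploy(eval_results: dict[str, str], mitigations_in_place: set[str]) -> bool:
    """Deployment proceeds only if mitigations cover the worst evaluated risk level."""
    worst = max(eval_results.values(), key=RISK_LEVELS.index)
    return REQUIRED_MITIGATIONS[worst] <= mitigations_in_place


# A hypothetical model scoring "high" on cyber evals needs both of the relevant
# mitigations in place before deployment can go ahead.
print(can_deploy({"cybersecurity": "high", "persuasion": "low"},
                 {"usage monitoring"}))                             # False
print(can_deploy({"cybersecurity": "high", "persuasion": "low"},
                 {"usage monitoring", "restricted deployment"}))    # True
```

The interesting (and contested) questions are all about where those thresholds sit and who gets to verify the evaluations, which is where the evals work above comes back in.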
Regulation goes global
Only a few days before the UK’s AI Safety Summit was set to begin in November, the US revealed its hotly anticipated Executive Order on AI. The 63-page document is a bit of a beast, spanning standards for biological synthesis screening, guidance on watermarking to clearly label AI-generated content, cybersecurity measures, and a host of efforts focused on privacy, civil rights, and consumer protection. Perhaps the most eye-catching part of the announcement, though, was news that the US will require companies to report training runs above a certain size (in this case, 1e26 FLOP). To put that figure in context, the threshold is roughly 5x the estimated training compute of GPT-4.
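If you want a rough sense of what 1e26 FLOP means in practice, here is a back-of-the-envelope sketch using the common approximation that training compute for a dense transformer is roughly 6 x parameters x training tokens. The GPT-4 figure is an outside estimate rather than anything OpenAI has disclosed, so treat the ratio as indicative only.

```python
# Back-of-the-envelope check against the Executive Order's reporting threshold.
# Uses the rough rule of thumb: training FLOP ~ 6 * parameters * training tokens.

REPORTING_THRESHOLD_FLOP = 1e26  # threshold named in the Executive Order


def training_flop(parameters: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return 6 * parameters * tokens


# Widely cited external estimate for GPT-4 (not an official figure).
gpt4_estimate_flop = 2e25
print(REPORTING_THRESHOLD_FLOP / gpt4_estimate_flop)  # ~5x GPT-4 under that estimate

# A hypothetical 500B-parameter model trained on 30 trillion tokens:
run = training_flop(500e9, 30e12)        # 9e25 FLOP
print(run >= REPORTING_THRESHOLD_FLOP)   # False -> just under the reporting line
```

In other words, the requirement is aimed at the next generation of training runs rather than the current one.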
As part of the same announcement, the Biden administration released for public comment its first-ever draft policy guidance on the use of AI by the U.S. government, announced that 31 nations have joined the United States in endorsing its declaration on use of AI in the military, and unveiled $200 million in funding toward public interest efforts to mitigate AI harms and promote responsible use and innovation. Finally, the Executive Order also directed the Department of Commerce to establish the United States AI Safety Institute (US AISI) inside NIST, which will mirror the function of the UK AI Safety Institute that succeeded its Frontier AI Task Force. The US AISI will, according to the government, operationalise NIST’s AI Risk Management Framework by creating guidelines, tools, benchmarks, and best practices for evaluating and mitigating dangerous capabilities and conducting evaluations including red-teaming to identify and mitigate AI risk.
The central question here is what comes next: whether we are likely to see primary legislation remains uncertain, and it is possible that some aspects of the Order may overstep the bounds of Executive authority. To add another element of uncertainty into the mix, Donald Trump said he will cancel the EO if elected on the basis that it impinges on the right to free speech.
Staying with domestic regulation, in December European politicians reached a political agreement on the AI Act, marking a significant step towards the finalisation of the EU’s flagship AI law. The legislation, first proposed by the European Commission in April 2021 and approved in its draft form by the European Parliament in June 2023, aims to regulate the deployment and use of AI systems in the EU. It classifies AI systems into categories based on their perceived risk level, with specific bans on applications that pose an "unacceptable risk" and varied obligations for those considered "high risk" or "limited risk".
The latest text proposes the regulation of foundation models, which it defines as those trained on large volumes of data and adaptable to various tasks. For these models, developers must draw up technical documentation, comply with EU copyright law, and provide detailed summaries of the content used for training. Questions remain about the extent to which the provisions will hinder the international competitiveness of European companies, with French President Macron stating that the rules risk hampering European tech firms compared to rivals in the US, UK, and China.
It’s likely that we’ll see the OECD and the upcoming Italian G7 Presidency collaborate closely in 2024 to operationalise their respective principles in alignment with the proposed EU AI Act. China, which has a number of specific AI laws already in place (for example, those focused on generative AI or recommender systems), is preparing a horizontal piece of legislation like the EU AI Act that is likely to become law in 2024.
Speaking of the G7 principles, in October everyone’s favourite group of industrialised nations announced an International Code of Conduct for Organisations Developing Advanced AI Systems, following the “Hiroshima Process” that began at the 49th G7 summit hosted in Japan in May. The principles are largely modelled on previously agreed commitments, including those that companies made at the White House in July, such as measures to limit misuse, invest in cybersecurity, and identify vulnerabilities through red-teaming.
The Code of Conduct is one of several ongoing international governance initiatives. In December, the UN's High-Level Advisory Body on AI released an interim report outlining preliminary recommendations for the organisation. Rather than proposing a specific model for AI governance, the authors instead opted to provide general principles that could guide the formation of new global governance institutions for AI, as well as a broad assessment of the functions that such bodies should perform. There are seven types of function suggested by the report, which range from those that are easier to implement (e.g. a horizon scanning function similar to the IPCC) to those that are more challenging (e.g. monitoring mechanisms “inspired by [the] existing practices of the IAEA”).
The report also proposes efforts to facilitate the development, deployment, and use of AI for economic and societal benefit, as well as the promotion of international collaboration on talent development, access to compute infrastructure, and the building of high-quality datasets. Taken in combination with safety efforts, it's not a million miles away from the ‘dual mandate’ model that I have written a bit about in the past (yes, I have no shame in citing myself). There’ll be much more to come on this one because the Advisory Body will submit a second report by 31 August 2024 that “may provide detailed recommendations on the functions, form, and timelines for a new international agency for the governance of artificial intelligence.” While ‘may’ is obviously doing some heavy lifting, I suspect this will be one to mark on the calendar.
Wrapping up
And that brings us to the end of my 2023 AI and society speedrun. There were lots of things I missed, especially the emergence of specialist video generation and music generation models, the use of AI in media, and various movements and moments in the worlds of policy and governance.
That all said, I hope the above gives you a bit of a sense of where I think we are now (importance of multimodality, high adoption, victory for safetyism, moves on national and international regulation), and what I think comes next (the agentic turn, robotics, safety to stay, international governance with teeth). I also cut a section on model access and efforts to develop AI systems using public input, both of which I expect to attract a lot of attention in 2024.
Making predictions always reminds me of Amara’s law: we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run. For AI, like all technologies, the crux of the issue is the space between development and adoption. It’s all well and good that the pace of development is motoring along, but what matters for society is how people use AI.
Of course, not all technologies are made equal. Should sophisticated agents, which are likely to have a very different adoption profile to tool-based AI, take the stage in 2024, I would naturally expect to see a greater impact on how we all live, work, and play. But, agents or otherwise, governance will (and should) be a constraining factor in the year ahead and those that follow. After all, safe systems are the only kind that can be transformatively beneficial.
Whatever the case, I will be covering all things AI and society as they happen in 2024 in Learning From Examples. Thanks for reading and see you there.