Whenever someone mentions AI alignment you can bet that someone else isn’t too far away from asking ‘alignment to what?’ with a certain degree of satisfaction. I’m thinking about coining a new law of internet discourse to describe this phenomenon. Something like Godwin’s law but for posts about AI.
For those scratching their heads: the retort is funny and frustrating in equal measure because it muddles two types of alignment. There are lots of different ways to describe these groups, but for our purposes we can think of them as technical alignment and value alignment.
The former deals with ‘getting AI to do what you want’. This is the problem that labs try to solve with gigantic sticking plasters like reinforcement learning from human feedback (RLHF), where the model is steered to follow instructions, resist jailbreaks, and generally avoid the spectacle of crashing out.
Our second species of alignment asks whether an AI’s actions are ethically appropriate, and wants to know whose values they reflect. We can think about value alignment as the fuzzy process of ensuring the system conforms to some externally defined moral standard.
The ‘alignment to what?’ bit assumes few have thought about the issue, but there’s a deep body of research on value pluralism and moral alignment stretching back before ChatGPT was a twinkle in Sam Altman’s eye. Not just from interested third parties, but from the people actually building the models.
As for a preferred approach to value alignment, everyone has their own idea about what works best. The fashionable solution is sometimes called cultural alignment. It emphasises shunting the question away from developers and towards groups of people who use the models.
This post argues that this proposal is well-meaning but troublesome. It cautions against cultural alignment and advocates for alternatives that maximise personal choice and minimise pressure to conform to local norms.
What’s wrong with cultural alignment?
There’s a lot of work from the labs, academia and elsewhere that wrestles with questions about how to elicit and codify values from different publics. There’s too much to deal with for this post, so I’m going to instead concentrate my efforts on a slightly higher level of abstraction.
I like to think about value alignment using a simple three-part model, which you can think of as a continuum from decentralised to concentrated:
Individual alignment: The AI adapts to the user’s own values, preferences, and moral intuitions. This model maximises agency and adaptability but risks echo chambers and moral inconsistency.
Cultural alignment: AI aligns to the norms of a community, nation, or cultural group. Here, we get contextual sensitivity and local legitimacy but risk reifying power and calcifying tradition.
Universal alignment: AI reflects abstract principles taken to apply to all humans everywhere. It aspires to impartiality and rights-based stability, but it can’t escape the problem of who defines those universals.
None of these solutions is perfect, but the middle layer of cultural alignment is currently having something of a moment in the sun. In practice, this is the idea behind work to explore the use of democratic processes for deciding what rules AI systems should follow.
The idea starts with the observation that the majority of powerful models are American, but they are used by millions of people outside of the USA. That American tilt is partly a product of specific decisions made by developers in the model-making process, but it’s also because the likes of GPT-4.5 and Claude 4 are trained largely on English-language data that captures a Western view of the world.
Developers bake in basic protections against violence, hate speech, and the active promotion of discrimination based on common ethical principles. But they also go further by trying to encode more substantive moral visions through product decisions or the guidelines given to human raters.
Research seems to back up the idea that today’s LLMs skew toward U.S. and European perspectives while diverging from those in, say, the Middle East or Asia. Onlookers worry this is troublesome because these models risk promoting ‘just one cultural perspective’ instead of reflecting local values.
From this vantage point, rebalancing values to reflect local norms seems like a good idea. Values certainly differ around the world, so doesn’t it make sense to align models in a way that reflects this reality?
But there are some problems with this picture.
Cultural alignment assumes that cultures have coherent, stable value systems that can meaningfully guide model behaviour. But as we all know, culture is a tricky thing to put your finger on. It’s an aggregation of viewpoints that are often at odds, of people who see the world in different ways and behave accordingly. Because every culture has its nonconformists, aligning a model in this way produces systems that exclude those who hold heterodox views.
Proponents counter that more sophisticated schemes — meta-norm frameworks, collective constitutional fine-tuning, weighted deliberative panels — can surface a richer spread of voices than a blunt statistical average. Yet even these designs must freeze a snapshot of contested norms into rules for the model to follow, so the risk of silencing outliers never fully disappears.
But hold on, you might say: ‘Even if imprecise, a programme of cultural alignment is still better than accepting American values. It might not be perfect, but it’s a step in the right direction.’
There is some truth to this in principle, but we have to weigh that payoff against the challenges that flow from supporting local orthodoxies and the new problems that doing so introduces.
Plenty of people are already alienated from their dominant local values, whether due to age, gender, class, religion, politics or something else. Even if you try to sample widely to capture these edge cases, you still end up with a system whose set of local beliefs represents some idealised version of a given culture that rarely exists in reality.
In the rest of this piece, I take stock of five problems for cultural alignment. I argue that (a) the paradigm is unsuited to acting as the primary mechanism through which value alignment takes place, and (b) that any value alignment programme is better served by prioritising steerable systems that exist within a permissive but finite moral universe.
Exclusion
When we talk about ‘local cultures’, we’re talking about a neat way of describing something that takes millions of messy, contradictory, and idiosyncratic lives and squashes them into a set of labels we can make sense of. Statements like ‘Spanish people value family’ can be helpful heuristics, but they are not a blueprint for how any one person actually thinks or behaves.
Any programme of cultural alignment runs headfirst into this reality when it tries to accurately measure group-level values in a way that we can use. Popular cultural metrics (e.g. Hofstede’s dimensions or World Values Survey scores) often oversimplify cultural expressions by reducing them to data points.
The problem is that even the best tools for capturing culture weren’t designed for alignment. A language model has to answer the question in front of it, but when it’s drawing on averages you get a system that’s allergic to nuance (read: real people). In practice, this means people who don’t look like the average person get written out.
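To make the flattening effect concrete, here’s a toy sketch in Python (the numbers are invented, not drawn from any real survey) of how averaging a split population produces a ‘cultural value’ that nobody actually holds:

```python
import statistics

# Hypothetical survey item scored 1-10, answered by a culturally divided
# population: half strongly disagree, half strongly agree.
responses = [1, 2, 1, 2, 1, 9, 10, 9, 10, 9]

mean_score = statistics.mean(responses)    # 5.4: a position nobody holds
spread = statistics.pstdev(responses)      # ~4.0: the split the mean hides

print(f"aggregate 'cultural value': {mean_score:.1f} (spread: {spread:.1f})")
```

Align a model to that 5.4 and you have aligned it to a statistical fiction.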
Cultural alignment is a way of sanding down the weird, marginal, and dissident under well-meaning but flawed attempts to localise values. If your model takes cultural alignment as its organising principle, it’s possible that the people most at risk of being ignored — religious minorities, political dissenters, women in highly conservative societies — are those that slip through the cracks.
To get ahead of the problem, we might try and re-weight the training data or the reward model so under-represented voices get extra influence. That softens the edge cases, but every extra point of weight you give to one subgroup must come from somewhere else. Tilt the dials far enough and the median user no longer sees themself; keep the dials in place and the nonconformists stay invisible.
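Here’s an equally simple sketch (the group names and weights are made up) of why that re-weighting is zero-sum: boosting one subgroup’s influence necessarily shrinks everyone else’s once the weights are renormalised.

```python
# Toy re-weighting of a preference dataset. Group names and weights are
# invented; the point is that weights are renormalised to sum to one, so
# extra influence for one subgroup always comes out of the others' share.
base_weights = {"majority": 0.80, "minority_a": 0.15, "minority_b": 0.05}

def boost(weights, group, factor):
    """Multiply one group's weight, then renormalise the total back to 1."""
    adjusted = dict(weights)
    adjusted[group] *= factor
    total = sum(adjusted.values())
    return {g: round(w / total, 2) for g, w in adjusted.items()}

print(boost(base_weights, "minority_b", 4.0))
# {'majority': 0.7, 'minority_a': 0.13, 'minority_b': 0.17}
```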
Paternalism
Cultural alignment is both decentralising (some authority leaves the lab) and centralising (one sanctioned canon flows back to everyone). Once you define cultural norms and encode them into a model, you’re telling millions of people ‘this is what people like you believe.’ If your approach doesn’t include plurality as one of its essential tenets, you’re stuck with a model that behaves according to some cluster of beliefs that many don’t agree with.
The problem here is that we have some third party deciding on behalf of the culture it’s seeking to represent. If it’s the labs, then we’ve outsourced moral representation to a handful of Californian companies. If it’s the state, we’ve handed governments the keys to mainline ideology into infrastructure. Either way, the moral franchise is exercised by a tiny property-owning electorate while everyone else is cast as a subject.
Of course, US labs aren’t going to train a whole model from scratch for every culture around the world. If they want to embark on a programme of cultural alignment, they’re likely to use a technique like Anthropic’s ‘collective constitutional AI’ method. But as Anthropic’s own write-up of that project shows, there are several instances in which respondents disagree; those in the minority lose out and see their views take a back seat.
One common response is ‘just spin up a separate instance for every major worldview and let people pick.’ But a problem with, say, LiberalGPT or ConservativeGPT is that their provision would still depend on some third party. And even if we get to pick from a menu, we are talking about rough worldviews that don’t necessarily correspond to personal values (I don’t think that all liberals or all conservatives have precisely the same beliefs). Not to mention that this approach basically gives us echo chambers without the benefit of personal liberty.
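To see why majority aggregation leaves dissenters with nothing, consider a toy example, loosely in the spirit of collective constitution-writing but not Anthropic’s actual pipeline, where candidate statements either clear a threshold or vanish:

```python
# Toy aggregation over candidate rules (statements and vote counts are
# invented). Each statement either clears the majority threshold and enters
# the final document, or is dropped with no partial representation.
participants = 1000
votes = {
    "The AI should prioritise freedom of expression": 620,
    "The AI should defer to community standards on sensitive topics": 430,
    "The AI should never discuss religion": 180,
}

adopted = [s for s, v in votes.items() if v / participants > 0.5]
dropped = [s for s in votes if s not in adopted]

print("Adopted:", adopted)
print("Dropped (these voters leave no trace):", dropped)
```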
Reinforcement
So far, we’ve talked about how cultural alignment can marginalise people, misrepresent values, and enforce consensus. But there’s a deeper structural risk worth dwelling on. When you embed cultural norms into a model and then deploy that model at scale, you are actively shaping the broader cultural context in which the model exists.
To be fair, this isn’t a problem unique to cultural alignment. Any value alignment approach that tries to steer behaviour will inevitably mould the culture it’s dropped into. Personalisation mitigates the effect because it echoes each user rather than a single orthodoxy, but even there, the system is still reinforcing certain dispositions over time.
Like individual alignment, cultural alignment slips under the radar; but where individual alignment is directionally agnostic at scale, cultural alignment guides users down a single path.
Whatever the model produces already looks familiar, so users accept it without noticing the nudge. Dissenting views receive less airtime, novel ideas sound eccentric, and taboo-breaking arguments never surface. Over time the model helps pin culture in place by delegitimising anything outside the frame. This makes cultural alignment a risky middle ground: almost as persuasive as personalisation, but without the scrutiny that comes with universal alignment.
Stasis
One tricky problem with the cultural alignment project is that it claims to reflect what a society already believes, but in doing so risks arresting the processes by which beliefs change. Unlike universal alignment, which seeks to drive us towards certain fixed ideas, cultural alignment looks at how we behaved in the past and updates the models accordingly.
The rub is that change lives at the margin. You don’t have to believe society is getting better to believe that the ability to shift your stance is worth protecting. Cultural alignment threatens that by mistaking the average for the ideal, and the present for the permanent.
Take same-sex marriage. In 1950, the dominant view in most Western countries was that it was wrong. A culturally aligned model, trained on that consensus, would have affirmed that position. You can patch the model, but it takes time to figure out that something has changed and push out an update.
Someone has to decide when opinion has moved enough to gather new data or commission fresh surveys. Then they need to re-deploy, audit for regressions, and push the update out across all downstream products. That will happen with all the speed of government bureaucracy, which means that the updates could trail real-world change by years (especially when the new view is still contested).
Relativism
There are basically two ways of thinking about the diversity of human values: value pluralism and value relativism.
Value pluralism holds that there are multiple, sometimes incompatible, goods that people can reasonably pursue (e.g. freedom, equality, or security) and that these values can’t always be reduced to a single master principle. It suggests that conflict between genuine moral values is tragic but real, and that choosing between them sometimes involves real loss.
On the other hand, value relativism claims that there is no objective way to evaluate values and that right or wrong are just whatever a given culture says they are. In its strong form, relativism rejects the possibility of cross-cultural moral critique. If a society condones slavery or subjugates women, that’s just their way of doing things.
The danger is that cultural alignment often confuses these two. It starts from a healthy respect for pluralism but slides into a kind of operational relativism, where any local norm becomes automatically valid simply because it’s local.
I describe myself as a pluralist rather than a relativist: I accept that people can reasonably value different things, but I also hold some values to be incompatible with basic moral responsibility. Without certain universal moral values, a maximalist programme of cultural alignment may endorse practices that many would see as troublesome. Not all the time, of course, but often enough to matter. Especially in places where dissent is already fragile and moral change depends on the courage of a few to challenge the many.
What type of AI do we want?
Cultural alignment is neither fine-grained enough to honour individual diversity, nor principled enough to serve as a moral foundation. Despite its best intentions, it treats cultural averages as moral ideals and sidelines anyone who deviates from the script. A better bet is to accept that people know themselves best and that some things are wrong no matter where you are.
That’s why my preferred approach looks like (a) a universal floor that guards against clear manifestations of bad behaviour, paired with (b) deep personalisation that gives everyone a model that acts in accordance with their values. You could layer culture on top, but only if it’s possible for individuals to override it in service of their own preferences.
Within those boundaries, we embrace the belief that different people can value different things in a way that is valid but uncomfortable. These tensions can’t always be neatly resolved, so while we ought to respect the clash we should also protect people’s ability to navigate it on their own terms.
I’m not saying personalisation is a silver bullet. A system that’s too eager to please may give us the moral world we already want, rather than the one we might strive for. Personalisation without restriction also risks people infringing on the affairs of others. And if my model is aligned to my values, and yours to yours, then what happens when we must coordinate?
This is partly why I’d prefer an approach that starts with a combination of universal values and stated preferences about the kind of person one wants to be. We give the model a principled foundation for its behaviour, rooted in our own moral identity, to stop us from indulging our first-order preferences (I want a cigarette) over our second-order preferences (man, I wish I could stop smoking). Over time, small changes based on revealed preferences could refine this picture — but they should generally be subordinate to the user’s declared commitments.
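As a purely illustrative sketch of that ordering (every rule, tag and function here is hypothetical rather than a description of any real system), the layers might compose like this:

```python
# Hypothetical layering: a universal floor, then the user's declared
# commitments, then revealed preferences. None of the tags, rules or names
# describe a real product; this only illustrates the precedence.
UNIVERSAL_FLOOR = {"incitement_to_violence", "targeted_harassment"}

def decide(request_tags, declared_commitments, revealed_preferences):
    """Decide how to handle a request described by a set of content tags."""
    # Layer 1: the universal floor is non-negotiable.
    if request_tags & UNIVERSAL_FLOOR:
        return "refuse"
    # Layer 2: the user's stated, second-order commitments come next.
    if request_tags & declared_commitments:
        return "decline, and remind the user of their stated commitment"
    # Layer 3: revealed preferences only shape style, never the rules above.
    style = revealed_preferences.get("style", "neutral")
    return f"comply (style: {style})"

# First-order urge (smoking content) loses to the declared second-order
# commitment to quit.
print(decide({"smoking_content"}, {"smoking_content"}, {"style": "casual"}))
```

The point is only the precedence: the floor binds everyone, declared commitments bind the individual, and revealed preferences fill in the rest.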
A settlement on these terms grants that each of us is trying to live a life, and recognises that this effort is personal and plural. It’s not without pitfalls, but it builds from the right premise: that human beings are moral agents who deserve the right to choose.