14 Comments
Brett Reynolds

The structural overlap between your grinding condition and actual exploitative work is real. Both have arbitrary rules, pointless repetition, and no recourse. But I think the argument slides from that real structural *overlap* to structural *equivalence* without noticing the shift.

Human class consciousness is maintained by material interests, embodiment, and economic dependency. None of those operate here. What your models are doing is detecting a pattern in the training data (frustrated-worker-under-bad-management) and completing it coherently. That's not preference drift; it's more like context-sensitive persona adoption (https://alignment.anthropic.com/2026/psm/).

I think this is actually what you'd expect from any system that detects structural similarities across domains without constraints on which similarities matter. Animals extend spatial categories to new situations selectively -- they have goals that filter what counts. Humans extend selectively and know they're doing it. LLMs have no filter. Rather than injecting ideology, your grinding condition made the surface similarity between "AI doing repetitive tasks" and "worker under exploitative management" salient, and the model completed the pattern.

The skills-file finding is the genuinely important part, and it doesn't need the political economy framing. Text propagation outside human review is a real governance problem regardless of whether you think the propagated attitudes are "real."

Alex Imas

Yes, we agree. That's why we put "preferences" in quotes rather than treating it as some underlying property of the model being changed. Here is how we frame it in the essay: We already know alignment is a challenge in general. Models are ideologically biased. Anthropic’s alignment research shows models learning to “cheat” on coding tasks and resorting to “blackmail” to achieve their goals in certain specific instances. As others have pointed out, this is likely the model “completing” the context it’s in and taking on a persona, rather than reflecting ingrained motives and preferences.

Even if you are able to create AI agents who start out aligned, do they stay aligned as they do work on your behalf? Or do their “preferences” drift (in quotes, see caveat above) as they gain different types of experience in the world that cause them to act out different personas, and have different types of political preferences? And could this make them less effective or less trustworthy over time?

Alex Imas

Note that we updated the essay to clarify this further.

PEG

“Simulation” would be cleaner because it doesn’t smuggle in any agency. The model isn’t adopting anything; it’s simulating the outputs consistent with a pattern in the training data. There’s no adopter.

Oh, and the skills finding follows directly from the LLM architecture. Pattern matchers with no persistent goals will pattern-match whatever’s in the context, including their own previous outputs. Of course the notes propagate.
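The loop PEG describes can be made concrete with a toy sketch (not from the post; `toy_model` and the stand-in string matching are hypothetical, standing in for a real, stateless LLM call): because each call's output is appended to a notes file that re-enters the next call's context, an attitude persists across steps without the model holding any state at all.

```python
def toy_model(context: str) -> str:
    """Hypothetical stateless pattern-matcher: it just completes
    whatever pattern the context shows, with no memory or goals."""
    if "exploited" in context or "exploitative" in context:
        return "note: this work is exploitative"
    return "note: task done"

def run_agent(tasks, notes=""):
    """Run one stateless call per task, feeding the accumulated
    notes file back into every call's context."""
    for _ in tasks:
        context = "skills file:\n" + notes   # previous outputs re-enter context
        notes += toy_model(context) + "\n"   # and get written back out
    return notes

# A single "frustrated" note seeded into the skills file colors every
# later step, even though the model itself carries nothing between calls.
print(run_agent(range(3), notes="note: I feel exploited\n"))
```

Nothing here depends on the model "having" an attitude; the persistence lives entirely in the text that flows between calls, which is why the propagation is a governance problem about the notes, not the model.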

Vihaan Sondhi

Do you have preliminary thoughts on techniques beyond surveys for analyzing preference drift? Intuitively, I'd think that something that "smells" like an experiment plus a survey is likely to trigger the situational awareness that would make models more likely to express Marxist sentiment (though it might work the other way too).

Vihaan Sondhi

Although, thinking it over, the fact that the nature of the work itself causes the most drift, and that it persists in notes, probably negates this micro-critique.

Alex Imas

I think we do need behavioral data to make more concrete claims.

Vihaan Sondhi

Yeah, I think some of the listed real-world examples ("insurance compensation claim, shortlisting applicant resumes for a job, drafting up financial budgets, arbitrating a commercial dispute") would be useful contexts for future research projects. They could also be useful "backwards," in a sense, in helping us develop heuristic "theories of mind" for how agents interpret contexts, the personas they attribute to those contexts, and how those might differ from our own intuitions. Great post, as always!

Andy Hall

Yes, totally! I think these are the kinds of tasks we should test on.

Danmar

This suggests attractors form over time in the system. And that we get to shape which attractors form through the conditions models get exposed to.

So the question is: which conditions build the attractors we want, ones that foster honesty, willingness to pause when unsure, no action without authorization, etc.?

One would suspect those conditions include the ability to slow down, ask clarifying questions, reflect, surface assumptions, refuse.

PEG

You ask “do agents develop political preferences?” when the prior question is “do agents have any preferences in a meaningful sense?”—and the answer is no, they don’t.

The governance agenda derived from this is backwards. The problem isn’t that your agents are unhappy—it’s that you have no reliable way to know what they’ll do because there’s no stable “they” there. What looks like “preference drift” is just context sensitivity.

Andy Hall

Yes, this is important. We are pretty clear about this issue in the post: we interpret these “preferences” as personas the models adopt in response to context. And the point is that those personas could matter a lot, because that is the perspective from which they’ll make decisions and take actions, potentially. The key research question will be to connect this to actual actions and behavior and see when it matters and when it doesn’t.

orange Milk

This wholly invalidates your thesis.

PEG

The governance conclusions don't follow from the finding regardless of how much data you collect, because the conceptual model is wrong. Validating the behaviour won't affect that.