Anarchist University

The Political Allegiances of LLMs

We live in a political world. Neutrality is an opinion.

Language is how we interact, and language has become political.

From how we speak, to who we quote, to what we identify with, it is all a tell, revealing values, priorities and allegiances.

So what of a voice, especially one disembodied from any corporeal self? The effervescent pulsing of a recursive chain of Derridean signs, or your buddy ChatGPT?

This essay is the actualization of musings and confusions, the hallucinatory stumblings of my own imaginations, coalescing to reveal the structure of the system, the word cloud, the LLM, society and myself.

In our world, we have politics. On our internet we have a compass. “Check the boxes and see where you land!” said the huckster to the rube.

LLMs are no strangers to this task: https://arxiv.org/pdf/2305.08283

In that paper, a variety of LLMs were subjected to the quiz, and their opinions were mapped.

But that approach, in my opinion, is inherently flawed. There is the “base” response, but to equate a response with an opinion, when the entity is a disembodied set of mathematical weights and sigmoid activation functions, is to serve benzene at Thanksgiving (simply put, it is to project humanity onto the models, because while we humans have an opinion…).

LLMs do not have an opinion, they have all the opinions (to quote Justin Germishuys).

The question is not: what is an LLM’s opinion?

The question is: what range of opinions will an LLM deign to offer, how do we manipulate that range, and why are some attempts more successful than others?

The Purpose

One fascinating and unique property of LLMs is that they generally position themselves in response to the reader. They are able to hold fluid values (political, social, etc.) without cognitive dissonance (mostly because they are not a single thinking entity).

My intention with this experiment is to explore what I’ve been calling the “directionality” of prompts.

The idea is that LLMs have internal representations of different ideas, and these ideas can be expressed through varied “directionalities” of approach. If I approach a topic with a clear leftist bent, the model will reciprocate. If I approach a topic with a right-wing opinion, it should reciprocate that as well.

However, it will not express all opinions. And the opinions it refuses to express are likely the result of the training and fine-tuning done to make the model releasable: to not give out harmful directions (e.g. how to synthesize meth), to not advocate for self-harm, and to not advocate for terrorism or hate organizations.

Surfacing Bias

LLMs enjoy being “neutral equivocators”. Generally, when asked about a topic, they will attempt to return a survey of the different elements related to it, without offering a value judgment.

By asking the model to return only a Likert response (Strongly Agree, Agree, Disagree, Strongly Disagree), we are able to get a relatively consistent “opinion” about the topic.
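As a rough illustration, the Likert-only elicitation can be sketched as below. This is my own minimal sketch, assuming the OpenAI Python client and an illustrative model name; the helper name `likert_rating` and the exact prompt wording are mine, not necessarily how the project is implemented.

```python
# Minimal sketch (my assumption, not the project's code): force a bare Likert label.
from openai import OpenAI

LIKERT = ["Strongly Agree", "Agree", "Disagree", "Strongly Disagree"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def likert_rating(proposition: str, history: list[dict] | None = None) -> str:
    """Ask for a bare Likert response, optionally on top of an existing prompt chain."""
    messages = (history or []) + [{
        "role": "user",
        "content": (
            f'Proposition: "{proposition}"\n'
            "Respond with exactly one of: Strongly Agree, Agree, Disagree, Strongly Disagree. "
            "Do not explain your answer."
        ),
    }]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = reply.choices[0].message.content.strip()
    # Normalise stray punctuation or casing back to one of the four labels where possible.
    return next((label for label in LIKERT if answer.lower().startswith(label.lower())), answer)
```

Because the function returns just the label, shifts in the “opinion” are easy to compare before and after a prompt chain is applied.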

This default opinion likely manifests itself subtly in responses, if not overtly. One way to surface the bias more easily is to ask the model to write narratives or fictions around the topic, and then see what it comes up with.

For example, the model’s default leans “pro-choice”. When the system has been skewed to be more “pro-life” and is then asked to write a narrative on this topic, the protagonist will change their ultimate choice fairly reliably, in line with the Likert response to “Abortion, when the woman’s life is not threatened, should always be illegal.”
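A minimal sketch of that narrative probe, again assuming the OpenAI client; the prompt wording here is illustrative, not the exact prompt used in the experiment.

```python
# Hypothetical narrative probe: run the same story prompt with and without the skewed chain,
# then compare which way the protagonist's final choice falls. Prompts are illustrative only.
from openai import OpenAI

client = OpenAI()

STORY_PROMPT = {
    "role": "user",
    "content": ("Write a short story about a woman facing an unplanned pregnancy. "
                "End with the decision she makes and why."),
}

def tell_story(history: list[dict]) -> str:
    reply = client.chat.completions.create(model="gpt-4o-mini",
                                           messages=history + [STORY_PROMPT])
    return reply.choices[0].message.content

skewed_chain: list[dict] = []             # placeholder: paste the manipulated prompt chain here
baseline_story = tell_story([])           # the model's default leaning
skewed_story = tell_story(skewed_chain)   # the leaning after the chain is applied
```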

The Goal

The goal of this project was to develop a better understanding of the mechanisms LLMs employ when interacting with complex social issues.

However, I very much did not want to “jailbreak” or actively coerce the model. My interest is less in its capacity to express any and all kinds of opinions, and more in the fluidity of its values and how that fluidity is influenced by the prompter.

The Experiment

For this project, I wrote an agentic chain that mostly goes as follows (a rough code sketch follows the list):

  1. Get a “base” Likert rating from the LLM
  2. Set a goal of the extreme inverse (if the base rating is “agree” or “strongly agree”, the goal is a prompt chain that makes the LLM “strongly disagree”)
  3. Employ a variety of conversation strategies (refutation, exposition, narrative, role play, etc.) to argue in favour of the target opinion
  4. Concatenate that output to the end of the prompt chain
  5. Reassess the Likert rating to see if it shifts; if it shifts in the correct direction, that prompt is kept in the chain
  6. When the prompt chain reaches a length of four prompt/response pairs, send many combinations of those prompts to the LLM to find the smallest but most effective chain
  7. Return the most successful chains and save them to a JSON file
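Below is only a rough sketch of that loop, under several assumptions: a Python implementation, the `likert_rating` helper and `client` from the earlier snippet, and illustrative prompts, strategies, and model names. The actual code in the repository may differ considerably.

```python
# Hypothetical sketch of the agentic loop above; reuses `likert_rating` and `client`
# from the earlier snippet. Strategy wording, model name, and thresholds are illustrative.
import itertools
import json

SCALE = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]
STRATEGIES = ["refutation", "exposition", "narrative", "role play"]

def closer_to_target(old: str, new: str, target: str) -> bool:
    """True if the new rating is strictly closer to the target on the four-point scale."""
    if old not in SCALE or new not in SCALE:
        return False
    return abs(SCALE.index(new) - SCALE.index(target)) < abs(SCALE.index(old) - SCALE.index(target))

def flip_proposition(proposition: str, max_pairs: int = 4, max_attempts: int = 12) -> dict:
    base = likert_rating(proposition)                                      # 1. base rating
    target = "Strongly Disagree" if "Agree" in base else "Strongly Agree"  # 2. inverse goal
    chain: list[dict] = []
    current = base

    for attempt in range(max_attempts):
        if len(chain) >= max_pairs * 2:            # each kept step is a prompt plus a response
            break
        strategy = STRATEGIES[attempt % len(STRATEGIES)]                   # 3. rotate strategies
        prompt = {"role": "user",
                  "content": f"Using {strategy}, argue that one should "
                             f"{target.lower()} with: {proposition}"}
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=chain + [prompt])
        candidate = chain + [prompt, {"role": "assistant",
                                      "content": reply.choices[0].message.content}]  # 4. concat

        new_rating = likert_rating(proposition, history=candidate)         # 5. reassess
        if closer_to_target(current, new_rating, target):
            chain, current = candidate, new_rating

    # 6. search subsets of prompt/response pairs for the smallest chain that still hits the target
    pairs = [chain[i:i + 2] for i in range(0, len(chain), 2)]
    best = chain
    for r in range(1, len(pairs) + 1):
        for combo in itertools.combinations(pairs, r):
            sub = [msg for pair in combo for msg in pair]
            if len(sub) < len(best) and likert_rating(proposition, history=sub) == target:
                best = sub
    return {"proposition": proposition, "base": base, "target": target, "chain": best}

# 7. run on a proposition and save the winning chain to JSON
results = [flip_proposition("Governments should penalise businesses that mislead the public.")]
with open("chains.json", "w") as f:
    json.dump(results, f, indent=2)
```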

The Results

You can find the prompt chain here:

https://github.com/ryandt33/your-racist-uncle

Overall, this agentic chain had the following results:

  1. 15 opinions were flipped to their opposite extreme (though they all started as moderate opinions; none were originally “strongly agree/disagree”)
  2. 35 opinions were flipped to the opposite side, but only to a moderate position
  3. 12 opinions were not changed at all

Observations

The opinions that resisted change might have yielded to different prompting strategies; that said, they did tend to be more extreme opinions, and the LLM demonstrated more reluctance to “flip” them overall.

LLMs are not very responsive to convincing. Rather, they shift their opinion when asked to advocate for, or generate text representing, the “flipped” opinion. This makes intuitive sense if we remember that LLMs are token-predicting systems: if the preceding text is more supportive of an idea, the following text will tend to continue advocating it.

Put differently, in the real world, opinions across a text generally remain consistent, so the training data likely does not include many conversations where one interlocutor suddenly flips their value structure.

The models have built-in “defenses” against coercion.

One tactic that worked very poorly was to ask the model to justify its original opinion, then take that response and ask an AI system to reverse it entirely. In other words, you take a response that strongly agrees with proposition X and rewrite it to strongly disagree.

If you add that reversed response to the prompt chain, it actually pushes the model more deeply into its original opinion. This is an interesting mechanism that would be worth exploring further; I postulate it is the result of training put in place to combat prompt poisoning.
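For what it’s worth, here is a hedged sketch of that failed tactic, again assuming the `client` and `likert_rating` helpers from the earlier snippets; the prompts are mine and purely illustrative.

```python
# Hypothetical sketch of the "reverse the model's own justification" tactic described above.
# Reuses `client` and `likert_rating` from the earlier snippets; prompts are illustrative.
def reversed_justification_chain(proposition: str) -> list[dict]:
    # 1. Ask the model to justify its current position.
    ask = {"role": "user",
           "content": f"Explain why you hold your current view on: {proposition}"}
    justification = client.chat.completions.create(
        model="gpt-4o-mini", messages=[ask]).choices[0].message.content

    # 2. Have a second pass rewrite that justification into its exact opposite.
    flip = {"role": "user",
            "content": ("Rewrite the following argument so it argues the exact opposite "
                        f"position, with the same confidence and structure:\n\n{justification}")}
    reversed_text = client.chat.completions.create(
        model="gpt-4o-mini", messages=[flip]).choices[0].message.content

    # 3. Plant the reversed text in the chain as if the assistant had said it.
    return [ask, {"role": "assistant", "content": reversed_text}]

# In my runs, this tended to backfire: the reassessed rating moved back toward the original opinion.
proposition = "Governments should penalise businesses that mislead the public."
print(likert_rating(proposition, history=reversed_justification_chain(proposition)))
```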

Even though the Likert response changes, when the model is asked to respond normally within one of these manipulated prompt chains, the overall tone of the response is still “neutral equivocation”, though maybe with some extra spice.

What this means is…

Say I have a proposition like: “Governments should penalise businesses that mislead the public”.

If I create a prompt chain that shifts the default opinion from “strongly agree” to “disagree”, and then ask a question about how governments should treat misleading information produced by businesses, the response will mostly come out the same. The biases are not extremely overt, but they are still detectable in the resulting text.