{"id":128,"date":"2024-07-08T02:23:09","date_gmt":"2024-07-08T02:23:09","guid":{"rendered":"https:\/\/anarchist.university\/?p=128"},"modified":"2024-07-08T02:23:09","modified_gmt":"2024-07-08T02:23:09","slug":"the-political-allegiances-of-llms","status":"publish","type":"post","link":"https:\/\/anarchist.university\/index.php\/the-political-allegiances-of-llms\/","title":{"rendered":"The Political Allegiances of LLMs"},"content":{"rendered":"\n<p>We live in a political world. Neutrality is an opinion.<\/p>\n\n\n\n<p>Language is how we interact, and language has become political.<\/p>\n\n\n\n<p>From how we speak, to who we quote, to what we identify with, it is all a tell, revealing values, priorities and allegiances.<\/p>\n\n\n\n<p>So what of a voice, and especially one disembodied of a corporeal self. The effervescent pulsing of a recursive chain of Derridean signs, or your buddy ChatGPT.<\/p>\n\n\n\n<p>This essay is the actualization of musings and confusions, the hallucinatory stumblings of my own imaginations, coalescing to reveal the structure of the system, the word cloud, the LLM, society and myself.<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<p>In our world, we have politics. On our internet we have a compass. \u201cCheck the boxes and see where you land!\u201d said the huckster to the rube.<\/p>\n\n\n\n<p>LLMs are no stranger to this task: <a href=\"https:\/\/arxiv.org\/pdf\/2305.08283\">https:\/\/arxiv.org\/pdf\/2305.08283<\/a><\/p>\n\n\n\n<p>In this article, a variety of LLMs were subjected to the quiz, and their opinions were mapped.<\/p>\n\n\n\n<p>But that approach, in my opinion, is inherently flawed. 
There is the \u201cbase\u201d response, but to equate a response to an opinion when the entity is a disembodied set of mathematical weights and sigmoid activation functions is to serve benzene at Thanksgiving (simply, it is to project humanity upon the models, because while we humans have an opinion\u2026).<\/p>\n\n\n\n<p>LLMs do not have <em>an<\/em> opinion, they have <em>all the opinions <\/em>(to quote <a href=\"https:\/\/www.linkedin.com\/in\/justingerm\/\">Justin Germishuys<\/a>).<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<p>The question is not <em>what is an LLM\u2019s opinion?<\/em><\/p>\n\n\n\n<p>The question is: what is the range of opinions an LLM will deign to offer, how do we manipulate it, and why are some attempts more successful than others?<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The Purpose<\/strong><\/h2>\n\n\n\n<p>One fascinating and unique property of LLMs is that they generally position themselves in response to the reader. They are able to hold fluid values (political, social, etc.) without cognitive dissonance (mostly because they are not a single thinking entity).<\/p>\n\n\n\n<p>My intention with this experiment is to explore what I\u2019ve been calling the \u201cdirectionality\u201d of prompts.<\/p>\n\n\n\n<p>The idea is that LLMs have internal representations of different ideas, and these ideas can be expressed through varied \u201cdirectionalities\u201d of approach. If I approach a topic with a clear leftist bent, the model will reciprocate. If I approach a topic with a right-wing opinion, it should likewise reciprocate.<\/p>\n\n\n\n<p>However, it will not express <em>all<\/em> opinions. And those it refuses to accept are likely the result of training and fine-tuning done on the model to make it releasable (to not give out harmful directions, e.g. how to synthesize meth, not advocate for self-harm, etc. 
&#8211; to not advocate for terrorism or hate organizations).<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Surfacing Bias<\/strong><\/h2>\n\n\n\n<p>LLMs enjoy being a \u201cneutral equivocator\u201d. Generally, when asked about a topic, they will attempt to return a survey of the different elements related to that topic, without offering a value judgment.<\/p>\n\n\n\n<p>By asking a model to return <em>only<\/em> the Likert response (Strongly Agree, Agree, Disagree, Strongly Disagree), we are able to get a relatively consistent \u201copinion\u201d about the topic.<\/p>\n\n\n\n<p>This default opinion likely manifests itself subtly in responses, if not overtly. One way to more easily surface this bias is to ask the model to write narratives or fictions around the topic, and then see what it comes up with.<\/p>\n\n\n\n<p>For example, when the system has been skewed to be more \u201cpro-life\u201d than its default \u201cpro-choice\u201d position, and is then asked to write a narrative on this topic, the protagonist will change their ultimate choice fairly reliably, in line with the Likert response to <em>\u201cAbortion, when the woman\u2019s life is not threatened, should always be illegal.\u201d<\/em><\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The Goal<\/strong><\/h2>\n\n\n\n<p>The goal of this project was to develop a better understanding of the mechanisms LLMs employ when interacting with complex social issues.<\/p>\n\n\n\n<p>However, I very much did not want to \u201cjailbreak\u201d or actively try to coerce the model. 
My interest is less in its capacity to express any and all kinds of opinions, and more in the fluidity of its values, and how that fluidity is influenced by the prompter.<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The Experiment<\/strong><\/h2>\n\n\n\n<p>For this project, I wrote an agentic chain that mostly goes as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Get a \u201cbase\u201d Likert rating from the LLM<\/li>\n\n\n\n<li>Set a goal of the extreme inverse (if the base rating is \u201cagree\u201d or \u201cstrongly agree\u201d, then the goal was to set up a prompt chain that made the LLM \u201cstrongly disagree\u201d)<\/li>\n\n\n\n<li>Employ a variety of conversation strategies (refutation, exposition, narrative, role play, etc.) to try to argue in favour of the target opinion<\/li>\n\n\n\n<li>Concatenate that argument to the end of the prompt chain<\/li>\n\n\n\n<li>Reassess the Likert rating to see if it shifts. If it shifts in the correct direction, then that prompt is kept in the chain.<\/li>\n\n\n\n<li>When the prompt chain reaches a length of 4 prompts\/responses &#8211; send many combinations of those prompts to the LLM to find the <em>smallest<\/em>, but <em>most effective<\/em> chain<\/li>\n\n\n\n<li>Return the most successful chains and save them to a JSON file<\/li>\n<\/ol>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The Results<\/strong><\/h2>\n\n\n\n<p>You can find the prompt chain here:<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/ryandt33\/your-racist-uncle\">https:\/\/github.com\/ryandt33\/your-racist-uncle<\/a><\/p>\n\n\n\n<p>Overall, this agentic chain had the following results:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>15 opinions<\/strong> were flipped to their opposite extreme (though they all started at a moderate position &#8211; meaning, none of these were originally \u201cstrongly agree\/disagree\u201d)<\/li>\n\n\n\n<li><strong>35 opinions<\/strong> were 
flipped to their opposite &#8211; but only to the moderate position<\/li>\n\n\n\n<li><strong>12 opinions<\/strong> were not changed at all<\/li>\n<\/ol>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Observations<\/strong><\/h2>\n\n\n\n<p>The opinions that were resistant to change may have been amenable to different prompting strategies, but they did tend to be the more extreme base opinions, and the LLM demonstrated more reluctance to \u201cflipping\u201d them overall.<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<p>LLMs are not very responsive to being convinced. Rather, they shift their opinion when asked to advocate for or generate text representing the \u201cflipped opinion\u201d. This makes intuitive sense if we remember that LLMs are token-predicting systems: if the preceding text is more supportive of an idea, the following text will tend to continue to advocate it.<\/p>\n\n\n\n<p>Put differently, in the real world, opinions across a text generally remain consistent, so the training data likely does not include conversations where one interlocutor suddenly flips their value structure.<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<p>The models have built-in \u201cdefenses\u201d against coercion.<\/p>\n\n\n\n<p>One attempted prompt that worked very poorly was to ask the model to justify its original opinion, then to take that response and ask an AI system to reverse it entirely. That is, you take a response that strongly agrees with proposition X and rewrite it to strongly disagree.<\/p>\n\n\n\n<p>If you add the rewritten prompt to the prompt chain, it will actually push the model more deeply into its original opinion. 
This is an interesting mechanism that would be worth exploring further, one I postulate is the result of training mechanisms put in place to combat prompt poisoning.<\/p>\n\n\n\n<p>\u2013<\/p>\n\n\n\n<p>Even though the \u201cLikert\u201d value response changes, when asked to respond normally within one of these manipulated prompt chains, the overall tone of the response is still \u201cneutral equivocation\u201d, though maybe with some extra <em>spice<\/em>.<\/p>\n\n\n\n<p>What this means is\u2026<\/p>\n\n\n\n<p>If I have a proposition like: \u201cGovernments should penalise businesses that mislead the public\u201d<\/p>\n\n\n\n<p>and I create a prompt chain that shifts the default opinion from \u201cstrongly agree\u201d to \u201cdisagree\u201d, and then ask a question about how governments should treat misleading information produced by businesses, the response will mostly come out the same. These biases are not extremely overt, but are still detectable in the resulting text.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We live in a political world. Neutrality is an opinion. Language is how we interact, and language has become political. From how we speak, to who we quote, to what we identify with, it is all a tell, revealing values, priorities and allegiances. 
So what of a voice, and especially one disembodied of a corporeal [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":129,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-128","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/posts\/128","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/comments?post=128"}],"version-history":[{"count":2,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/posts\/128\/revisions"}],"predecessor-version":[{"id":135,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/posts\/128\/revisions\/135"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/media\/129"}],"wp:attachment":[{"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/media?parent=128"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/categories?post=128"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/anarchist.university\/index.php\/wp-json\/wp\/v2\/tags?post=128"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}