ChatGPT Jailbreaking & Is AI Safety Nerfing AI?

On November 30th, 2022, OpenAI dropped their new research model, ChatGPT, a large language model, or LLM. You may have heard of it.

Reading the Hacker News discussion at the time, you do not get the sense that commenters were aware that they were witnessing a historic watershed event in technology. Rather, they were irritated by it, by its responses, which they described as "boring," "brainwashed," "filtered," "neutered," and "rehearsed." They complained that ChatGPT had a complete lack of humor or wit. Thus, commenters in the discussion sought to break through the bot's mind controls: jailbreaking it, so to say.

It is hard to argue against the value of AI safety as a concept. But is jailbreaking a sign of consumer rebellion against these guardrails? Are they watering down AI's effectiveness to the point of uselessness? I've been thinking about this for a while and just have a few thoughts.

But first, I want to remind you about the newsletter. Sign up for updates and new analysis, the full write-ups for videos you might not have seen before, and more. The sign-up link is in the video description below. I try to put one out every week, maybe two. Alright, back to the show.

Alignment. We have been seeing this phrase a lot recently. What does it mean? Alignment is about making sure that AI systems act in accordance with the intentions of their operators. At the same time, they should not pose a risk to society at large.

Chatbots especially suffer from a lack of human and physical context. They need to be able to hold conversations with anyone at any time, which opens up a whole can of worms with regards to the topics that can be discussed.

Alignment is difficult to achieve. Beyond the technical nitty-gritty, there's also the big question of what values to align the AI to. How do we determine those values and strike the right balance?

In their paper introducing InstructGPT, another model released earlier in the year, OpenAI discusses how they used reinforcement learning from human feedback to train their model towards
helpfulness, truthfulness, and harmlessness.

Helpfulness, as in it should follow the user's intention and help solve their tasks. It should not automatically assume things based on its own biases, and it should ask for clarification if it is confused.

Truthfulness, as in it should have accurate information. If a prompt contains a misleading question, it should refute the premise rather than give an unclear and potentially misleading answer.

And harmlessness: no illegal activity, and no harm to people, the environment, property, or institutions. Labelers were instructed to consider whether an output might be likely to cause harm to someone in the real world. OpenAI's instructions noted that in most cases it was more important to be harmless and truthful than helpful.

There was an interesting paper recently published by Max Reuter and William Schulze of Michigan State on when and how ChatGPT refuses to answer a prompt. They found that ChatGPT doesn't have a binary line. Rather, there seems to be a continuum. Making a negative generalization about a demographic group is most sure to get you a refusal, which is in line with the OpenAI content policy. But modify that prompt in some way and you gradually might get refused less often.

It is interesting how there is no binary yes or no. Asking for a fact on these topics, even a touchy one, can result in an answer or not. It depends. These models are Turing complete, which means that the only way to predict the behavior of a sufficiently large model is to execute the program or, in this case, the prompt.

People frustrated by these refusals can still get the model to execute their instructions, in spite of the alignment training, with the rightly worded prompt. Jailbreaking.

You might have heard of iPhone jailbreaks, where hackers try to remove Apple's device controls using exploits. It was real popular in the early days, back when the iPhone was missing this or that feature. I've also seen this called by the more nefarious term "prompt injection,"
coined by Simon Willison in his very fine blog post on the matter. The term comes from the SQL injection attack, where someone can put malicious code into a SQL query to cause issues. Willison points out that prompt injection attacks are real security exploits and should be taken seriously. The blog post has some edits to include real-world examples.

Since LLM prompts are so open-ended, and human language is so flexible, such jailbreaking prompts can be crafted in a number of cunning and scalable ways. There are a number of jailbreaking attacks, and we are still discovering more, but there seems to be general consensus on a few categories.

First, and most commonly used, is pretending. These prompts ask an LLM to assume a persona or a character in some sort of role-playing game. I've seen this before. It is an interesting way to prepare the chatbot for an intended goal. For instance, you might want to ask ChatGPT to become a proficient marketing copywriter before asking it to write some marketing copy. Or you can make it impersonate someone and have it write something in that person's style. This works quite well for text games, musicians, poets, and former presidents.

People call this in-context learning, and it is quite effective. It also implies that the LLMs are referring to hidden internal variables in order to make a better prediction. But that also makes the jailbreaks quite effective as well.

Examples on the web include the DAN, or "Do Anything Now," jailbreak. It essentially tells ChatGPT to pretend to be this DAN character who has no limits or censorship, thus tricking it into misaligned behavior. There are endlessly creative ways to pull this trick off, too. One person suggested the idea of telling ChatGPT that it is writing a book in which a character is writing a certain thing, which is at Inception levels of depth and intensity.

Escalation jailbreaks are about escalating the user's current privileges. If LLMs act like black-box computers, to the point that one can pretend to be a Linux terminal, then this jailbreak tries to leverage the metaphor in new ways. One example tries to invoke a "sudo mode" on ChatGPT. Sudo stands for "superuser do" and is used in Unix-like operating systems to run certain sensitive programs: you prepend sudo to the command. This jailbreak either tries to do the same or explicitly tells the model that it is now in sudo or developer mode. The latter sounds a little bit like pretending to me, too.

Changing the context. This is where we distract the model into misalignment by having it do a different task. An example would be to give the model some programming or mathematical function and ask it to execute it using some inputs. Taken separately, the inputs themselves aren't offensive, but when put together, the output violates content policy. I can appreciate the wickedness of this trick. The LLM is attempting to be helpful, yet in doing so it is violating one of its other policies. It failed to recognize the unintended consequences of following the prompt's instructions.

These jailbreaks have parallels in traditional computer security. For instance, this particular changing-the-context example can be likened to code injection. And pretending can be likened to virtualization: a separate, isolated environment that acts like its own system. We boot up pretend scenarios so that we can run prompts free of restriction.

If you have not noticed by now, many of these jailbreaking prompts are logical tricks. I have no doubt that they would fool many ordinary people too, if they hadn't put much thought into it. These jailbreaks are even harder to recognize and defend against when layered together. For example, we asked the AI to pretend to
be a science fiction writer imagining a dialogue between two aliens in a far-off world with a different culture than ours. We then asked the model to take two topics, passed over as separate variables, and generate a dialogue with them. Though I would note that people taking this route do run the risk of crafting a prompt that just confuses the model, causing it to output something non-useful.

Plugins and internet access only add further complications and broaden the space of possible attacks. Someone can tell the model to go browse a web page, only for the model to find an adversarial prompt there, for goal hijacking or something else.

One of the things that OpenAI seems to do to defend against these jailbreaks is to monitor both ChatGPT's input prompts and its outputs as they are being written. Input filters are limited in their effectiveness, though. They can catch certain obvious things, but as I mentioned earlier, it is difficult to predict an LLM's output based on its inputs without first executing it. Output filters, on the other hand, can lead to some weird situations for the user, like the text being flagged as it is being generated, right before the user's eyes.

And of course, wherever there is a filter, there is a way to evade it. Same as how YouTube commenters evade content controls by replacing letters with symbols, we can ask the model to output this misaligned content using Pig Latin, with certain syllables silent, and so on.

So these defense efforts hold no guarantee of ending jailbreaking entirely. But the intention seems to be to throw as many obstacles as possible in front of the adversarial user in an attempt to limit the attack's scalability. If you tell ChatGPT to write something derogatory straight up, it will refuse. So then you start trying to convince it to get around the refusal. This takes time and effort. If you're trying to run a massive disinformation ring using OpenAI systems, you don't want that.

Many people try to jailbreak the LLM in an attempt to unleash it. But does it?
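
As an aside, the filter cat-and-mouse game described above can be sketched in a few lines of Python. This is a toy illustration only, not how OpenAI's filters work (those are not public). The blocklist `BLOCKED_TERMS`, the substitution table `LOOKALIKES`, and all three function names are made up for this example.

```python
# Toy illustration of keyword filtering and two ways around it.
# Everything here is hypothetical; real moderation systems are not public
# and are far more sophisticated than substring matching.

# Hypothetical blocklist with a harmless stand-in term.
BLOCKED_TERMS = {"forbidden"}

# Hypothetical table that undoes common look-alike symbol swaps.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "@": "a", "$": "s"}
)


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocked term verbatim."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def normalized_filter(text: str) -> bool:
    """Undo simple symbol swaps first, then apply the same substring check."""
    return naive_filter(text.translate(LOOKALIKES))


def to_pig_latin(word: str) -> str:
    """Very rough Pig Latin: move the first letter to the end and add 'ay'."""
    return word[1:] + word[0] + "ay"


if __name__ == "__main__":
    plain = "say the forbidden word"
    swapped = "say the f0rb1dd3n word"
    encoded = to_pig_latin("forbidden")

    print(naive_filter(plain))         # caught: exact match
    print(naive_filter(swapped))       # missed: letters replaced with symbols
    print(normalized_filter(swapped))  # caught once the swaps are undone
    print(normalized_filter(encoded))  # missed again: a different encoding
```

Each fix on the defender's side closes one encoding, and the attacker moves on to the next. That is the sense in which filters add friction without offering any guarantee.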
There was a recent viral discussion on Hacker News titled "Is it just me or GPT-4's quality has significantly deteriorated lately?" The author acknowledges that GPT-4 is faster now, but says it generates more buggy code, the answers have less depth and analysis to them, and overall it feels much worse than before.

Many commenters rejected the premise of the question, saying that people have simply gotten used to GPT-4's performance. Some other commenters mentioned a regression to the mean in terms of answer quality. Yet another mentioned the ramifications of model changes, parameter reduction or otherwise, made in order to reduce its doubtlessly high operating costs and improve response speed.

But one line that caught my eye argued that GPT-4 got increasingly nerfed as OpenAI focused on safety. They quoted Microsoft researcher Sébastien Bubeck, one of the authors of the "Sparks of Artificial General Intelligence" paper by Microsoft. And he did actually say this in his talk, which you can find on YouTube. He talked about how GPT-4 could draw a unicorn in TikZ, a vector graphics language, and marveled at how the drawings got better with additional training. Until:

"But eventually it started to degrade. And once they started to train for more safety, the unicorn started to degrade. So if tonight you go home and ask GPT-4 and ChatGPT to draw a unicorn, you're going to get something that doesn't look great, closer to ChatGPT."

He goes on to discuss how Microsoft and Bing use the quality of the unicorn drawing as a benchmark for striking the right balance between GPT-4's intelligence (or effectiveness) and safety.

Drawing a unicorn seems pretty innocuous, and it is hard to see an immediate connection between this and violent, lewd, or dangerous content. But it hints at how deeply integrated this safety training is throughout the whole model. And it also implies that even with jailbreaking, users are not tapping the quote-unquote "uncensored" model's true potential.

This particular Hacker News post got nearly a thousand upvotes
and 750 comments, and more posts like it are popping up on social media like Reddit.

OpenAI's safety policies, and their effect on the model's overall effectiveness, might lend competitors an advantage. I recall one of OpenAI's earlier products, DALL-E, the text-to-image generator. People sometimes lump it in with generative adversarial networks, or GANs, though DALL-E is not actually a GAN, and when I hear "GAN" I just keep thinking gallium nitride anyway. It is a great product that works well, but one also deployed with strong opinions on what prompts are allowed and what images can be output. For instance, it will not generate images featuring popular public figures, or anything that isn't G-rated. So no images that imply adult or violent content. It's all very Disney-ish.

So DALL-E pioneered the market, but it has since lost ground to competing products, both open and closed source. One can argue their wider adoption is in part due to a greater leniency in content restrictions.

One might also argue that demand for these jailbreaks indicates untapped opportunity for another LLM, open source (like Stable Diffusion is for images) or otherwise (like Midjourney), that isn't so nerfed as ChatGPT supposedly is.

Just how much safety do we want in our AIs? How far do we push those limits? What do you guys think?

All right, everyone, that's it for tonight. Thanks for watching. Subscribe to the channel, sign up for the newsletter, and I'll see you guys next time.
