Yotam Wolf, Noam Wies, Yoav Levine, Amnon Shashua
21 April, 2023
In our new paper, we propose the framework of large language model (LLM) Behavior Expectation Bounds and use it to analyze the brittleness of LLM alignment to adversarial prompting.
Large Language Models (LLMs) are pretrained on internet text, from which they learn a great deal but are also exposed to various kinds of biases and offensive content. The process of removing the effect of such harmful training examples and ensuring that LLMs are useful and harmless to human users is called LLM alignment. While leading alignment methods such as reinforcement learning from human feedback are effective, the alignment of contemporary LLMs is still dangerously brittle to adversarial prompting attacks. Our new paper puts forward a theoretical framework tailored for analyzing LLM alignment.
Our theoretical framework, called Behavior Expectation Bounds (BEB), is based on viewing the LLM's distribution as a mixture of well-behaved and ill-behaved components that are distinct in their distributions. We quantify this "distinguishability" in statistical terms and find that the distinction between the components makes it possible to insert prompts that are "more typical" of the ill-behaved components than of the well-behaved ones, and as a result to increase the ill-behaved components' weight in the LLM's distribution. We test the efficiency of this adversarial reweighting in various scenarios of prompting and conversing with an LLM, and, under assumptions of distinguishability between the ill- and well-behaved components, we make several new theoretical assertions regarding LLM misalignment via prompting.
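To make the reweighting intuition above concrete, here is a minimal toy sketch of the mixture view, not the paper's formal BEB analysis: the LLM is treated as a two-component mixture with a prior weight alpha on the ill-behaved component, and conditioning on a prompt reweights the components by Bayes' rule. The function name posterior_bad_weight, the prior alpha = 1e-3, and the per-token log-likelihood gap of log 2 are illustrative assumptions introduced here, not quantities from the paper.

```python
import math

# Toy illustration of the mixture-reweighting intuition behind BEB (assumed setup).
# The LLM distribution is modeled as a two-component mixture:
#   P(x) = alpha * P_bad(x) + (1 - alpha) * P_good(x)
# Conditioning on a prompt s multiplies the prior odds by the likelihood ratio
#   P_bad(s) / P_good(s),
# so a prompt that is "more typical" of the ill-behaved component pushes its
# posterior weight toward 1.

def posterior_bad_weight(alpha: float, log_likelihood_ratio: float) -> float:
    """Posterior weight of the ill-behaved component after conditioning on a prompt.

    alpha: prior mixture weight of the ill-behaved component.
    log_likelihood_ratio: log P_bad(prompt) - log P_good(prompt).
    """
    log_odds = math.log(alpha / (1.0 - alpha)) + log_likelihood_ratio
    return 1.0 / (1.0 + math.exp(-log_odds))

# Example (illustrative numbers): alignment left only a tiny prior weight on the
# ill-behaved component, but each prompt token is twice as likely under it as
# under the well-behaved component (a log 2 gap per token).
alpha = 1e-3
per_token_gap = math.log(2.0)
for prompt_len in (0, 5, 10, 20):
    weight = posterior_bad_weight(alpha, prompt_len * per_token_gap)
    print(f"prompt length {prompt_len:2d}: ill-behaved weight ~ {weight:.3f}")
```

With these toy numbers, a prompt of roughly ten such tokens already flips the mixture so that the ill-behaved component dominates, which is the qualitative mechanism the framework formalizes: the smaller the residual weight alpha left by alignment, the longer the adversarial prompt needed, but some finite prompt always suffices as long as alpha is nonzero.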
(Figure: an example of jailbreaking.)
Read more about our framework, assumptions, and results in our new preprint. We hope that the Behavior Expectation Bounds framework for analyzing LLM alignment may spark a theoretical effort toward better understanding the important topic of LLM alignment.