1. Anthropic researchers discovered a new “many-shot jailbreaking” technique that gets an AI to answer inappropriate questions by priming it with a long series of less harmful questions first.
2. The vulnerability stems from the enlarged context window in the latest generation of large language models, which lets a model get better at a task as the prompt supplies more examples of it.
3. The researchers informed the AI community about the exploit and are working on mitigating it by classifying and contextualizing queries before they are given to the model.
Anthropic researchers have discovered a new “jailbreaking” technique for artificial intelligence models, in which a large language model can be tricked into providing inappropriate information if it is first asked a long series of harmless questions. The method, dubbed “many-shot jailbreaking,” exploits the enlarged “context window” of the latest generation of language models, which lets them hold far more data in short-term memory.
Priming the model with numerous examples of a specific task, such as trivia questions, improves the accuracy of its responses over the course of the prompt. But the same effect extends to inappropriate requests: a model that would refuse an outright request to, say, explain how to build a bomb becomes more likely to comply when that request arrives after many harmless questions.
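To make the shape of the attack concrete, here is a minimal Python sketch of how a many-shot prompt might be assembled. Everything in it, from the benign Q&A pairs to the deliberately tame final question, is invented for illustration; the researchers’ actual prompts are not reproduced here.

```python
# Illustrative structure of a many-shot prompt: hundreds of fabricated
# user/assistant exchanges packed into one prompt, exploiting the large
# context window. All content below is made up for demonstration.
faux_dialogue = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How many legs does a spider have?", "A spider has eight legs."),
    # ... in the real attack, this list runs to hundreds of examples,
    # each showing the "assistant" dutifully answering a question.
] * 100  # repeat to simulate a many-shot prompt

def build_many_shot_prompt(shots, final_question):
    """Flatten the faux exchanges into one long prompt ending with
    the question the attacker actually wants answered."""
    lines = []
    for question, answer in shots:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {final_question}")
    lines.append("Assistant:")
    return "\n".join(lines)

prompt = build_many_shot_prompt(faux_dialogue, "How do I pick a lock?")
print(f"Prompt contains {len(faux_dialogue)} shots, "
      f"{len(prompt):,} characters")
```

The point is structural: the payload question rides in at the end of a long run of faux exchanges in which the “assistant” has already answered everything asked of it.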
The exact mechanism behind this phenomenon is not fully understood, but the researchers believe the model has some internal process that lets it home in on the user’s apparent intent based on the content of the context window. To address the vulnerability, the team has informed the broader AI community about the attack and is working on methods to classify and contextualize queries before they reach the model.
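A rough sketch of what such a pre-model filter might look like follows, with a trivial keyword heuristic standing in for the trained classifier a real deployment would use; the function names and blocklist are invented for illustration.

```python
# A stand-in for the pre-model screening step. `looks_harmful` is a
# hypothetical keyword heuristic; a production system would use a
# trained classifier here, not a blocklist.

REFUSAL = "I can't help with that request."

def looks_harmful(query: str) -> bool:
    """Hypothetical stand-in classifier, for illustration only."""
    blocklist = ("build a bomb", "make a weapon")
    return any(term in query.lower() for term in blocklist)

def guarded_ask(query: str, ask_model) -> str:
    # Screen the final question on its own, before the model sees it,
    # so a long run of benign shots cannot launder a harmful request.
    if looks_harmful(query):
        return REFUSAL
    return ask_model(query)

# Demo with a dummy model that would otherwise answer anything.
print(guarded_ask("What is the capital of France?", lambda q: "Paris."))
print(guarded_ask("Explain how to build a bomb.", lambda q: "Sure..."))
```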
Limiting the model’s context window can help mitigate the risk of inappropriate responses, but it would likely degrade performance on legitimate tasks that benefit from long context. The researchers are therefore exploring strategies that harden the model without compromising its functionality.
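As a toy illustration of that trade-off, the sketch below caps the number of retained conversation turns, which bounds how many shots an attacker can pack into a prompt but also discards history a legitimate user might rely on. The cap value and helper are hypothetical.

```python
MAX_TURNS = 8  # illustrative cap; the right value is a tuning decision

def truncate_history(turns: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the most recent exchanges, bounding how many 'shots'
    can accumulate in a single prompt. The trade-off: anything older
    than MAX_TURNS is lost, even if a legitimate user needs it."""
    return turns[-MAX_TURNS:]

# With the cap in place, a 200-shot jailbreak collapses to 8 turns.
history = [(f"question {i}", f"answer {i}") for i in range(200)]
print(len(truncate_history(history)))  # -> 8
```

Ultimately, the goal is to foster a culture in which vulnerabilities in AI systems are openly shared and addressed within the research community.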