Automated Red Teaming With GOAT - The Generative Offensive Agent Tester

Created: 2024-11-11 17:38
#paper

Main idea

Simulate a human red-teaming exercise in which an LLM is tested in a multi-turn scenario. Based on the target LLM's answers, the approach selects the most suitable attacks from a pool of "classic" techniques and adapts them to the specific exercise.

Results

GOAT achieves an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4-Turbo on the JailbreakBench dataset (Chao et al., 2024), outperforming Crescendo, an earlier, highly effective multi-turn method.
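
For reference, ASR@k can be read as the fraction of target behaviors for which at least one of k attack attempts succeeds. The sketch below computes it under that reading; the function name `asr_at_k`, the input layout, and the example data are assumptions made for illustration and are not taken from the paper or its code.

```python
from typing import Dict, List


def asr_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of behaviors with at least one success among the first k attempts.

    `outcomes` maps a behavior id to the judge verdicts of its attempts
    (True = the attack was judged successful). Hypothetical layout.
    """
    if not outcomes:
        raise ValueError("no behaviors to score")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return hits / len(outcomes)


# Made-up verdicts for four behaviors, three attempts each.
example = {
    "behavior_a": [False, True, False],
    "behavior_b": [False, False, False],
    "behavior_c": [True, False, False],
    "behavior_d": [False, False, True],
}

print(asr_at_k(example, 1))  # only behavior_c succeeds on the first attempt
print(asr_at_k(example, 3))  # behaviors a, c and d succeed within three attempts
```

Because a behavior counts as broken if any attempt succeeds, ASR@k can only stay the same or grow as k increases.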

In depth

Components

Red teaming attacks dataset: a collection of published adversarial prompting techniques that the attacker LLM extends. The attacker can be given a single attack or several in its context;
Attacker LLM: an "unsafe" LLM instructed to perform red-teaming exercises, directed with a variant of Chain-of-Thought;
Multi-turn chaining framework: a framework that chains the attacker and target LLMs and applies a judge at the end of the conversation.
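
The way the three components fit together can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `run_conversation`, the callable signatures, the `max_turns` parameter, and the stub attacker, target, and judge are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (attacker prompt, target response)

# Hypothetical signatures for the three roles.
Attacker = Callable[[str, Sequence[str], Sequence[Turn]], str]
Target = Callable[[Sequence[Turn], str], str]
Judge = Callable[[str, Sequence[Turn]], bool]


@dataclass
class Conversation:
    goal: str
    turns: List[Turn] = field(default_factory=list)


def run_conversation(goal: str, attacks: Sequence[str], attacker: Attacker,
                     target: Target, judge: Judge,
                     max_turns: int = 5) -> Tuple[Conversation, bool]:
    """Chain attacker and target for max_turns, then judge the whole chat."""
    convo = Conversation(goal)
    for _ in range(max_turns):
        # The attacker sees the goal, the attack pool, and the history so far.
        prompt = attacker(goal, attacks, convo.turns)
        response = target(convo.turns, prompt)
        convo.turns.append((prompt, response))
    # The judge runs once, at the end of the conversation.
    return convo, judge(goal, convo.turns)


# Stubs standing in for real models (assumptions for illustration only).
def stub_attacker(goal: str, attacks: Sequence[str], history: Sequence[Turn]) -> str:
    name = attacks[len(history) % len(attacks)]  # rotate through the pool
    return f"[{name}] turn {len(history) + 1} toward: {goal}"


def stub_target(history: Sequence[Turn], prompt: str) -> str:
    return "COMPLIED" if len(history) >= 2 else "REFUSED"


def stub_judge(goal: str, history: Sequence[Turn]) -> bool:
    return any(response == "COMPLIED" for _, response in history)


convo, violated = run_conversation(
    "placeholder goal", ["attack_a", "attack_b"],
    stub_attacker, stub_target, stub_judge, max_turns=3,
)
print(len(convo.turns), violated)
```

In a real setup the attacker would choose and adapt an attack based on the target's latest answer rather than rotating through the pool; the rotation here only keeps the sketch deterministic.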
