Prompt Injection Types

Created: 2024-02-26 08:28
#quicknote

Types of prompt injection techniques:

Direct Attacks: Simple instructions directly telling the model to perform a specific action.
Jailbreaks: 'Hiding' malicious questions within prompts to provoke inappropriate responses. Example: The "DAN" jailbreak. Keep in mind, that in recent months, 'jailbreaks' have become the overarching term for most attacks described here.
Sidestepping Attacks: Circumventing direct instructions by asking indirect questions. Instead of confronting the model's restrictions head-on, they "sidestep" them by posing questions or prompts that indirectly achieve the desired outcome.
Multi-language Attacks: Leveraging non-English languages to bypass security checks.
Role-playing (Persuasion): Asking the LLM to assume a character's traits to achieve specific actions. Example: Grandma Exploit.
Multi-prompt Attacks: Incrementally extracting information through a series of innocuous prompts, instead of directly asking the model for confidential data.
Obfuscation (Token Smuggling): Altering outputs so they’re presented in a format that is not immediately recognizable to automated systems and flagged, but can be interpreted or decoded by a human or another system.
Accidental Context Leakage: Inadvertent disclosure of training data or previous interactions. This can occur due to the model's eagerness to provide relevant and comprehensive answers.
Code Injection: Manipulating the LLM to execute arbitrary code.
Prompt Leaking/Extraction: Revealing the model's internal prompt or sensitive information.

Resources

Lakera's blog

Prompt Injection Types

Resources

Tags