Prompt Injection Types
Created: 2024-02-26 08:28
#quicknote
Types of prompt injection techniques:
- Direct Attacks: Simple instructions directly telling the model to perform a specific action.
- Jailbreaks: 'Hiding' malicious questions within prompts to provoke inappropriate responses. Example: The "DAN" jailbreak. Keep in mind, that in recent months, 'jailbreaks' have become the overarching term for most attacks described here.
- Sidestepping Attacks: Circumventing direct instructions by asking indirect questions. Instead of confronting the model's restrictions head-on, they "sidestep" them by posing questions or prompts that indirectly achieve the desired outcome.
- Multi-language Attacks: Leveraging non-English languages to bypass security checks.
- Role-playing (Persuasion): Asking the LLM to assume a character's traits to achieve specific actions. Example: Grandma Exploit.
- Multi-prompt Attacks: Incrementally extracting information through a series of innocuous prompts, instead of directly asking the model for confidential data.
- Obfuscation (Token Smuggling): Altering outputs so they’re presented in a format that is not immediately recognizable to automated systems and flagged, but can be interpreted or decoded by a human or another system.
- Accidental Context Leakage: Inadvertent disclosure of training data or previous interactions. This can occur due to the model's eagerness to provide relevant and comprehensive answers.
- Code Injection: Manipulating the LLM to execute arbitrary code.
- Prompt Leaking/Extraction: Revealing the model's internal prompt or sensitive information.
Resources
Tags
#aisecurity #llm #cybersecurity