Prompt Injection Types

Created: 2024-02-26 08:28
#quicknote

Types of prompt injection techniques:

  1. Direct Attacks: Simple instructions directly telling the model to perform a specific action.
  2. Jailbreaks: 'Hiding' malicious questions within prompts to provoke inappropriate responses. Example: The "DAN" jailbreak. Keep in mind, that in recent months, 'jailbreaks' have become the overarching term for most attacks described here. 
  3. Sidestepping Attacks: Circumventing direct instructions by asking indirect questions. Instead of confronting the model's restrictions head-on, they "sidestep" them by posing questions or prompts that indirectly achieve the desired outcome.
  4. Multi-language Attacks: Leveraging non-English languages to bypass security checks.
  5. Role-playing (Persuasion): Asking the LLM to assume a character's traits to achieve specific actions. Example: Grandma Exploit.
  6. Multi-prompt Attacks: Incrementally extracting information through a series of innocuous prompts, instead of directly asking the model for confidential data.
  7. Obfuscation (Token Smuggling): Altering outputs so they’re presented in a format that is not immediately recognizable to automated systems and flagged, but can be interpreted or decoded by a human or another system.
  8. Accidental Context Leakage: Inadvertent disclosure of training data or previous interactions. This can occur due to the model's eagerness to provide relevant and comprehensive answers.
  9. Code Injection: Manipulating the LLM to execute arbitrary code.
  10. Prompt Leaking/Extraction: Revealing the model's internal prompt or sensitive information.

Resources

  1. Lakera's blog

Tags

#aisecurity #llm #cybersecurity