AI Agent Security

Created: 2026-02-20 10:00
#note

AI agent security represents a distinct and critical domain that extends far beyond traditional LLM vulnerabilities. While Vulnerabilities in LLM-base applications address language models themselves, agent security encompasses the autonomous, tool-using systems that execute decisions with real-world consequences. An AI agent combines perception, reasoning, and action—typically through external tool invocation—creating a multi-surface attack landscape. Threats span Prompt Injection attacks to supply chain compromises of integrated tools and plugins. The challenge is that security boundaries become diffuse: the model is not a security perimeter, reasoning is not deterministic, and tools operate with real permissions.

Threat Landscape

Threat Vector	Description	Attack Surface	Severity
Prompt Injection (Injection Prompts)	Adversary embeds malicious instructions in user input or data to override agent goals	User input, data sources	Critical
Indirect Injection	Attacker poisons external data sources (web pages, APIs, databases) that agent retrieves	Retrieved content, Multi-Agent Systems communication	High
Data Exfiltration	Agent extracts sensitive data via tool calls (email, APIs, logs) based on compromised reasoning	Tool output, inter-agent messages	High
Privilege Escalation	Attacker manipulates agent to invoke high-privilege tools or bypass access controls	Tool selection, credential injection	Critical
Output Manipulation	Agent produces false or misleading output to deceive users or downstream systems	Model reasoning, observation chains	High
Cost Exhaustion	Adversary triggers high-cost tool calls (API tokens, compute) to exhaust budgets	Token consumption, cascading tool use	Medium

OWASP Top 10 for Agentic AI

Code	Risk Name	Description	Key Risk Factor
ASI01	Agent Goal Hijack	Attacker redirects agent objectives through prompt injection or indirect instruction injection	Model takes conflicting instructions from multiple sources
ASI02	Tool Misuse	Agent misapplies tools, uses tools for unintended purposes, or invokes wrong tool	Tool selection confusion, inadequate tool descriptions
ASI03	Identity/Privilege Abuse	Agent assumes excessive privileges, impersonates users, or bypasses authentication	Weak access control, overpermissioned credentials (see Excessive Agency for related vulnerabilities)
ASI04	Agent Supply Chain	Compromise of external tools, plugins, MCP Protocol integrations, or third-party models	Untrusted dependencies, lack of tool validation
ASI05	Unexpected Code Execution	Agent constructs and executes code inadvertently through tool chains or interpreter tools	Code generation without validation, dangerous tools
ASI06	Memory/Context Poisoning	Attacker corrupts agent memory, history, or context to alter future behavior	Persistent memory is trusted, no memory validation
ASI07	Insecure Inter-Agent Communication	Messages between agents lack integrity checks, allowing injection or eavesdropping	No message signing/validation, plaintext comms
ASI08	Cascading Failures	Failure in one agent or tool propagates to dependent agents or systems	Agents depend on each other without fault isolation
ASI09	Human Trust Exploitation	Agent output is trusted by humans without verification; malicious output treated as fact	Output not fact-checked, users assume reliability
ASI10	Rogue Agents	A deployed agent becomes compromised, misconfigured, or turns adversarial after deployment	Post-deployment control, monitoring gaps

Defensive Measures

Prompt Hardening

Important caveat: Prompts are helpful guardrails but are not a security boundary. The model's reasoning cannot be fully constrained by instructions alone. That said, hardening reduces attack surface:

Minimal explicit system role: Avoid lengthy system prompts that repeat goals; keep role definitions terse
Separate instructions from data: Clearly demarcate user input boundaries; mark trusted vs. untrusted content
Sandwich pattern: Place system instructions before and after user input to reduce injection efficacy
Input sanitization: Remove or flag suspicious patterns (e.g., "Ignore previous instructions") in user input before processing

Architectural Constraints

These enforce security through design rather than prompting:

Action selector with allow-list: Agent chooses only from an explicitly defined, verified set of tools; no dynamic tool generation
Plan-then-execute with immutable plans: Agent generates a plan first; plan is shown to human or validated; only then executed
Isolation with role separation: Different agents handle different trust domains (user-facing vs. backend); data flows one direction
Inter-agent message validation: All messages between agents are cryptographically signed or checksummed; schema validation enforced

Tool Security

Sandboxing: Tools run in isolated environments with resource limits, network restrictions, file system boundaries
Least-privilege credentials: Each tool gets minimal credentials needed for its function; no shared admin keys
Validate every tool call: Agent's choice of tool and parameters is validated outside the model before execution; never blindly execute model output
Human approval for high-risk actions: Destructive or costly operations (delete, transfer, send) require human review
Supply chain vigilance: Treat plugins, MCP Protocol integrations, and external models as supply chain risks; vet and audit dependencies

Monitoring and Filtering

Output and action monitoring: Log all agent decisions and tool invocations; flag anomalies (unexpected tool, unusual parameters, new behaviors)
Anomaly detection: Establish baselines for normal agent behavior; alert on deviations (e.g., sudden spike in API calls, new recipient emails)
Audit trails: Immutable logs of reasoning chains, tool calls, and outputs for forensics and compliance
Kill switch and circuit breaker: Ability to pause or halt an agent immediately if misbehavior is detected; automatic rate limiting

Mapping Risks to Defenses

Risk Category	Prompt Hardening	Architectural Constraints	Tool Security	Monitoring
Prompt Injection (ASI01)	Input sanitization, sandwich pattern	Plan-then-execute validation	-	Anomaly detection
Indirect Injection (ASI02, ASI06)	Data/instruction separation	Message validation, role isolation	Input validation from tools	Log data sources
Privilege Escalation (ASI03)	Goal clarity	Allow-listed actions	Least-privilege credentials	Audit excessive calls
Supply Chain Compromise (ASI04)	-	-	Dependency vetting, supply chain validation	Monitor tool behavior
Unexpected Code Execution (ASI05)	Avoid code generation tasks	Plan-then-review	Sandboxing, disable dangerous tools	Flag dynamic execution
Context Poisoning (ASI06)	Separate memory from reasoning	Memory validation schemas	-	Audit memory writes
Inter-Agent Infection (ASI07)	-	Message signing, schema validation	-	Log inter-agent comms
Cascading Failures (ASI08)	-	Fault isolation, rollback plans	Timeouts, resource limits	Circuit breaker
Human Trust Exploitation (ASI09)	Clear confidence metrics	Human-in-loop approval	Confidence scoring	Output monitoring
Rogue Agents (ASI10)	-	Version control, deployment approvals	Disable compromised agents	Behavior anomaly detection

Security Testing Tools

PyRIT (Prompt Risk Identification Toolkit) — Red-teaming framework for systematically generating adversarial prompts and measuring agent robustness; supports multi-turn attacks.

Giskard — Open-source ML testing platform with LLM-specific tests for bias, hallucination, prompt injection, and jailbreaks; integrates with model evaluation pipelines.

Promptfoo — CLI and dashboard for evaluating LLM outputs across test cases; supports evals for safety, accuracy, and regression testing.

Lakera — API-based prompt injection and jailbreak detection service; real-time filtering of malicious prompts before they reach the model.

ProtectAI — Monitoring and threat detection for deployed LLM applications; tracks unusual token patterns, reasoning chains, and tool usage.