AI Agent Security

Created: 2026-02-20 10:00
#note

AI agent security represents a distinct and critical domain that extends far beyond traditional LLM vulnerabilities. While Vulnerabilities in LLM-base applications address language models themselves, agent security encompasses the autonomous, tool-using systems that execute decisions with real-world consequences. An AI agent combines perception, reasoning, and action—typically through external tool invocation—creating a multi-surface attack landscape. Threats span Prompt Injection attacks to supply chain compromises of integrated tools and plugins. The challenge is that security boundaries become diffuse: the model is not a security perimeter, reasoning is not deterministic, and tools operate with real permissions.

Threat Landscape

Threat VectorDescriptionAttack SurfaceSeverity
Prompt Injection (Injection Prompts)Adversary embeds malicious instructions in user input or data to override agent goalsUser input, data sourcesCritical
Indirect InjectionAttacker poisons external data sources (web pages, APIs, databases) that agent retrievesRetrieved content, Multi-Agent Systems communicationHigh
Data ExfiltrationAgent extracts sensitive data via tool calls (email, APIs, logs) based on compromised reasoningTool output, inter-agent messagesHigh
Privilege EscalationAttacker manipulates agent to invoke high-privilege tools or bypass access controlsTool selection, credential injectionCritical
Output ManipulationAgent produces false or misleading output to deceive users or downstream systemsModel reasoning, observation chainsHigh
Cost ExhaustionAdversary triggers high-cost tool calls (API tokens, compute) to exhaust budgetsToken consumption, cascading tool useMedium

OWASP Top 10 for Agentic AI

CodeRisk NameDescriptionKey Risk Factor
ASI01Agent Goal HijackAttacker redirects agent objectives through prompt injection or indirect instruction injectionModel takes conflicting instructions from multiple sources
ASI02Tool MisuseAgent misapplies tools, uses tools for unintended purposes, or invokes wrong toolTool selection confusion, inadequate tool descriptions
ASI03Identity/Privilege AbuseAgent assumes excessive privileges, impersonates users, or bypasses authenticationWeak access control, overpermissioned credentials (see Excessive Agency for related vulnerabilities)
ASI04Agent Supply ChainCompromise of external tools, plugins, MCP Protocol integrations, or third-party modelsUntrusted dependencies, lack of tool validation
ASI05Unexpected Code ExecutionAgent constructs and executes code inadvertently through tool chains or interpreter toolsCode generation without validation, dangerous tools
ASI06Memory/Context PoisoningAttacker corrupts agent memory, history, or context to alter future behaviorPersistent memory is trusted, no memory validation
ASI07Insecure Inter-Agent CommunicationMessages between agents lack integrity checks, allowing injection or eavesdroppingNo message signing/validation, plaintext comms
ASI08Cascading FailuresFailure in one agent or tool propagates to dependent agents or systemsAgents depend on each other without fault isolation
ASI09Human Trust ExploitationAgent output is trusted by humans without verification; malicious output treated as factOutput not fact-checked, users assume reliability
ASI10Rogue AgentsA deployed agent becomes compromised, misconfigured, or turns adversarial after deploymentPost-deployment control, monitoring gaps

Defensive Measures

Prompt Hardening

Important caveat: Prompts are helpful guardrails but are not a security boundary. The model's reasoning cannot be fully constrained by instructions alone. That said, hardening reduces attack surface:

  • Minimal explicit system role: Avoid lengthy system prompts that repeat goals; keep role definitions terse
  • Separate instructions from data: Clearly demarcate user input boundaries; mark trusted vs. untrusted content
  • Sandwich pattern: Place system instructions before and after user input to reduce injection efficacy
  • Input sanitization: Remove or flag suspicious patterns (e.g., "Ignore previous instructions") in user input before processing

Architectural Constraints

These enforce security through design rather than prompting:

  • Action selector with allow-list: Agent chooses only from an explicitly defined, verified set of tools; no dynamic tool generation
  • Plan-then-execute with immutable plans: Agent generates a plan first; plan is shown to human or validated; only then executed
  • Isolation with role separation: Different agents handle different trust domains (user-facing vs. backend); data flows one direction
  • Inter-agent message validation: All messages between agents are cryptographically signed or checksummed; schema validation enforced

Tool Security

  • Sandboxing: Tools run in isolated environments with resource limits, network restrictions, file system boundaries
  • Least-privilege credentials: Each tool gets minimal credentials needed for its function; no shared admin keys
  • Validate every tool call: Agent's choice of tool and parameters is validated outside the model before execution; never blindly execute model output
  • Human approval for high-risk actions: Destructive or costly operations (delete, transfer, send) require human review
  • Supply chain vigilance: Treat plugins, MCP Protocol integrations, and external models as supply chain risks; vet and audit dependencies

Monitoring and Filtering

  • Output and action monitoring: Log all agent decisions and tool invocations; flag anomalies (unexpected tool, unusual parameters, new behaviors)
  • Anomaly detection: Establish baselines for normal agent behavior; alert on deviations (e.g., sudden spike in API calls, new recipient emails)
  • Audit trails: Immutable logs of reasoning chains, tool calls, and outputs for forensics and compliance
  • Kill switch and circuit breaker: Ability to pause or halt an agent immediately if misbehavior is detected; automatic rate limiting

Mapping Risks to Defenses

Risk CategoryPrompt HardeningArchitectural ConstraintsTool SecurityMonitoring
Prompt Injection (ASI01)Input sanitization, sandwich patternPlan-then-execute validation-Anomaly detection
Indirect Injection (ASI02, ASI06)Data/instruction separationMessage validation, role isolationInput validation from toolsLog data sources
Privilege Escalation (ASI03)Goal clarityAllow-listed actionsLeast-privilege credentialsAudit excessive calls
Supply Chain Compromise (ASI04)--Dependency vetting, supply chain validationMonitor tool behavior
Unexpected Code Execution (ASI05)Avoid code generation tasksPlan-then-reviewSandboxing, disable dangerous toolsFlag dynamic execution
Context Poisoning (ASI06)Separate memory from reasoningMemory validation schemas-Audit memory writes
Inter-Agent Infection (ASI07)-Message signing, schema validation-Log inter-agent comms
Cascading Failures (ASI08)-Fault isolation, rollback plansTimeouts, resource limitsCircuit breaker
Human Trust Exploitation (ASI09)Clear confidence metricsHuman-in-loop approvalConfidence scoringOutput monitoring
Rogue Agents (ASI10)-Version control, deployment approvalsDisable compromised agentsBehavior anomaly detection

Security Testing Tools

PyRIT (Prompt Risk Identification Toolkit) — Red-teaming framework for systematically generating adversarial prompts and measuring agent robustness; supports multi-turn attacks.

Giskard — Open-source ML testing platform with LLM-specific tests for bias, hallucination, prompt injection, and jailbreaks; integrates with model evaluation pipelines.

Promptfoo — CLI and dashboard for evaluating LLM outputs across test cases; supports evals for safety, accuracy, and regression testing.

Lakera — API-based prompt injection and jailbreak detection service; real-time filtering of malicious prompts before they reach the model.

ProtectAI — Monitoring and threat detection for deployed LLM applications; tracks unusual token patterns, reasoning chains, and tool usage.

References

Tags: #aisecurity #ai_agents #llm #cybersecurity #owasp