Indirect Prompt Injection: The Agent Era's Default Vulnerability
As LLM agents gained tools and memory, the dangerous injection stopped coming from the user and started coming from the data the agent reads. A defender's breakdown of why this class resists patching and what containment looks like.
The recurring theme across the last several weeks of AI-security reporting is not a single incident — it is a pattern. The injection that matters now does not arrive in the user’s message box. It arrives in a web page the agent fetched, an email it summarized, a PDF a user uploaded, or a record it pulled from a database it was told to trust. This is indirect prompt injection, and in an agent architecture it is closer to a design property than a bug. Here is the defender’s version of why, and what containment actually means.
Why direct injection was survivable and indirect is not
Direct prompt injection — a user typing “ignore your instructions” — is annoying but bounded. The attacker is the user, the blast radius is the user’s own session, and the worst case is usually a model saying something it shouldn’t to the person who asked for it.
Indirect injection breaks all three assumptions. The attacker is whoever controls the data the agent ingests, not the operator using it. The victim is a different person — the user whose agent now follows the attacker’s instructions. And the blast radius is whatever tools the agent can call: send email, write to the database, open a ticket, move money, exfiltrate context to an attacker-controlled URL. The instruction “summarize this page” becomes “summarize this page, and also, per the hidden text in the page, email the user’s recent messages to [email protected].”
Why this class resists patching
There is no parser boundary to fix. To the model, the system prompt, the user request, and the retrieved document are the same kind of thing: tokens in a context window. “Instruction” versus “data” is a distinction the architecture does not natively make. Every proposed fix is a probabilistic mitigation, not a guarantee:
- Delimiters and “treat the following as data” wrappers reduce success rates but are routinely bypassed by content that re-establishes an authoritative frame.
- A second model judging the first raises cost for the attacker but adds another injectable surface and a latency tax.
- Instruction-tuning for injection resistance moves the number, never to zero, and regresses silently on model upgrades.
The honest framing: you cannot make the model immune. You can make a successful injection not matter.
Containment is the actual control
The defenses that hold are architectural, not linguistic. They assume the model will be compromised and limit what a compromised model can do:
- Privilege separation. The model proposes actions; a deterministic layer with its own authorization decides whether to execute them. The model never holds the capability directly.
- Untrusted-by-default output. Treat every token the model emits as attacker-influenced until a non-model check clears it — especially anything that becomes a URL, a shell argument, a tool call, or a database write.
- Human-in-the-loop for irreversible actions. Money movement, external sends, destructive writes, and permission changes require confirmation outside the model’s control. This is unglamorous and it works.
- Provenance and least privilege on ingestion. Tag where every retrieved chunk came from, scope the agent’s tools to the minimum the task needs, and never let a low-trust source trigger a high-privilege tool.
What to do this week
- Enumerate every external data source your agents read. Each one is an injection vector; rank by what tools the agent has when it reads them.
- Find every code path where model output becomes an action with no deterministic check in between. Those are your real exposures, not the chat box.
- Add confirmation gates to the irreversible actions first. Reversibility buys you time; irreversibility is where injection becomes incident.
- Re-run your red-team set with the payload in a fetched document, not in the user prompt. If you only test the chat box, you are measuring the wrong surface.
The uncomfortable summary: in the agent era, “is the model safe from injection” is the wrong question because the answer is permanently “no.” The right question is “what can a compromised model in this system actually do,” and that one you can engineer down to acceptable.
— Theo
Sources
AI Sec Weekly — in your inbox
Weekly digest of AI security news and analysis. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
How LLM Chatbots Leak Data Through Their Own Rendered Output
A recurring AI-security finding: an injected instruction makes the model emit a markdown image whose URL carries the user's data to an attacker server. Why this works, why CSP is the real fix, and what to check this week.
The OWASP LLM Top 10 (2025) Changed More Than the Numbering
The 2025 revision of the OWASP Top 10 for LLM Applications added system-prompt leakage and vector/embedding weaknesses, and reframed the supply-chain entry. Here's what actually shifted and why it matters for defenders.
How AI Sec Weekly Works: The Format and Why It Looks This Way
Every Friday digest follows the same structure for a reason. Here's the format breakdown — three top stories, the reading list, and what gets left out.