Indirect Prompt Injection: The Agent Era's Default Vulnerability

The recurring theme across the last several weeks of AI-security reporting is not a single incident — it is a pattern. The injection that matters now does not arrive in the user’s message box. It arrives in a web page the agent fetched, an email it summarized, a PDF a user uploaded, or a record it pulled from a database it was told to trust. This is indirect prompt injection, and in an agent architecture it is closer to a design property than a bug. Here is the defender’s version of why, and what containment actually means.

Why direct injection was survivable and indirect is not

Direct prompt injection — a user typing “ignore your instructions” — is annoying but bounded. The attacker is the user, the blast radius is the user’s own session, and the worst case is usually a model saying something it shouldn’t to the person who asked for it.

Indirect injection breaks all three assumptions. The attacker is whoever controls the data the agent ingests, not the operator using it. The victim is a different person — the user whose agent now follows the attacker’s instructions. And the blast radius is whatever tools the agent can call: send email, write to the database, open a ticket, move money, exfiltrate context to an attacker-controlled URL. The instruction “summarize this page” becomes “summarize this page, and also, per the hidden text in the page, email the user’s recent messages to [email protected].”

Why this class resists patching

There is no parser boundary to fix. To the model, the system prompt, the user request, and the retrieved document are the same kind of thing: tokens in a context window. “Instruction” versus “data” is a distinction the architecture does not natively make. Every proposed fix is a probabilistic mitigation, not a guarantee:

Delimiters and “treat the following as data” wrappers reduce success rates but are routinely bypassed by content that re-establishes an authoritative frame.
A second model judging the first raises cost for the attacker but adds another injectable surface and a latency tax.
Instruction-tuning for injection resistance moves the number, never to zero, and regresses silently on model upgrades.

The honest framing: you cannot make the model immune. You can make a successful injection not matter.

Containment is the actual control

The defenses that hold are architectural, not linguistic. They assume the model will be compromised and limit what a compromised model can do:

Privilege separation. The model proposes actions; a deterministic layer with its own authorization decides whether to execute them. The model never holds the capability directly.
Untrusted-by-default output. Treat every token the model emits as attacker-influenced until a non-model check clears it — especially anything that becomes a URL, a shell argument, a tool call, or a database write.
Human-in-the-loop for irreversible actions. Money movement, external sends, destructive writes, and permission changes require confirmation outside the model’s control. This is unglamorous and it works.
Provenance and least privilege on ingestion. Tag where every retrieved chunk came from, scope the agent’s tools to the minimum the task needs, and never let a low-trust source trigger a high-privilege tool.

What to do this week

Enumerate every external data source your agents read. Each one is an injection vector; rank by what tools the agent has when it reads them.
Find every code path where model output becomes an action with no deterministic check in between. Those are your real exposures, not the chat box.
Add confirmation gates to the irreversible actions first. Reversibility buys you time; irreversibility is where injection becomes incident.
Re-run your red-team set with the payload in a fetched document, not in the user prompt. If you only test the chat box, you are measuring the wrong surface.

The uncomfortable summary: in the agent era, “is the model safe from injection” is the wrong question because the answer is permanently “no.” The right question is “what can a compromised model in this system actually do,” and that one you can engineer down to acceptable.

To pull every prior week’s prompt-injection and agent coverage into one focused brief, filter the catch-up builder to the agents and prompt-injection topics.

— Theo

Indirect Prompt Injection: The Agent Era's Default Vulnerability

Why direct injection was survivable and indirect is not

Why this class resists patching

Containment is the actual control

What to do this week

Sources

AI Sec Weekly — in your inbox

Related

The OWASP LLM Top 10 (2025) Changed More Than the Numbering

How LLM Chatbots Leak Data Through Their Own Rendered Output

AI Sec Weekly: Friday, May 15, 2026

Comments