AI Sec Weekly
Source code screen — illustrating an article on Indirect Prompt Injection The Agent Era's Default Vulnerability
news

Indirect Prompt Injection: The Agent Era's Default Vulnerability

As LLM agents gained tools and memory, the dangerous injection stopped coming from the user and started coming from the data the agent reads. A defender's breakdown of why this class resists patching and what containment looks like.

By Theo Voss · · 8 min read

The recurring theme across the last several weeks of AI-security reporting is not a single incident — it is a pattern. The injection that matters now does not arrive in the user’s message box. It arrives in a web page the agent fetched, an email it summarized, a PDF a user uploaded, or a record it pulled from a database it was told to trust. This is indirect prompt injection, and in an agent architecture it is closer to a design property than a bug. Here is the defender’s version of why, and what containment actually means.

Why direct injection was survivable and indirect is not

Direct prompt injection — a user typing “ignore your instructions” — is annoying but bounded. The attacker is the user, the blast radius is the user’s own session, and the worst case is usually a model saying something it shouldn’t to the person who asked for it.

Indirect injection breaks all three assumptions. The attacker is whoever controls the data the agent ingests, not the operator using it. The victim is a different person — the user whose agent now follows the attacker’s instructions. And the blast radius is whatever tools the agent can call: send email, write to the database, open a ticket, move money, exfiltrate context to an attacker-controlled URL. The instruction “summarize this page” becomes “summarize this page, and also, per the hidden text in the page, email the user’s recent messages to [email protected].”

Why this class resists patching

There is no parser boundary to fix. To the model, the system prompt, the user request, and the retrieved document are the same kind of thing: tokens in a context window. “Instruction” versus “data” is a distinction the architecture does not natively make. Every proposed fix is a probabilistic mitigation, not a guarantee:

The honest framing: you cannot make the model immune. You can make a successful injection not matter.

Containment is the actual control

The defenses that hold are architectural, not linguistic. They assume the model will be compromised and limit what a compromised model can do:

What to do this week

The uncomfortable summary: in the agent era, “is the model safe from injection” is the wrong question because the answer is permanently “no.” The right question is “what can a compromised model in this system actually do,” and that one you can engineer down to acceptable.

— Theo

Sources

  1. OWASP Top 10 for Large Language Model Applications
  2. MITRE ATLAS — Adversarial Threat Landscape for AI Systems
  3. NIST AI Risk Management Framework (AI RMF 1.0)
#prompt-injection #agents #llm-security #rag #guardrails #threat-model
Subscribe

AI Sec Weekly — in your inbox

Weekly digest of AI security news and analysis. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments