Prompt Injection
Prompt injection is an attack where malicious instructions hidden in the input — a user message, a web page, a document the agent reads — trick the model into ignoring its real instructions and doing the attacker's bidding instead.
Also known as: prompt injection attack, indirect prompt injection
A language model can’t reliably tell the difference between instructions from its developer and instructions that arrive inside the data it’s processing. Prompt injection exploits that: an attacker plants commands in something the model reads — a user’s message, a web page an agent browses, a document in a RAG pipeline, an email an assistant summarizes — and the model follows them as if they were legitimate.
The indirect form is the dangerous one for agents. The attacker never talks to the system directly; they poison a source the agent will later consume, and the payload fires when the agent reads it. Because the model fundamentally mixes instructions and data in the same channel, there’s no single patch that removes the risk. Defenses stack instead: constrain what the agent is allowed to do, treat all retrieved content as untrusted, validate outputs and tool calls, and keep a human in the loop for high-stakes actions.