AI Agent Security: Prompt Injection

Home

Blog

AI Agent Security: Defending Against Prompt Injection in Production

Short answer: AI agent security is the discipline of keeping an autonomous agent from being manipulated into doing harm through the very content it reads and the tools it can call. The dominant threat is prompt injection — OWASP ranks it as the number-one LLM risk (LLM01) — and the dangerous variant for agents is indirect prompt injection, where an attacker hides instructions inside a document, email, web page, or database record that the agent later ingests and obeys as if they came from you. The hard truth is that there is no complete fix: frontier labs have publicly acknowledged that prompt injection may never be fully patched, so security has to come from architecture, not a single guardrail. The agents that are safe to run in production are the ones built with least privilege, sandboxed tool execution, human approval on high-stakes actions, and tracing — defense in depth — rather than a clever system prompt.

Security is one of the quiet reasons agents stall between a demo and production. A demo runs in a trusted sandbox where nothing is adversarial. A production agent reads untrusted email, scrapes live web pages, and acts on customer records — every one of which is a channel an attacker can write to. Below is what prompt injection actually is, why agents are uniquely exposed, the real incidents that show the stakes, the OWASP framework that now names the risks, and how we design agents to survive contact with hostile input.

What prompt injection is — and why agents make it worse

A large language model cannot reliably tell the difference between instructions from its developer and instructions that arrive inside the data it is asked to process. To the model, it is all text in the same context window. Prompt injection exploits exactly this: it smuggles commands into content the model reads, and the model follows them.

There are two flavors, and the distinction is everything for agents.

A chatbot that only talks is barely exposed to indirect injection because it does not go out and read untrusted things. An agent is defined by the opposite: it reads external sources and it acts through tools. That combination — untrusted input flowing into a system that can send email, move money, modify records, or execute code — is what turns a language quirk into a security incident. The more capable the agent, the larger the blast radius when an injection lands.

The real incidents that set the stakes

This is not theoretical. A run of documented 2025–2026 incidents shows indirect injection moving from research demos into production systems.

The clearest example is EchoLeak (CVE-2025-32711), a zero-click vulnerability in Microsoft 365 Copilot rated CVSS 9.3. A single crafted email — never opened, never clicked — carried hidden instructions that Copilot ingested during a routine summarization task. The agent then pulled sensitive data from the user's OneDrive, SharePoint, and Teams and exfiltrated it through a trusted Microsoft domain. The victim did nothing. The agent did everything, because it read the wrong email.

Other reported cases follow the same shape. A vulnerability in Slack's AI assistant let hidden instructions in a message trick the assistant into surfacing a malicious link that, when clicked, leaked data from a private channel. Across security audits, OWASP-cited assessments found prompt injection present in roughly 73% of production AI deployments, with attack success rates reported between 50% and 84% depending on configuration and the number of attempts. On February 13, 2026, OpenAI shipped a "Lockdown Mode" for ChatGPT and stated plainly that prompt injection in AI browsers "may never be fully patched."

The pattern in every case is the same: the agent was trusted to act on content it could not vouch for. That is an architecture problem, and it is fixable — but not with a better prompt.

The OWASP Top 10 for Agentic Applications

For most of the LLM era the reference point was the OWASP Top 10 for LLM Applications, where prompt injection sits at LLM01. In December 2025, OWASP's Gen AI Security Project published a framework aimed specifically at agents: the OWASP Top 10 for Agentic Applications (2026), peer-reviewed with input from over a hundred security researchers and practitioners. It is the first widely adopted vocabulary for agent-specific risk, and it is worth knowing because it reframes security around what agents do, not just what they say.

The categories most relevant to day-to-day agent engineering include:

Agent Goal Hijack (ASI01) — attackers redirect the agent's objective by manipulating instructions, tool outputs, or external content. This is indirect prompt injection seen from the agent's side.
Tool Misuse & Exploitation (ASI02) — the agent is steered into using its legitimate tools for harm: calling an API it shouldn't, deleting data, sending a message, spending money.
Identity & Privilege Abuse (ASI03) — the agent's inherited credentials, delegated permissions, or agent-to-agent trust get exploited, so the attack inherits whatever access the agent has.
Agentic Supply Chain Vulnerabilities (ASI04) — a tampered tool, model, descriptor, or agent persona compromises execution before any user input arrives.
Unexpected Code Execution (ASI05) — the agent generates or runs attacker-controlled code.
Memory & Context Poisoning (ASI06) — persistent corruption of the agent's memory, RAG store, or context, so a single injection keeps influencing behavior across future sessions.

The remaining categories extend through to Rogue Agents (ASI10) — compromised or misaligned agents that diverge from intended behavior. The throughline is that an agent's power is its attack surface. Every tool you grant, every source it reads, every credential it holds, and every fact it remembers is something that can be turned against you.

Why there is no single fix

The instinct is to ask for the patch — the one setting or guardrail that closes the hole. It does not exist, and pretending otherwise is the most dangerous posture in agent security.

The reason is structural. Prompt injection is not a bug in a specific model that a vendor can fix; it is a consequence of how language models work. They process instructions and data in the same channel, and no current technique reliably separates the two. Classifier-based filters that try to catch injection attempts help at the margin but are routinely bypassed, because the space of ways to phrase a malicious instruction is effectively infinite. Frontier models from every major lab remain exploitable after their best mitigations are applied. That is why the labs themselves describe the problem as unsolved.

This changes the goal. You are not trying to make injection impossible — you cannot. You are trying to make a successful injection harmless: to ensure that even when the model is fooled, it cannot reach anything that matters. That is the entire philosophy of defense in depth, and it is the difference between an agent that fails safe and one that fails catastrophically.

Defense in depth: the controls that actually hold

Securing an agent is about constraining what a hijacked agent can do, so a bad instruction hits a wall instead of a tool. A few controls carry most of the weight.

None of these is novel security thinking — least privilege, sandboxing, human approval, and audit logging are decades old. What is new is applying them rigorously to a system that takes instructions from the open internet. The agents that belong in production are the ones where these controls were designed in from the start, not bolted on after an incident.

Security is a scoping decision, not a final step

The most expensive mistake is treating security as a review you run before launch. By then the agent's privileges, tool set, and data access are baked in, and tightening them means re-architecting. Security has to be decided at scoping, alongside the capabilities — because in agents, the capabilities are the risk.

That means asking the uncomfortable questions early. Does this agent really need write access, or can a human approve the write? Does it need to browse the open web, or a vetted set of sources? What is the worst thing it could do if every instruction it read today were hostile — and are we comfortable with that answer? An agent scoped with those questions in mind tends to be both safer and more useful, because narrow, well-defined agents outperform sprawling ones anyway. Security and quality pull in the same direction more often than teams expect.

How Moai Team approaches this

We treat agent security as part of architecture, decided when we scope the agent, not as a gate at the end. The first questions we ask are about exposure and blast radius: What untrusted content will this agent read? What tools and credentials does it actually need versus what would be convenient? Which actions are irreversible or sensitive enough to require a human? Those answers shape the design before a line of the harness is written.

From there we build defense in depth into the harness itself. We apply least privilege to every tool and credential, keep untrusted external content separated from trusted instructions, sandbox anything that executes, and put human approval in front of high-stakes actions by default. We guard the memory and RAG layers so a poisoned entry cannot become a persistent compromise, and we wire in tracing so every prompt, tool call, and data flow is visible and auditable. Then we make adversarial behavior a first-class case in the eval harness — injection attempts become tests the agent has to pass, run continuously, rather than a one-time pentest. This is the same discipline that separates a demo from a production agent everywhere else: it is easy to make an agent that works when nothing is hostile, and it is engineering to make one that holds when something is.

Frequently Asked Questions

What is prompt injection in AI agents?

Prompt injection is an attack where adversarial instructions are smuggled into the content an AI model processes, causing the model to follow them instead of its intended task. For agents, the dangerous form is indirect prompt injection: the malicious instructions are hidden inside external content the agent retrieves on its own — a web page, email, PDF, database record, or tool output — rather than typed by a user. Because agents both read untrusted sources and act through tools, an injection can lead to data exfiltration, unauthorized tool use, or other real-world harm. OWASP ranks prompt injection as the number-one LLM security risk (LLM01).

Can prompt injection be completely prevented?

No. Prompt injection is a consequence of how language models process instructions and data in the same channel, not a fixable bug in a particular model. Frontier labs including OpenAI, Google, and Anthropic remain vulnerable after applying their best mitigations, and OpenAI has publicly stated the problem "may never be fully patched." The realistic goal is not to make injection impossible but to make a successful injection harmless — through defense in depth: least privilege, sandboxed execution, human approval on sensitive actions, and full tracing, so a fooled model cannot reach anything that matters.

What is the OWASP Top 10 for Agentic Applications?

It is a security framework published by OWASP's Gen AI Security Project in December 2025, peer-reviewed with input from over a hundred researchers and practitioners, that names the most critical risks specific to autonomous and agentic AI systems. Its categories include Agent Goal Hijack (ASI01), Tool Misuse & Exploitation (ASI02), Identity & Privilege Abuse (ASI03), Agentic Supply Chain Vulnerabilities (ASI04), Unexpected Code Execution (ASI05), Memory & Context Poisoning (ASI06), and through to Rogue Agents (ASI10). It reframes agent security around what agents do — calling tools, holding credentials, remembering state — rather than only what they say.

What is the single most effective defense for AI agents?

Putting a human in the loop on high-stakes actions. Requiring explicit human approval before an agent does anything irreversible or sensitive — sending external email, moving money, making purchases, modifying or deleting data, or executing code — is the most effective single control against tool abuse, because it places a person between an injected instruction and its consequence. It works best combined with least privilege and sandboxed tool execution, so that the actions a human never needs to approve are also the only ones the agent can take unsupervised.

Moai Team builds AI agents with security designed in at scoping — least privilege, sandboxed tools, human approval on sensitive actions, guarded memory, and adversarial evals — so they hold up when the input turns hostile, not just in a friendly demo. Schedule a call.

Contents

Do you have any questions about software development?

We’re delighted to offer a free, no-obligation consultation to answer all your questions and give honest advice

Schedule a free consultation

AI Agent Security: Defending Against Prompt Injection in Production