Securing AI Agents and Agentic Workflows

A chatbot that gives a wrong answer is a quality problem. An agent that takes a wrong action is a security incident. The entire difference between the two is the ability to act — to call tools, hit internal APIs, modify state, spend money. The moment you grant that ability, your AI system becomes a privileged actor inside your environment, and almost every comfortable assumption from traditional application security stops holding.

This is the threat model and control set I use for agentic systems. It is less about the model and more about the blast radius around it.

What changes when a model can act

In a classic web application, the trust boundary is clear: untrusted user input arrives, the application validates it, and code you wrote decides what happens. The logic is fixed and auditable.

An agent inverts this. The model — influenced by untrusted input — decides what happens. The instructions and the data arrive through the same channel, in natural language, and the model has no reliable way to tell a legitimate instruction from one an attacker embedded in a document, a web page, or a previous tool result.

Two consequences follow, and they are the core of agentic security:

The control flow is attacker-influenceable. Whoever can get text in front of the model can attempt to steer its actions. This is prompt injection, and against a tool-using agent it is not a content problem — it is privilege escalation.
The data and the instructions share a channel. There is no out-of-band way to say “this part is trusted, this part is just data.” Everything is tokens.

If you internalize only one thing: an agent’s effective permissions are the union of everything its tools can do, available to anyone who can influence its input.

A threat model for agentic systems

I work through five attack surfaces on every agentic design.

Prompt injection, direct and indirect

Direct injection is the user telling the agent to ignore its instructions. The more dangerous variant is indirect injection: malicious instructions hidden in content the agent ingests — a web page it browses, a document it summarizes, a ticket it reads, the output of a tool it called. The agent treats that content as instructions because, to the model, it is just more text.

You do not “fix” prompt injection with a better system prompt. You contain it by limiting what a successfully injected agent can do.

Excessive tool scope

This is the most common real-world weakness. An agent is given a broad tool (“run any SQL query,” “call any internal endpoint,” a shell) because it is convenient during development. That breadth becomes the attacker’s capability the moment input is compromised. Tools should be narrow, specific, and parameterized, never general.

Confused-deputy and identity confusion

When an agent acts on behalf of a user but with the agent’s own (often broader) credentials, an attacker can induce it to perform actions the user could never authorize. The agent is a deputy confused about whose authority it is exercising. The fix is identity propagation: the agent acts with the user’s authority, not its own ambient privilege.

Memory and context poisoning

Agents that persist memory across sessions can be poisoned: an attacker plants content in one interaction that subtly steers behavior in later ones. Treat persistent agent memory as attacker-influenceable storage, not as trusted state.
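One way to act on this is to record where every memory entry came from and keep unreviewed entries from untrusted origins out of later sessions. The sketch below is illustrative only: `MemoryEntry`, `AgentMemory`, the origin labels, and the `reviewed` flag are hypothetical names, and the review step is assumed to happen outside the agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    text: str
    origin: str             # hypothetical labels, e.g. "operator", "user:alice", "web"
    reviewed: bool = False  # assumed to be set only by a process outside the agent


@dataclass
class AgentMemory:
    trusted_origins: frozenset = frozenset({"operator"})
    entries: List[MemoryEntry] = field(default_factory=list)

    def remember(self, text: str, origin: str) -> MemoryEntry:
        """Store an entry together with its provenance."""
        entry = MemoryEntry(text=text, origin=origin)
        self.entries.append(entry)
        return entry

    def recall(self) -> List[MemoryEntry]:
        """Return only entries from trusted origins or entries that were reviewed."""
        return [e for e in self.entries
                if e.origin in self.trusted_origins or e.reviewed]

    def quarantined(self) -> List[MemoryEntry]:
        """Entries held back from recall until someone reviews them."""
        return [e for e in self.entries
                if e.origin not in self.trusted_origins and not e.reviewed]
```

Filtering on recall rather than on write keeps the planted content available for investigation while it stays out of the agent's context.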

Tool-chain and supply-chain exposure

Every tool, plugin, and connected MCP server is part of the attack surface. A compromised or malicious tool can feed poisoned results back into the agent’s context. Inventory and vet what your agents can reach.
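An inventory can be as simple as an allowlist that pins each approved tool to a digest of its manifest, so an unknown tool or a changed definition is refused. This is a minimal sketch under assumptions: `ToolInventory`, `manifest_digest`, and the manifest fields are hypothetical and not tied to any particular framework or MCP implementation.

```python
import hashlib
import json
from typing import Dict


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the tool manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToolInventory:
    """Allowlist of vetted tools, each pinned to the manifest that was reviewed."""

    def __init__(self) -> None:
        self._approved: Dict[str, str] = {}  # tool name -> pinned digest

    def approve(self, name: str, manifest: dict) -> None:
        """Record the digest of a manifest after a human has vetted it."""
        self._approved[name] = manifest_digest(manifest)

    def verify(self, name: str, manifest: dict) -> bool:
        """True only if the tool is known and its manifest is unchanged."""
        expected = self._approved.get(name)
        return expected is not None and expected == manifest_digest(manifest)
```

A check like this only detects drift in the definition; it says nothing about whether the tool's results are trustworthy at run time.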

Controls that actually contain an agent

The strategy is containment, not perfection. Assume the model will be manipulated and design so that manipulation is survivable.

Scope tools to the minimum

Give an agent the narrowest tools that accomplish its job. Prefer get_order_status(order_id) over query_database(sql). Prefer an API that can only read over one that can also write. Every capability you withhold is an attack you do not have to defend.
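As a rough illustration of the difference, here is what a narrow `get_order_status` tool might look like. The `orders` table, the ID format, and the `make_demo_db` helper are assumptions made for the example; the point is that the query text is fixed and the model only supplies one validated parameter.

```python
import re
import sqlite3
from typing import Optional

# Assumed ID format for the example; adjust to whatever your order IDs look like.
ORDER_ID_PATTERN = re.compile(r"[A-Z0-9-]{1,32}")


def get_order_status(conn: sqlite3.Connection, order_id: str) -> Optional[str]:
    """Narrow tool: one fixed read-only query, one validated parameter."""
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError("order_id has an unexpected format")
    # Parameter binding keeps the value out of the SQL text entirely.
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical in-memory database, used only to exercise the tool."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO orders VALUES ('A-1001', 'shipped')")
    return conn
```

An injected agent calling this tool can learn the status of an order. It cannot read other tables, write, or drop anything, because none of that is expressible through the interface.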

Put a human in the loop on consequential actions

Define which actions are consequential — moving money, deleting data, changing access, external communication — and require explicit human approval for those, regardless of how confident the agent is. The approval step is the circuit breaker that turns a compromised agent from an incident into a near-miss.
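The gate belongs in the dispatcher that executes tool calls, not in the prompt, so the model cannot talk its way past it. A minimal sketch, assuming a hypothetical `dispatch` function, a hand-maintained `CONSEQUENTIAL` set, and an `approve` callback that stands in for whatever approval channel you use:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical tool names; the set is maintained by people, not by the model.
CONSEQUENTIAL = frozenset(
    {"transfer_funds", "delete_records", "change_access", "send_external_message"}
)


class ApprovalDenied(Exception):
    """Raised when a consequential action was not approved by a human."""


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def dispatch(
    call: ToolCall,
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[ToolCall], bool],
) -> Any:
    """Run a tool call, asking a human first when the action is consequential."""
    if call.name not in tools:
        raise KeyError(f"unknown tool: {call.name}")
    if call.name in CONSEQUENTIAL and not approve(call):
        raise ApprovalDenied(f"{call.name} was not approved")
    return tools[call.name](**call.args)
```

The approver sees the exact tool name and arguments that will run, which is what makes the approval meaningful.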

Treat the agent as a non-human identity

Every agent gets its own scoped identity, short-lived credentials, and least privilege — exactly as you would treat a service account, because that is what it is. No shared keys, no broad standing permissions. When the agent acts for a user, propagate the user’s identity so authorization is evaluated against their permissions, not the agent’s.
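A sketch of what identity propagation can look like at the authorization check. `Principal`, `authorize`, and the permission strings are hypothetical. Requiring the permission from both the agent and the user is a design choice of this example: the user's permissions decide what is allowed, and the agent's own scope acts as an additional ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    subject: str
    permissions: frozenset


# Hypothetical agent identity with a deliberately narrow scope.
AGENT = Principal("agent:support-bot", frozenset({"orders:read", "orders:refund"}))


def authorize(agent: Principal, on_behalf_of: Principal, required: str) -> None:
    """Allow an action only if both the agent and the user hold the permission."""
    if required not in agent.permissions:
        raise PermissionError(f"{agent.subject} is not scoped for {required}")
    if required not in on_behalf_of.permissions:
        raise PermissionError(f"{on_behalf_of.subject} may not {required}")


def refund_order(agent: Principal, user: Principal, order_id: str) -> str:
    """Example tool that evaluates authorization against the calling user."""
    authorize(agent, user, "orders:refund")
    return f"refund issued for {order_id} by {user.subject} via {agent.subject}"
```

With this shape, an attacker who steers the agent during a read-only user's session still cannot issue a refund, because the check runs against that user's permissions.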

Establish trust boundaries for content

Tag where content came from. Content the agent retrieves from external or user-controlled sources is data, not instruction, and downstream tools should treat it with corresponding suspicion. You cannot make the model perfectly honor this, but you can make your tools enforce it — for example, by refusing to act on parameters that originated from untrusted retrieved content without confirmation.
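One possible shape for tool-side enforcement is to carry an origin tag alongside each value and have the tool check it. The names here (`Origin`, `Tagged`, `send_message`, `ConfirmationRequired`) are hypothetical, and the sketch assumes your orchestration layer can track which parameters were derived from retrieved content.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    TRUSTED = "trusted"      # e.g. operator configuration or a confirmed value
    UNTRUSTED = "untrusted"  # e.g. web pages, documents, tickets, tool output


@dataclass(frozen=True)
class Tagged:
    value: str
    origin: Origin


class ConfirmationRequired(Exception):
    """Raised when an untrusted parameter needs explicit confirmation."""


def send_message(recipient: Tagged, body: Tagged, confirmed: bool = False) -> str:
    """Refuse to pick a destination from untrusted content without confirmation."""
    # The body may quote untrusted text; the recipient decides where data
    # goes, so it is the parameter gated in this example.
    if recipient.origin is Origin.UNTRUSTED and not confirmed:
        raise ConfirmationRequired(
            f"recipient {recipient.value!r} came from untrusted content"
        )
    return f"queued message to {recipient.value}"
```

The check lives in the tool, so it holds even when the model has been persuaded that the injected address is legitimate.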

Log the full trace

Record the agent’s inputs, reasoning steps, tool calls, and tool results as a security-relevant audit trail. When something goes wrong with an autonomous system, the trace is the only way to reconstruct what happened and why.
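A trace does not need to be elaborate to be useful. The sketch below emits one JSON line per event with a shared trace ID and a sequence number. The `TraceLog` class, the event kinds, and the field names are assumptions for illustration; in practice the `emit` callback would write to append-only storage the agent cannot modify.

```python
import json
import time
import uuid
from typing import Any, Callable, Dict


class TraceLog:
    """Ordered, structured record of one agent run."""

    def __init__(self, emit: Callable[[str], None] = print) -> None:
        self.trace_id = uuid.uuid4().hex  # ties all events of one run together
        self._seq = 0
        self._emit = emit

    def record(self, kind: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Emit one event; kind is e.g. input, reasoning, tool_call, tool_result."""
        self._seq += 1
        event = {
            "trace_id": self.trace_id,
            "seq": self._seq,
            "ts": time.time(),
            "kind": kind,
            "payload": payload,
        }
        # default=str keeps logging from failing on non-JSON-native values.
        self._emit(json.dumps(event, sort_keys=True, default=str))
        return event
```

The sequence number matters as much as the timestamp: it lets you see which tool result preceded which decision when reconstructing an incident.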

Bound the blast radius

Rate-limit actions. Cap spend. Constrain the agent to a defined set of resources. Define a kill switch and test that it works. The question to design against is not “will this agent ever be manipulated?” but “when it is, what is the worst it can do?” Then make that worst case small.
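These limits can sit in a single guard that every action passes through before it runs. This is a sketch under assumptions: `BlastRadiusGuard`, its parameters, and the one-minute window are illustrative choices, and the clock is injectable so the limits can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, FrozenSet


class LimitExceeded(Exception):
    """Raised when a rate limit or spend cap would be exceeded."""


class AgentKilled(Exception):
    """Raised for every action once the kill switch is engaged."""


class BlastRadiusGuard:
    def __init__(
        self,
        max_actions_per_minute: int,
        max_spend_cents: int,
        allowed_resources: FrozenSet[str],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_actions = max_actions_per_minute
        self._max_spend = max_spend_cents
        self._allowed = allowed_resources
        self._clock = clock
        self._recent: Deque[float] = deque()  # timestamps of recent actions
        self._spent = 0
        self._killed = False

    def kill(self) -> None:
        """Kill switch: every later check fails."""
        self._killed = True

    def check(self, resource: str, cost_cents: int = 0) -> None:
        """Call before each action; raises if the action is out of bounds."""
        if self._killed:
            raise AgentKilled("kill switch engaged")
        if resource not in self._allowed:
            raise PermissionError(f"resource not in scope: {resource}")
        now = self._clock()
        # Drop timestamps that have left the 60-second window.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        if len(self._recent) >= self._max_actions:
            raise LimitExceeded("rate limit reached")
        if self._spent + cost_cents > self._max_spend:
            raise LimitExceeded("spend cap reached")
        self._recent.append(now)
        self._spent += cost_cents
```

Because the clock is a parameter, the kill switch and both limits can be exercised in an ordinary unit test, which is one way to satisfy "test that it works."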

The mindset shift

Securing agents is less about the model and more about everything around it. The model is the part you cannot fully control; the tools, identities, approvals, and boundaries are the parts you can. A well-secured agentic system assumes the model is fallible and occasionally adversarially steered, and arranges the world so that this is contained rather than catastrophic.

Get the containment right and you can deploy agents into real workflows with confidence. Skip it, and you have handed a confused, manipulable, privileged actor the keys to your environment — and hoped for the best.
