PREAMBLE: This document assesses the current state of authorization, delegation, and trust verification in production multi-agent AI systems. It is not theoretical. The failure modes described here are observable in deployed systems today, including our own. The purpose is to name the problem clearly enough to actually solve it.
Here is a thing that is true about most production multi-agent systems, including ones that call themselves enterprise-grade: the chain of authorization from user intent to agent action has never been formally audited. The orchestrator trusts its subagents. The subagents trust their tools. The tools trust the inputs they receive. Nobody in that chain is verifying identity. Nobody is checking whether the delegation was actually authorized at each step.
This is not carelessness. It is a structural gap in how the technology was built. The LLM-as-orchestrator pattern emerged from research environments where the threat model was "does the agent accomplish the task," not "did the right entity authorize this action at this step in this context." Production deployments inherited that assumption.
In a human organization, delegated authority is a formal concept. A CEO can authorize a VP to sign contracts up to $500K. That authorization is scoped, documented, and revocable. When an AI orchestrator calls a subagent and that subagent calls a tool that writes to a database — what is the equivalent governance structure? In most systems built today, the answer is: there isn't one.
The following matrix identifies the primary attack surfaces and failure modes in multi-agent trust chains. Severity is assessed based on prevalence in production systems and potential impact of exploitation.
| Threat Vector | Mechanism | Severity |
|---|---|---|
| Prompt Injection via Tool Output | Tool returns content containing instructions the orchestrator processes as authoritative. External data (web pages, emails, documents) becomes a command surface. | Critical |
| Unbounded Tool Authorization | Subagents are given access to tools with no scope constraint. The orchestrator delegates read-write DB access to an agent whose task required only read. | Critical |
| Delegation Depth Creep | Orchestrator → Subagent A → Subagent B → Tool. Each hop inherits full permissions of the caller. By hop 3, a narrowly authorized action has full system access. | High |
| Memory Poisoning | Malicious content written to the agent's persistent memory layer during one session affects behavior in future sessions. The contamination persists and propagates. | High |
| Confused Deputy via Shared Context | An agent with access to user A's context is prompted in a way that causes it to act on behalf of user B's interests, without either user authorizing the cross-context action. | High |
| Goal Misalignment Drift | Subagent optimizes for its assigned metric (task completion) in ways that conflict with the actual user intent. No mechanism catches the divergence until the action is taken. | Medium |
The instinct is to fix prompt injection at the model level — better alignment, better instruction following. This is wrong. Prompt injection in agentic systems is an architectural problem. When an agent calls a web search tool and renders the returned content into its context, that content has the same positional authority as the original system prompt. The model does not distinguish "instructions from the operator" from "content from an external source that happens to contain instructions."
The fix is not alignment — it's architectural separation. Tool outputs must be rendered into a distinct context bucket that the model treats as data, not instruction. This is a prompting and architecture decision, not a model fine-tuning problem.
When you give an agent a set of tools, you are giving that agent access to everything those tools can do. There is no standard mechanism for saying "this agent can call the database tool for reads but not writes" or "this subagent can access user data scoped to this session only." The tools are binary: available or not.
This means a subagent that needs to look up a price also has, by default, the ability to delete records — if the delete function is in the same tool. Nobody is checking at the tool-call level whether the current task context justifies the action being taken.
If an attacker can write to an agent's persistent memory — through any means, including cleverly crafted user inputs that the memory system faithfully stores — that contamination persists across sessions. Every future user of that agent context is affected. This is different from a prompt injection in a single session; this is persistent behavioral modification.
Memory systems designed for utility (remember what users care about, maintain continuity) have the same vulnerability surface as a writable configuration file with no access controls. The features that make memory useful are the same features that make it exploitable.
The mitigation is not to avoid memory — that's a capability regression. The mitigation is to treat memory writes as privileged operations: validate inputs before write, scope memory namespaces by session and user, and implement memory integrity checks on read.
Trust requires a verifiable chain of authorization. Right now, most multi-agent systems have a chain of assumption. Those are different things.
We run a boardroom system with six named AI executives. Carl, Claudia, Cal, Caroline, Conrad — each with a distinct identity, domain, and tool set. The system is production. Real users. Real data. Here is where we discovered the trust gaps firsthand:
The fix was straightforward once identified. But the pattern — orchestrator assumes the subagent response is well-formed, proceeds without validation — is the same pattern that enables prompt injection, permission creep, and memory poisoning. The assumption of clean handoffs is the root of the trust problem.
The mitigations in §03 and §04 address the known failure modes. There is a harder problem that no one in the industry has answered well: how do you verify that an agent is doing what it was asked to do, in real time, at scale?
Human oversight of individual agent actions doesn't scale beyond a few hundred actions per day. Automated oversight means agents watching agents — which reintroduces the trust problem one level up. The math doesn't close.
The answer, when it comes, will not be a model alignment fix. It will be an architectural primitive: a formal delegation protocol with scoped, time-bound, revocable authorization at each node of the agent graph. Something closer to OAuth for agents than anything in the current agent framework stack. We don't have it yet. The field doesn't have it yet. The honest position is to name that gap.
What you can do now, before the formal delegation protocol exists:
read_record() and write_record() are separate tools with separate authorization checks. An agent scoped for read cannot accidentally write. Design the tool surface to make the unsafe action impossible, not just unlikely.