Anthropic had a problem with its AI agents.
They kept dying. When a long-running agent session crashed, everything went with it: the session history, the execution state, and the work in progress.
The brain (Claude and its control loop), the hands (the sandbox where code runs), and the session log were all in a single container. If it failed, there was nothing to recover.
So Anthropic rebuilt that architecture. The result is Managed Agents. It's a hosted service in the Claude Platform that runs long-horizon agents on behalf of developers.
They designed it so that every component can fail and be replaced without losing state. And the performance gains were dramatic.
Median response latency dropped by roughly 60%, and tail latency by over 90%.
Let’s dig in!
The Single-Container Problem
The first version of Anthropic's agent put everything inside one container.
The agent harness, the loop that calls Claude and routes its tool calls to the right place, ran alongside the sandbox where Claude executed code, alongside the session log that recorded events.
There were no boundaries between them. Its advantage was simplicity: file edits were direct system calls, and there was nothing to coordinate. But it made the system fragile.
In infrastructure engineering, the difference between "pets" and "cattle" is well-known:
A pet is a server you name, tend carefully, and nurse back to health when it gets sick. Cattle are interchangeable. If one goes down, you replace it with an identical copy and move on.
Anthropic's containers had become pets.
If a container became unresponsive, the team had to investigate it manually. Their only window into the system was a WebSocket event stream, which could not differentiate between a bug in the harness, a dropped network packet, or the container going offline entirely. All three failures looked the same from the outside.
Debugging meant opening a shell in the container. But that container also held user data. It meant the very act of investigating a failure created a security concern.
Teams couldn't debug production issues safely. A second problem arose when customers wanted to connect Claude to resources inside their own virtual private cloud.
Because the harness assumed everything it needed sat inside its container, the only option was to peer the customer's network with Anthropic's.
That architectural assumption had become a deployment constraint.
The Operating System Analogy
The solution comes from one of the oldest design patterns in computing: virtualisation.
Decades ago, operating systems (OS) solved this same problem. Hardware was specific, but the programs that would run on it hadn't been invented yet.
Therefore, operating systems created abstractions, such as a "process" and a "file." These were general enough for software that didn't exist.
The read() system call works the same whether it is accessing a disk pack from the 1970s or a modern SSD. The interface stayed stable while the hardware underneath changed.
Managed Agents applies this principle to AI agent infrastructure.
Instead of coupling everything together, Anthropic virtualised the three core components of an agent into separate interfaces:
The session: It's an append-only log of everything that happened during an agent's work. It stores events outside both the harness and the sandbox.
The harness: It's the control loop that calls Claude, receives tool-call requests, and routes them to the right place. It reads from and writes to the session, but does not depend on any specific sandbox.
The sandbox: It's an execution environment where Claude can run code and edit files. It is accessed through a simple interface: execute(name, input) → string.
Each component can fail independently. Each can be replaced without disturbing the others. The design is opinionated about the shape of these interfaces, but deliberately unopinionated about what runs behind them.
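The three interfaces can be sketched as abstract contracts. This is an illustrative reconstruction, not Anthropic's actual code: only `execute(name, input) → string` and `getEvents()` are named in the article; the class and method names below are assumptions, and `InMemorySession` is a toy backend standing in for a durable store.

```python
from abc import ABC, abstractmethod

class Session(ABC):
    """Durable, append-only event log stored outside both harness and sandbox."""
    @abstractmethod
    def append(self, event: dict) -> None: ...
    @abstractmethod
    def get_events(self, since: int = 0) -> list[dict]: ...

class Sandbox(ABC):
    """Execution environment reached through execute(name, input) -> string."""
    @abstractmethod
    def execute(self, name: str, input: str) -> str: ...

class InMemorySession(Session):
    """Toy backend for illustration; a real session lives in durable storage."""
    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, event: dict) -> None:
        self._events.append(event)  # append-only: events are never rewritten

    def get_events(self, since: int = 0) -> list[dict]:
        return self._events[since:]  # a reader can resume from any offset
```

Because the contracts are this narrow, anything that satisfies them, container, VM, or something stranger, can sit behind them.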

The Managed Agents architecture: each component is a separate interface that can fail and be replaced independently (source)
The Decoupling
Anthropic describes the separation using a memorable metaphor.
The harness is the brain. It reasons, decides, and calls Claude.
The sandbox is the hands. It performs actions. The session is the persistent record that both can reference.
In the new architecture, the harness no longer lives inside the sandbox container.
It calls the container the same way it calls any other tool: a name goes in as input, and a string comes out. The container becomes cattle. If it dies, the harness catches the failure as a tool-call error and passes it back to Claude.
If Claude decides to retry, a fresh container starts with a standard provisioning recipe (no more nursing sick containers). The harness itself becomes cattle.
Because the session log sits outside the harness, nothing in the harness needs to survive a crash. The harness is stateless. It can be replaced at any time.
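That failure path can be sketched in a few lines. This is a minimal sketch under stated assumptions: `SandboxCrashed` and the error-string format are hypothetical; the point is only that a dead container surfaces as an ordinary tool result the model can reason about.

```python
class SandboxCrashed(Exception):
    """Hypothetical error raised when the execution container dies mid-call."""

def run_tool(sandbox, name: str, input: str) -> str:
    # The sandbox is just another tool: string in, string out.
    # A container death is not a harness crash -- it becomes a tool-call
    # error returned to the model, which may retry against a fresh
    # container provisioned from the standard recipe.
    try:
        return sandbox.execute(name, input)
    except SandboxCrashed as exc:
        return f"tool_error: sandbox unavailable ({exc}); retry will provision a fresh container"
```

Nothing in the harness holds state the retry depends on; the session log, outside the harness, already recorded everything that happened before the crash.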
The security boundary becomes structural. In the old coupled design, credentials and untrusted code lived in the same container.
A prompt injection only had to convince Claude to read its own environment variables. If an attacker obtained those tokens, they could spawn fresh sessions and delegate work to them.

Every component is defined by its interface, not its implementation. Any technology that satisfies the contract can be swapped in (source)
The fix was architectural. Credentials never enter the sandbox.
For Git access, the repository's access token is used to clone during sandbox initialisation and wired into the local remote. Git push and pull work from inside the sandbox without the agent ever seeing the token.
For external tools accessed via MCP (Model Context Protocol), OAuth tokens are stored in a secure vault. Claude calls tools through a dedicated proxy that fetches credentials from the vault and makes the call.
The harness itself never touches any credentials either.
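The vault-and-proxy pattern can be sketched as follows. `CredentialVault` and `proxy_tool_call` are illustrative names, not Anthropic's API; the sketch only shows the invariant from the text: the token travels from vault to outbound request without ever passing through the agent.

```python
class CredentialVault:
    """Illustrative token store that lives outside the sandbox."""
    def __init__(self) -> None:
        self._tokens: dict[str, str] = {}

    def put(self, tool: str, token: str) -> None:
        self._tokens[tool] = token

    def get(self, tool: str) -> str:
        return self._tokens[tool]

def proxy_tool_call(vault: CredentialVault, tool: str, payload: dict) -> dict:
    # The proxy, not the agent, fetches the credential and attaches it
    # to the outbound request. The agent only ever sees the result.
    token = vault.get(tool)
    request = {
        "tool": tool,
        "payload": payload,
        "headers": {"Authorization": f"Bearer {token}"},
    }
    # ... a real proxy would make the network call here ...
    del request  # the authorised request never flows back to the caller
    return {"status": "sent", "tool": tool}
```

Even a fully compromised sandbox has nothing to exfiltrate: the secrets were never inside it.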
The Session as Durable Memory
Long-running agent tasks often exceed Claude's context window, the amount of text the model can hold in working memory at once.
The usual solutions all involve choices about what to keep and discard. Compaction lets Claude save a summary and discard the originals. A memory tool writes notes to files.
Context trimming removes old tool results and reasoning blocks.
But all of these are one-way doors. If a compacted summary misses a detail that matters three steps later, that detail is gone.
Managed Agents addresses this by making the session log a durable context object that lives entirely outside Claude's context window.
The interface, getEvents(), lets the harness study the full event history.
It can pick up from where it last stopped reading.
It can rewind to a specific moment to see what happened.
It can re-read the context before taking a critical action.
Fetched events can also be transformed before being passed to Claude: reorganised for cache efficiency, trimmed for relevance, or restructured for whatever context engineering the current model needs.
The session handles storage.
The harness handles context management. These concerns are separated because Anthropic cannot predict what strategies future models will require.
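A sketch of that division of labour: the session returns the full durable history, and the harness applies whatever trimming policy the current model needs. The filtering policy below is purely illustrative (not Anthropic's actual strategy); the key property is that trimming happens at read time, so it is reversible on the next read.

```python
def build_context(session, max_events: int = 50) -> list[dict]:
    """Rebuild the model's context from the durable log on each turn."""
    # getEvents() returns everything -- nothing was discarded at write
    # time, so this is a two-way door: a detail dropped on this turn can
    # be recovered on the next one.
    events = session.get_events()
    # Illustrative policy: drop stale tool results, keep everything else.
    kept = [e for e in events if e.get("type") != "tool_result"]
    if events and events[-1].get("type") == "tool_result":
        kept.append(events[-1])  # the most recent result is usually needed
    # Cap what is handed to the model; the rest stays in the session.
    return kept[-max_events:]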
The Performance Payoff
The decoupled architecture also reshaped how the system scales. In the coupled design, every agent session required a dedicated container.
No inference could begin until that container was provisioned: repositories cloned, processes booted, pending events fetched.
Even sessions that would never touch the sandbox paid this cost. Time-to-first-token (TTFT) measures how long a session waits between accepting work and producing its first response.
It is the latency users feel most acutely. With the decoupled architecture, containers are provided only when needed via a tool call, like any other action.
A session that doesn't need a sandbox right away does not wait for one. Inference starts as soon as the harness pulls pending events from the session log.
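The lazy-provisioning behaviour can be sketched with a wrapper that satisfies the same `execute` contract. `LazySandbox` and its `provision` factory are hypothetical names; the sketch shows only the latency property described above: the container cost is paid on the first tool call, or never.

```python
class LazySandbox:
    """Illustrative wrapper: boots the real container only on first use."""
    def __init__(self, provision):
        self._provision = provision  # factory that boots a real container
        self._container = None

    def execute(self, name: str, input: str) -> str:
        if self._container is None:
            # The expensive step happens here, not at session start,
            # so inference can begin before any container exists.
            self._container = self._provision()
        return self._container.execute(name, input)
```

A session that never calls a tool never triggers `provision()` at all, which is exactly why sessions that skip the sandbox no longer pay its startup cost.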
The result is that median TTFT dropped roughly 60%.
The 95th percentile latency, the slow tail that determines worst-case user experience, dropped by over 90%. Scaling to many brains means starting many stateless harnesses and connecting them to hands only when required.
The architecture also allows connecting each brain to multiple execution environments.
Claude reasons about which environment to send work to, a harder cognitive task than operating in a single shell, but one that improving model intelligence makes increasingly natural.
Because no hand is coupled to any brain, hands can even be passed between agents. The interface is always the same: execute(name, input) → string.
The harness does not know or care whether the sandbox is a container, a phone, or, as the Anthropic team notes, a Pokémon emulator.
Designing for Programs That Don't Exist Yet
The most interesting choice in this design is what it intentionally leaves open.
Managed Agents is not a specific harness. It is what the team calls a "meta-harness," a system of interfaces that can adapt any harness, including ones that do not exist yet.
Claude Code is one harness. Task-specific harnesses outperform general-purpose ones in narrow domains. Managed Agents can run any of them.
The interfaces guarantee that the session is durable, the sandbox is reachable, and the brain is replaceable. Everything else is left to the implementation.
This matters because agent harnesses encode assumptions about what the model can't do on its own. Those assumptions go stale as models improve.
Anthropic found this firsthand. Claude Sonnet 4.5 would wrap up tasks prematurely as it sensed its context limit approaching, a behaviour the team called "context anxiety."
They added context resets to the harness to compensate, but when they ran the same harness on Claude Opus 4.5, the behaviour was gone.
The resets had become dead weight. A harness that compensated for one model's limitation had become a constraint on its successor.
By designing interfaces that outlast any specific implementation, Anthropic is betting that the hardest part of building agent infrastructure is ensuring today's solutions do not become tomorrow's constraints.
