Some Problems Are Too Big for One Context Window

Internal platforms used to be the exotic option. Now they are the default. A December 2025 Platform Engineering survey of 518 organizations found that nearly 90% of enterprises run internal platforms, with a dedicated platform-product-manager role appearing in a quarter of them. The interesting part is not the adoption curve. It is what teams are starting to run on those platforms: not single AI calls, but coordinated pipelines of AI agents doing work that no single agent could finish.

That shift has a forcing function, and it is mechanical, not philosophical. Some problems do not fit in one context window. When you hit that wall, the fix is not a better prompt. It is orchestration.

The Wall Is Capacity, Not Cleverness

Here is the number that settles the argument. Anthropic ran a multi-agent system, an Opus 4 lead delegating to Sonnet 4 workers, against a single Opus 4 agent on their internal research eval. The multi-agent system outperformed the single agent by 90.2%. The single agent’s failure mode was concrete: asked to identify all the board members of the companies in the Information Technology S&P 500, it ground through companies one at a time and never finished. It was not dumb. It ran out of room.

The same write-up makes the underlying mechanic explicit. Token usage by itself explains 80% of the performance variance on their browsing benchmark. Subagents help by “operating in parallel with their own context windows,” then condensing the most important tokens back to the lead. You are not making the model smarter. You are giving the problem more total working memory than one session holds.

That is the whole thesis: comprehensiveness is a capacity problem. A task like “review these fifty files the same way” or “search this entire surface and miss nothing” does not fit in a single conversation. You need parallel contexts and a way to coordinate them.

What Orchestration Actually Buys You

There are two distinct wins here, and people conflate them.

The first is isolation. When you hand work to a subagent, it runs with fresh context and a separate cache, and the verbose output — test logs, file dumps, intermediate reasoning — “stays in the subagent’s context while only the relevant summary returns to your main conversation.” This is why “review fifty files the same way” is an orchestration problem. You spawn a reviewer per file. Each one drowns in its own file’s detail. Your main session only ever sees fifty clean summaries. The noise never touches the conversation that has to hold the final answer.
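The isolation pattern can be sketched in a few lines of TypeScript. This is a minimal illustration, not a real agent API: `runReviewer` is a hypothetical stand-in that fakes a reviewer locally, and the `SourceFile` and `FileSummary` shapes are assumptions made for the example.

```typescript
// Hypothetical input and output shapes for one reviewer.
interface SourceFile {
  path: string;
  text: string;
}

interface FileSummary {
  path: string;
  verdict: "ok" | "needs-work";
  note: string;
}

// Stand-in for a real subagent call (assumption: a real one would start a
// fresh model session). The verbose log is local to this function, so it is
// discarded when the function returns.
async function runReviewer(file: SourceFile): Promise<FileSummary> {
  const verboseLog: string[] = [];
  for (const line of file.text.split("\n")) {
    verboseLog.push(`inspected: ${line}`); // noise that stays in this "context"
  }
  const verdict = file.text.includes("TODO") ? "needs-work" : "ok";
  return { path: file.path, verdict, note: `${verboseLog.length} lines inspected` };
}

// Fan out one reviewer per file; the caller only ever receives the summaries.
async function reviewAll(files: SourceFile[]): Promise<FileSummary[]> {
  return Promise.all(files.map((file) => runReviewer(file)));
}
```

Calling `reviewAll` with fifty files would resolve to fifty `FileSummary` objects. Nothing in the return type lets a log line leak back to the caller, which is the property the paragraph above describes.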

The second win is determinism, and this is the part that turns a clever trick into infrastructure. A dynamic workflow runs from a script that Claude writes and that you can rerun. The script, not the model turn by turn, holds “the loop, the branching, and the intermediate results itself, so Claude’s context holds only the final answer.” The control flow lives in JavaScript with structured-output schemas. The model fills in the work; the code decides what runs next.

That distinction matters because it makes the pipeline repeatable. Run it Monday on fifty files, run it Friday on a different fifty, and the loop behaves identically. You are not re-explaining the process in prose every time and hoping the model interprets it the same way twice. The process is code.
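As a rough sketch of what "the process is code" might look like: the loop and the branch below are ordinary TypeScript, and the model's only job is to return something that passes a schema check. The names `ReviewFinding`, `WorkerFn`, and `runPipeline` are hypothetical, the schema check is hand-rolled for the example (a real pipeline might use a schema library), and the worker is injected so the control flow can run without a model.

```typescript
// Hypothetical structured output a worker must return for each item.
interface ReviewFinding {
  item: string;
  severity: "low" | "high";
}

// Hand-rolled schema check. Output that does not match is rejected, not interpreted.
function isReviewFinding(value: unknown): value is ReviewFinding {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["item"] === "string" && (v["severity"] === "low" || v["severity"] === "high");
}

// Stand-in for a model call; returns unknown because model output is untrusted.
type WorkerFn = (item: string) => unknown;

interface PipelineResult {
  escalated: string[];
  rejected: string[];
}

// The loop and the branching live in code and behave the same on every run.
function runPipeline(items: string[], worker: WorkerFn): PipelineResult {
  const result: PipelineResult = { escalated: [], rejected: [] };
  for (const item of items) {
    const output = worker(item);
    if (!isReviewFinding(output)) {
      result.rejected.push(item); // malformed output never reaches later steps
    } else if (output.severity === "high") {
      result.escalated.push(output.item);
    }
  }
  return result;
}
```

Swapping Monday's fifty files for Friday's changes only the `items` argument. Which items get escalated or rejected is decided by the same two branches each time.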

Adversarial Verification Is the Killer Feature

The workflows model lets you do something a single session cannot honestly do: have independent agents adversarially review each other’s findings before they are reported.

Think about why a single conversation is bad at checking its own work. It already produced the answer. It has every reason — mechanically, in how attention weights the tokens already on the page — to be consistent with what it just said. Asking it to find its own errors is asking it to argue against its own context.

A separate agent has no such loyalty. It gets a fresh context, the prompt you pass, and one job: find the holes. The script wires the producer’s structured output into the verifier’s input, collects both, and only then surfaces a result. That is a real audit, not a model grading its own homework. For anything where being wrong is expensive — a security pass, a financial reconciliation, a research claim that is going to get quoted — that second adversarial context is the difference between a draft and something you would put your name on.
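The wiring described above might look something like the following. It is a sketch under stated assumptions: `Producer` and `Verifier` are hypothetical function types standing in for two separate agent sessions, and the `ProducedClaim` and `AuditVerdict` shapes are invented for illustration.

```typescript
// Hypothetical structured output from the producing agent.
interface ProducedClaim {
  id: string;
  statement: string;
  evidence: string[];
}

// Hypothetical verdict from the verifying agent.
interface AuditVerdict {
  claimId: string;
  holds: boolean;
  objection?: string;
}

// Both roles are injected functions standing in for separate agent sessions.
// The verifier receives only the structured claim, none of the producer's context.
type Producer = (task: string) => Promise<ProducedClaim[]>;
type Verifier = (claim: ProducedClaim) => Promise<AuditVerdict>;

interface AuditedReport {
  confirmed: ProducedClaim[];
  disputed: { claim: ProducedClaim; objection: string }[];
}

async function produceThenAudit(
  task: string,
  produce: Producer,
  verify: Verifier,
): Promise<AuditedReport> {
  const claims = await produce(task);
  // Each claim gets its own independent check, run in parallel.
  const verdicts = await Promise.all(claims.map((claim) => verify(claim)));
  const report: AuditedReport = { confirmed: [], disputed: [] };
  claims.forEach((claim, i) => {
    const verdict = verdicts[i];
    if (verdict.holds) {
      report.confirmed.push(claim);
    } else {
      report.disputed.push({ claim, objection: verdict.objection ?? "no reason given" });
    }
  });
  return report; // nothing is surfaced until every claim has been checked
}
```

The design choice worth noting is that disputed claims are kept, with the objection attached, rather than silently dropped. A reader of the report can then see what the verifier pushed back on.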

Which Problems Are Just Prompts

Here is where the skill lives, because orchestration is not free and reaching for it reflexively is its own failure.

The same Anthropic analysis is blunt about the cost. Multi-agent systems burn about 15× the tokens of a chat — even a single agent runs roughly 4× a normal conversation. So the economics only work on “valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.” They explicitly call out where it does not fit: most coding tasks, where the subtasks are not really parallelizable and the agents have tight dependencies on each other’s output. If step two needs step one’s result, you do not have a fan-out. You have a sequence, and a sequence belongs in one context.
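The multipliers turn into a quick back-of-the-envelope check. The sketch below uses only the 4× and 15× figures quoted above. The function names, the chat baseline, the per-token price, and the task value are all hypothetical inputs chosen for illustration.

```typescript
// Rough multipliers quoted in the text, relative to a plain chat interaction.
const SINGLE_AGENT_MULTIPLIER = 4;
const MULTI_AGENT_MULTIPLIER = 15;

interface TokenEstimate {
  chat: number;
  singleAgent: number;
  multiAgent: number;
}

// chatTokens is a hypothetical baseline: what the task would cost as one chat.
function estimateTokens(chatTokens: number): TokenEstimate {
  return {
    chat: chatTokens,
    singleAgent: chatTokens * SINGLE_AGENT_MULTIPLIER,
    multiAgent: chatTokens * MULTI_AGENT_MULTIPLIER,
  };
}

// Orchestration pays for itself only if the task's value covers the multi-agent bill.
// pricePerToken and taskValue are assumed inputs in the same currency.
function worthOrchestrating(chatTokens: number, pricePerToken: number, taskValue: number): boolean {
  return taskValue > estimateTokens(chatTokens).multiAgent * pricePerToken;
}
```

For a task that would take 10,000 tokens as a chat, this estimates 40,000 tokens for a single agent and 150,000 for a multi-agent run. That gap is why the value of the task has to be part of the decision.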

The cautionary tale is theirs too. Early versions of their agents would “spawn 50 subagents for simple queries” — fan-out applied to a problem that never needed it, paying 15× for an answer one prompt would have produced. That is the tell. When you find yourself orchestrating something a single well-written prompt would have handled, you have mistaken a prompt for a pipeline.

So the decision rule is small and concrete. Reach for orchestration when the work is wide (many independent items, fifty files, a whole search surface), when it exceeds what one context can hold, or when you genuinely need one agent to check another. Stay in a single session when the work is narrow, sequential, or dependency-heavy. The deterministic script does not make a small problem better. It makes a big problem possible.
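The rule is small enough to write down as a function. This is one possible reading, not a definitive policy: the `TaskShape` fields are hypothetical, `wideThreshold` is an arbitrary illustrative cutoff, and the order of the checks is an assumption, since the text does not say which condition wins when they conflict.

```typescript
// Hypothetical description of a task; every field is an illustrative assumption.
interface TaskShape {
  independentItems: number; // items that can be processed without each other
  estimatedTokens: number; // rough size of everything the task must read
  contextWindowTokens: number; // capacity of one session
  stepsDependOnEachOther: boolean;
  needsIndependentCheck: boolean;
}

type Approach = "single-session" | "orchestrate";

// wideThreshold is an arbitrary cutoff for "wide", not a figure from the text.
function chooseApproach(task: TaskShape, wideThreshold = 10): Approach {
  // A genuine need for one agent to check another requires a second context.
  if (task.needsIndependentCheck) return "orchestrate";
  // A sequence belongs in one context (assumed to take precedence over size).
  if (task.stepsDependOnEachOther) return "single-session";
  // The work exceeds what one context can hold.
  if (task.estimatedTokens > task.contextWindowTokens) return "orchestrate";
  // The work is wide: many independent items.
  if (task.independentItems >= wideThreshold) return "orchestrate";
  return "single-session";
}
```

Under these assumptions, fifty independent files come back as `"orchestrate"`, and a short chain of dependent steps comes back as `"single-session"`.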

The Real Boundary

The platform-engineering numbers tell you orchestration is becoming an infra-team concern rather than a novelty. But the adoption stat is not the point. The point is the boundary it forces people to learn.

The valuable judgment is no longer “can I write a good prompt.” It is “is this an orchestration problem or a prompt problem.” Get that wrong in the expensive direction and you pay 15× for nothing. Get it wrong in the cheap direction and you stuff fifty files into one window and watch the model lose the thread halfway through.

The script holds the loop. The subagents hold the parallel work. Your context holds only the answer. Knowing when that shape is worth building is the actual skill — and it is the one worth getting right, because the problems that need it are exactly the ones too big to fix any other way.