Six frontier coding models now score within 0.8 points of each other on SWE-bench Verified. The same model wrapped in different agent frameworks swings almost ten points on SWE-bench Pro. Picking an agent platform based on which model it runs misses where the real performance differences come from.
The useful shift in framing came from Birgitta Böckeler at Thoughtworks, writing on martinfowler.com: Agent = Model + Scaffolding. The scaffolding is everything that isn't the model — the tool definitions, the context compaction, the error recovery logic, the feedback sensors, the system prompt, the memory between sessions. That's the layer where most of the variance in real-world agent performance actually lives.
## The data
Particula Tech's analysis lines up four agent frameworks running the same model, Claude Opus 4.5, on SWE-bench Pro:
| Framework | Model | Score |
|---|---|---|
| SEAL (standardized scaffold) | Opus 4.5 | 45.9% |
| Cursor | Opus 4.5 | 50.2% |
| Auggie (Augment) | Opus 4.5 | 51.8% |
| Claude Code | Opus 4.5 | 55.4% |
Same weights, 9.5-point spread. And when Meta and Harvard's Confucius Code Agent ran Sonnet 4.5 with its own scaffold, it scored 52.7% — beating Opus 4.5 on Anthropic's stock framework at 52.0%. A cheaper model with better scaffolding beat the flagship on its vendor's own agent.
Single-variable changes produce similar results. Grok Code Fast went from 6.7% to 68.3% on coding benchmarks after changing only the edit tool format — same model, same prompts. LangChain's coding agent moved from 52.8% to 66.5% on Terminal Bench 2.0 by improving task decomposition and tool use, with no model swap. Adding WarpGrep as a specialized search subagent added 2.1 to 3.7 points across every model tested, while cutting cost 15.6% and runtime 28%.
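The writeups don't show the before-and-after tool definitions behind the Grok Code Fast jump, but the usual contrast in edit-tool design is between loose full-file rewrites and strict search/replace blocks. A minimal sketch of the stricter format, with hypothetical semantics: reject edits whose search text is missing or ambiguous, so the model gets clear failure feedback instead of a silently corrupted file.

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit block, the stricter of the two
    common edit-tool formats. Ambiguous or missing matches are
    rejected so the agent sees an explicit error it can recover from."""
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found; edit rejected")
    if count > 1:
        raise ValueError("search block matches more than once; edit rejected")
    return source.replace(search, replace, 1)
```

The rejection paths are the point: an edit format that fails loudly turns tool errors into recoverable feedback rather than corrupted state.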
## What Anthropic and OpenAI are publishing
Both labs have written up their own agent framework work in the past year, and neither piece is about model training.
Anthropic's "Effective harnesses for long-running agents" describes a two-part pattern for agents working across many context windows. An initializer agent writes a feature list as structured JSON, an init.sh script, and a progress file. A coding agent reads the progress file, picks one feature, commits its work, and updates progress. Same model in both roles — the difference is what the scaffolding makes visible between sessions.
The observation that drove this: Opus 4.5 running on the Claude Agent SDK in a loop, given the prompt "build a clone of claude.ai," would run out of context mid-implementation, then declare victory too early in a later session because the environment looked mostly done. The fix wasn't a better model. It was a scaffolding pattern that handed off state correctly.
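The handoff pattern can be sketched in a few lines. The file names (`features.json`, `progress.json`) and the `implement` callback are illustrative assumptions; the source specifies only a structured JSON feature list, an `init.sh`, and a progress file.

```python
import json
from pathlib import Path

def initialize(workdir: Path, features: list[str]) -> None:
    """Initializer agent role: write the feature list and an empty
    progress file that later sessions will read."""
    (workdir / "features.json").write_text(json.dumps(
        [{"id": i, "desc": d} for i, d in enumerate(features)]))
    (workdir / "progress.json").write_text(
        json.dumps({"completed": [], "notes": []}))

def coding_session(workdir: Path, implement) -> bool:
    """Coding agent role: read progress, pick ONE unfinished feature,
    do the work, then persist updated state for the next context
    window. Returns False when nothing is left to do."""
    features = json.loads((workdir / "features.json").read_text())
    progress = json.loads((workdir / "progress.json").read_text())
    todo = [f for f in features if f["id"] not in progress["completed"]]
    if not todo:
        return False
    feature = todo[0]
    note = implement(feature["desc"])   # stand-in for the model's actual work
    progress["completed"].append(feature["id"])
    progress["notes"].append(note)      # state handed to the next session
    (workdir / "progress.json").write_text(json.dumps(progress))
    return True
```

Driving this with `while coding_session(workdir, implement): ...` gives the one-feature-per-session loop: each session sees exactly which features remain, which prevents both the mid-implementation context exhaustion and the premature victory declaration.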
OpenAI's writeup on agent-first engineering goes further. Three engineers shipped a million lines of code across 1,500 PRs in five months with no manually-written code. Codex wrote the application logic, the tests, the CI configuration, the docs. The humans' job was to design the environment — wiring Chrome DevTools into the agent runtime so Codex could drive its own UI, exposing logs via LogQL and metrics via PromQL, making the app bootable per git worktree so each task ran on an isolated instance.
Their framing: "Our most difficult challenges now center on designing environments, feedback loops, and control systems." That's the team at the company that trains the models saying the models aren't the bottleneck.
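The per-worktree isolation step can be sketched as a small shell helper. The function name and branch scheme are assumptions; only the one-worktree-per-task idea comes from the OpenAI writeup.

```shell
# Sketch of per-task isolation: each agent task gets its own git
# worktree on its own branch, so parallel tasks never share a checkout.
# Function name and branch naming are illustrative assumptions.
spawn_task_worktree() {
  task_id="$1"
  git worktree add "../task-$task_id" -b "agent/task-$task_id"
  # In the real harness you would now boot the app inside
  # ../task-$task_id (e.g. on a port derived from $task_id) so each
  # task drives an isolated, inspectable instance.
}
```

The payoff is the same as in the writeup: the agent can break its own instance without taking down anyone else's, and feedback from one task never bleeds into another.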
## Where the cycles should go
For teams actually building agents, the practical implication is that model upgrades at the frontier buy roughly one point on benchmarks, while scaffolding improvements can buy twenty or more. The components with measurable impact are fairly consistent across the published case studies:
- Tool orchestration: how tools get discovered and composed.
- Context management: compact aggressively, keep recent tool outputs intact, persist structured state between sessions.
- Error recovery: the gap between 42% and 78% on SWE-bench is largely recovery from mistakes rather than fewer mistakes.
- Deterministic feedback sensors: linters and structural tests that run on every change.
- Planning-execution separation: one agent decides, another executes.
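A deterministic sensor layer can be sketched as a wrapper that runs check commands after every edit and returns structured results the scaffold feeds back into the model's context. The function and result shape are illustrative, not taken from any of the cited frameworks.

```python
import subprocess

def run_sensors(paths, sensors):
    """Deterministic feedback after every agent edit: run each sensor
    command (linter, type checker, structural test) over the changed
    files and collect structured pass/fail results. Unlike the model's
    own judgment, the same change always produces the same verdict."""
    results = []
    for name, cmd in sensors.items():
        proc = subprocess.run(cmd + list(paths),
                              capture_output=True, text=True)
        results.append({
            "sensor": name,
            "ok": proc.returncode == 0,
            "output": (proc.stdout + proc.stderr)[-2000:],  # cap context cost
        })
    return results
```

Because the sensors run on every change rather than on request, the agent never has to decide whether to check its work, which is exactly the kind of error-recovery signal the case studies credit for most of the benchmark gap.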
None of this requires a frontier model. A well-scaffolded Sonnet beats a poorly-scaffolded Opus, at about a fifth the cost.
The model-as-differentiator era appears to be ending. Frontier capability has converged within a point or two, and the interesting engineering — the work that actually moves benchmarks and production reliability — is happening in the scaffolding around the model.
Sources:
- Böckeler, Birgitta. "Harness engineering for coding agent users." martinfowler.com, April 2026.
- Mondragon, Sebastian. "Agent Scaffolding Beats Model Upgrades: 42% to 78% on SWE-Bench." Particula Tech, March 2026.
- Anthropic. "Effective harnesses for long-running agents." November 2025.
- Lopopolo, Ryan. "Harness engineering: leveraging Codex in an agent-first world." OpenAI, February 2026.