The Model Doesn't Matter. The Harness Does.
Six frontier coding models now score within 0.8 percentage points of each other on SWE-bench Verified. Yet the same model, wrapped in different agent frameworks, swings by almost ten points on SWE-bench Pro. Picking an agent platform based on which model it runs misses where the real performance differences come from.
The