The first time you trim an eval because it costs too much to run, you have started letting the pricing model design your agent.
The problem with token pricing for AI is not magnitude — tokens aren't expensive in aggregate, by knowledge-work standards. The problem is that the unit of charge sits at the wrong decision point. Itemized per-call billing puts a meter on every supervisory action an engineer might take. Bundled billing would hide the same cost. The compute is identical either way. The pricing UX changes the engineer's decision.
The concrete shape
Ask a frontier model to find the bug in a 600-line migration. With extended reasoning enabled, it generates roughly 8,000 tokens of internal deliberation and outputs 200 tokens of fix. The bill is 8,200 tokens. The work delivered is the fix. The other 8,000 are the model thinking on the way there.
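The arithmetic above can be written out as a back-of-the-envelope calculation. The per-token price here is an illustrative placeholder, not any provider's actual rate:

```python
# Back-of-the-envelope bill for one debugging call.
# The price is an illustrative assumption, not a real provider's rate.
PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # $15 per million tokens (assumed)

reasoning_tokens = 8_000   # internal deliberation, billed like output
fix_tokens = 200           # the fix the customer actually wanted

billed_tokens = reasoning_tokens + fix_tokens
bill = billed_tokens * PRICE_PER_OUTPUT_TOKEN

print(billed_tokens)               # 8200
print(f"${bill:.3f}")              # $0.123
print(fix_tokens / billed_tokens)  # the fraction of the bill that is the deliverable
```

The last line is the point: the deliverable is about 2% of what the meter counted, and the other 98% is deliberation the customer pays for at the same rate.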
The reasoning tokens often produce better answers. The customer is getting real value from the deliberation. What goes wrong is variance absorption: linguistic length and work delivered are loosely correlated, and the customer covers the gap on every call. Provider revenue tracks linguistic length. The work customers actually want done varies on a different axis.
The structural distortion
The pricing UX is itemized. Every call has a meter. The engineer who wires in an eval suite, a dry-run check, an output schema validator, a proof-chain log — each of those is a per-call decision with a visible cost. The team that takes supervision seriously pays the highest bill.
Per-task or per-seat pricing bundles the same costs. The eval, the schema check, the verification step sit inside the contract. The engineer doesn't see a per-call charge when running them. The compute happens. The decision the engineer faces is different.
The pricing model taxes the practice. Whatever you build, you build against an economic gradient that pulls toward fewer assertions, shorter prompts, less verification.
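The gradient can be made concrete with a toy model. All token counts and the price are assumptions for illustration; the supervisory steps mirror the stack named above:

```python
# Toy model of the itemized gradient: each supervisory step is itself a
# model call with a visible per-call token cost. All numbers are assumed.
PRICE = 10 / 1_000_000  # $/token, illustrative

task_call = 2_500  # tokens for the task itself
supervision = {    # hypothetical supervisory stack, tokens per run
    "eval_suite": 40_000,
    "dry_run_check": 3_000,
    "schema_validator": 800,
    "proof_chain_log": 1_500,
}

bare_cost = task_call * PRICE
supervised_cost = (task_call + sum(supervision.values())) * PRICE

# Itemized billing makes the supervised team's premium visible on every
# call; bundled billing runs the same compute at zero marginal decision cost.
print(f"bare:       ${bare_cost:.4f}")
print(f"supervised: ${supervised_cost:.4f}")
print(f"premium:    {supervised_cost / bare_cost:.1f}x")
```

Under these assumed numbers the careful team's per-call cost is roughly 19x the careless team's, and that multiplier is what the meter shows the engineer at every decision point.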
Last week I argued tests are the substrate underneath dependable AI. Every assertion is a model call. Every eval run is a stack of tokens. The pricing UX selects against the practice the engineering thesis selects for, on the same project, at the same decision point.
A historical note
Pulp-era writers were paid per published word, which produced bloat — padding that earned more because it was longer. Token pricing is not strictly the same; the extra tokens often improve the output. The shared shape is variance absorption: the unit of charge is linguistic length, and the customer absorbs whatever ratio the producer's output has to the work the customer wanted done. Every prior generation of knowledge work tried alternatives to input metering. Most kept some form of it — law is still mostly hourly, alternative fee arrangements have a thirty-year track record of not sticking, capitation and fixed-bid both have known perverse-incentive failure modes. The alternatives that took root tied price to something closer to the work than to the linguistic output.
Why outcome pricing is hard
The natural reply is: price by outcome, not by token. The direction is right. The reasons it has not happened are also real.
Outcome pricing requires a contract on what counts as a successful outcome. For varied agent work — research, code, design, customer support — acceptance criteria differ per task. Providers do not want to absorb the variance of "did the customer think this was good." Customers do not want to define acceptance criteria for every call. Capitation, fixed-bid, and performance-based pricing all have known perverse-incentive failure modes that have shown up across every prior industry that tried them.
Token pricing is what you get when neither party wants to specify the contract. The provider bills compute because that is what costs them to serve. Customers pay compute because that is what is offered. The question of whether the work got done sits between them with neither party on the hook.
Wrapping, not replacement
Tokens probably won't disappear at the wholesale layer. They are what compute actually costs; providers will keep billing what they spend. What does change is the unit of contract between provider and customer. AWS still bills compute-seconds under everything, but most enterprise customers never see a compute-second on an invoice — they see a managed-service contract that wraps compute-seconds inside a per-task or per-resource price. The provider absorbs the variance.
The AI market is heading the same way. Anthropic's Claude Code Max sits above per-token consumption and caps the variable cost. GitHub Copilot is per-seat. Cursor is tiered per-seat with consumption guardrails. Cognition's Devin charges in Agent Compute Units — a consumption budget abstracted above tokens. None of these are outcome pricing. They are wrapping. The customer no longer sees the token on the invoice. The wholesale unit underneath is still the token.
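The variance transfer that wrapping performs can be sketched with a toy simulation. Reasoning length varies widely call to call; a metered bill inherits that variance, while a flat wrapped plan moves it to the provider. Every number here is an illustrative assumption:

```python
# Toy simulation: monthly bill variance under metered vs wrapped pricing.
# Per-call token counts, the metered rate, and the flat fee are all assumed.
import random
import statistics

random.seed(0)
PRICE = 10 / 1_000_000   # $/token, metered rate (assumed)
FLAT = 200.0             # $/month wrapped plan (assumed)

# Twelve months of 500 calls each, with call length varying widely.
monthly_bills_metered = []
for _ in range(12):
    calls = [random.randint(1_000, 60_000) for _ in range(500)]
    monthly_bills_metered.append(sum(calls) * PRICE)

print(statistics.stdev(monthly_bills_metered))  # the customer absorbs this
print(statistics.stdev([FLAT] * 12))            # 0.0: the provider absorbs it
```

The compute underneath the two plans is identical; only the party holding the variance changes, which is the entire content of the wrapping move.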
Who defects first matters. Token pricing is profitable and the variance risk sits with the customer; established providers have little reason to move first on margin alone. The historical move off per-word pricing in journalism came through labor organization and professionalization, not customer dissatisfaction. The analogous mover here is most likely a new entrant willing to absorb variance in exchange for share, or an established provider under margin pressure packaging tokens behind a contract to differentiate. Whoever moves first sets the wrapping standard. The token persists; the bill changes shape.
Build the artifacts anyway
If you are running a serious AI stack today, the pricing UX is something you absorb whether you notice it or not. The practical move is to build the artifacts anyway. The supervisory stack — tests, evals, dry-runs, proof chains — pays for itself in incidents avoided. The bill is real. The engineering is right.
The harder move is to notice when the gradient is pulling against you. Every team that builds disciplined supervision on top of a per-token API also builds ongoing pressure to skip the supervision when the token bill arrives. The per-call habit of asking "can we afford to run this one?" every time the meter clicks is the habit that hands the pricing model authorship of your agent.
Build the artifacts anyway. Notice when the gradient is pulling against you. Push for contracts that hide the per-call meter, so the engineer's daily decision is engineering, not accounting.