Cheaper tokens, bigger bills
Why your AI bill goes up as prices come down, and what Uber's blown budget actually tells us.
Last week I argued that the model is becoming a commodity: the price of a fixed capability is collapsing toward the cost of the compute underneath it, even as the frontier stays expensive. I stand by that. But it left out half the story, and this week the missing half showed up in the news with a name attached.
Uber burned through its entire 2026 AI coding-tools budget in four months.
Not its AI budget writ large. Its coding-tools budget, the Claude Code line specifically, gone by April. The detail that makes it worth writing about is how: Uber had stood up an internal leaderboard ranking teams by total AI tool usage. They gamified consumption. They got exactly what they incentivized, which was consumption, and then the COO went on a podcast and said the quiet part out loud. He said it was "very hard to draw a line" between all that usage and useful features actually shipped to customers. "That link," he said, "is not there yet."
Microsoft, in the same window, reportedly started canceling most of its direct Claude Code licenses and pushing engineers toward GitHub Copilot CLI.
If you only read the last post, these stories look like a contradiction. Tokens are getting cheaper by the month, so how does a company blow its annual budget in a third of a year? The answer is the part I underweighted, and it's important enough to deserve its own piece.
Two things are falling and rising at the same time
Here is the trap. The price per token and the total bill are moving in opposite directions, and almost everyone conflates them.
Per-token price is falling, hard, just as I described. Gartner projects that by 2030, inference on a 1-trillion-parameter model will cost providers more than 90% less than it did in 2025. Epoch AI's research puts the per-capability price decline at between 9x and 900x per year, depending on the benchmark and score level. That's real.
But total enterprise spend is climbing, because consumption is rising faster than price is falling. Goldman Sachs forecasts a 24-fold increase in token consumption by 2030, reaching 120 quadrillion tokens per month, driven primarily by enterprise-grade agents. And the driver isn't just more adoption; it's the shape of the new workloads. Agentic AI, the long-horizon, multi-step, tool-using kind that I keep insisting is the future, consumes 5 to 30 times more tokens per task than a single-shot chatbot query. Every reasoning step, every tool call, every self-check, every retry burns tokens. You moved from asking the model a question to deploying the model as a worker that thinks for thirty minutes, and the meter reflects it.

That's the scissors. The unit gets cheaper, the usage explodes, and the bill is the product of the two. Gartner's Will Sommer put it more sharply than I could: "CPOs should not confuse deflation in commodity tokens with democratization of frontier reasoning." Cheap tokens are not the same thing as cheap AI.
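To make the scissors concrete, here is a back-of-the-envelope sketch in Python. It pairs two figures quoted above, a 90% drop in unit price and a 24-fold rise in consumption. That pairing is my own illustrative assumption: the Gartner number is a provider-cost projection for one model class and the Goldman number is aggregate consumption, so treat the result as arithmetic, not a forecast. The function names are made up for this example.

```python
def bill_multiplier(price_drop_fraction: float, usage_growth_factor: float) -> float:
    """Return new_bill / old_bill when the unit price falls and usage grows.

    bill = price_per_token * tokens_used, so the ratio of bills is the
    product of the two ratios.
    """
    remaining_price = 1.0 - price_drop_fraction
    return remaining_price * usage_growth_factor


def break_even_growth(price_drop_fraction: float) -> float:
    """Usage growth at which the total bill stays exactly flat."""
    return 1.0 / (1.0 - price_drop_fraction)


# Illustrative inputs only: 90% cheaper tokens, 24x more of them.
print(f"bill multiplier: {bill_multiplier(0.90, 24):.1f}x")      # 2.4x
print(f"break-even usage growth: {break_even_growth(0.90):.0f}x")  # 10x
```

Under those assumptions a 90% price cut only holds the bill flat if usage grows by no more than 10x. At 24x, the bill is roughly 2.4 times larger even though every token is a tenth of the price.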
This is not the "AI doesn't work" story, even though it'll be sold as one
A lot of people are going to pick up the Uber headline and run it as proof that the whole thing is a bubble, that agentic AI doesn't deliver, that the emperor has no clothes. I want to be careful here, because that reading is wrong, and the reason it's wrong is the whole point.
Uber's problem was not that Claude Code doesn't work. Their own CEO said on the Q1 2026 earnings call that about 10% of committed code is now built by autonomous agents. The capability is real and it's being used at scale. Their problem was that they scaled production before they built the apparatus to measure whether the production was worth it. They incentivized usage as if usage were the achievement. They optimized the wrong variable.
If you've read my working paper, this should sound familiar, because it's the same inversion I keep coming back to, just seen from the budget side instead of the build-pipeline side. The entire argument of Orchestrated Engineering is that when implementation cost collapses, the bottleneck moves somewhere else: to verification, to governance, to orchestration. Uber removed the implementation constraint and immediately slammed into a different one. They could produce an enormous amount. They could not tell you whether what they produced was good, or worth the spend. When a COO says it is "very hard to draw a line" between the usage and the features shipped, that is not a cost complaint. It's a verification complaint wearing a cost complaint's clothes.
The constraint was never the cost of producing code. Treating cheap production as its own reward just reproduces the oldest mistake in the field: the assumption that output volume is the measure of progress. It wasn't true when we measured engineers by lines of code, and it isn't true when we measure them by tokens consumed.
The tell is the leaderboard
I keep coming back to the leaderboard, because it's such a perfect artifact of the confusion.
Ranking teams by token usage is the 2026 version of ranking developers by lines of code written. It feels like measuring productivity. It's actually measuring cost. A team at the top of that leaderboard isn't necessarily shipping more value; it's necessarily spending more money. You have built an incentive structure that rewards your people for running up the bill, and then you're surprised when the bill runs up.
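Here is a small sketch of how the same data produces opposite leaderboards depending on what you rank by. The teams, the dollar figures, and the use of "features shipped" as the value proxy are all hypothetical. A feature count is itself a crude measure, and I use it only to show the ordering flip.

```python
from dataclasses import dataclass


@dataclass
class TeamUsage:
    name: str
    spend_usd: float        # hypothetical AI tool spend for the period
    features_shipped: int   # hypothetical, crude proxy for value delivered


def cost_per_feature(team: TeamUsage) -> float:
    """Spend per shipped feature; infinite if nothing shipped."""
    if team.features_shipped == 0:
        return float("inf")
    return team.spend_usd / team.features_shipped


def rank_by_spend(teams: list[TeamUsage]) -> list[str]:
    """The consumption leaderboard: biggest spender first."""
    return [t.name for t in sorted(teams, key=lambda t: t.spend_usd, reverse=True)]


def rank_by_cost_per_feature(teams: list[TeamUsage]) -> list[str]:
    """An outcome-oriented ranking: cheapest shipped feature first."""
    return [t.name for t in sorted(teams, key=cost_per_feature)]


teams = [
    TeamUsage("alpha", spend_usd=90_000, features_shipped=3),
    TeamUsage("beta", spend_usd=40_000, features_shipped=8),
    TeamUsage("gamma", spend_usd=15_000, features_shipped=5),
]

print(rank_by_spend(teams))             # ['alpha', 'beta', 'gamma']
print(rank_by_cost_per_feature(teams))  # ['gamma', 'beta', 'alpha']
```

In this made-up data the team at the top of the usage board is at the bottom of the outcome board. Nothing about the first ranking would have told you that.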
The mitigation isn't to stop using the tools. It's to measure the right thing. Instrument the relationship between work produced and value shipped, not the volume of tokens consumed. Route work to the cheapest tier that actually solves it, because most tasks don't need the frontier and the good-enough model is sitting right there at a fraction of the price. Set budgets tied to outcomes rather than usage targets. And whatever you do, don't gamify consumption. Usage is a cost, not an accomplishment.
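One way to picture "route work to the cheapest tier that actually solves it" is the sketch below. Everything in it is an assumption for illustration: the tier names and prices are invented rather than taken from any vendor's rate card, and `toy_solves` stands in for whatever verification step you trust, such as tests passing or a reviewer's sign-off.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, not a real rate card


def route_to_cheapest(
    task: str,
    tiers: Sequence[Tier],
    solves: Callable[[Tier, str], bool],
) -> Optional[Tier]:
    """Try tiers from cheapest to most expensive; return the first that passes.

    Returns None if no tier passes the check.
    """
    for tier in sorted(tiers, key=lambda t: t.usd_per_million_tokens):
        if solves(tier, task):
            return tier
    return None


# Hypothetical tiers and a stand-in verification check.
TIERS = [Tier("frontier", 15.0), Tier("mid", 3.0), Tier("small", 0.5)]
HARD_TASKS = {"redesign the billing service"}


def toy_solves(tier: Tier, task: str) -> bool:
    """Pretend only the frontier tier can handle the hard tasks."""
    if task in HARD_TASKS:
        return tier.name == "frontier"
    return True


print(route_to_cheapest("rename a variable", TIERS, toy_solves).name)             # small
print(route_to_cheapest("redesign the billing service", TIERS, toy_solves).name)  # frontier
```

The design choice worth noticing is that the router is only as good as the `solves` check. Failed attempts on cheaper tiers also cost tokens, so a real version would need to count them. The routing problem is, once again, a verification problem.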
Why this matters beyond the headline
Gartner forecasts that more than 40% of agentic AI projects will be canceled by the end of 2027, on account of escalating costs, unclear business value, and inadequate risk controls. When I first wrote that down it was a projection, a number from a research firm. As of this month it has named instances. Uber and Microsoft are early entries in the 40%. Not because their engineers are bad or the tools are fake, but because they did the thing the forecast describes: they let costs escalate while the business value stayed unclear, because they hadn't built the discipline to make it clear.
That discipline has a name, and it's the thing I'm trying to describe with all of this. Cost-value measurement isn't a finance function you bolt on after the fact. It's part of verification. Knowing whether the work was worth it is not separable from knowing whether the work was correct. Both are the new job. Both are what survives after the model gets cheap.
The model got cheap. The bill went up anyway. The companies that figure out why, and rebuild around it, are the 60%. The ones that keep score by the meter are the 40%.
This extends the argument in Software Engineering 3.0, my working paper on how implementation abundance restructures software engineering. Section 9.8 covers cost-value decoupling as a named failure mode if you want the formal version.