Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. The providers of AI, such as OpenAI, Google, and Anthropic, have price lists, but none of those listed prices tell users what the final bill will be to actually solve a problem. The result, according to a new study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring and unpredictable costs of agents.
The study, by lead author Longju Bai of Michigan and collaborators at Stanford University, All Hands AI, Google's DeepMind unit, Microsoft, and MIT, is titled "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks." The authors describe it as "the first systematic study on AI Agent token consumption." The study was posted on the arXiv pre-print server. It includes as an author prominent Stanford economist Erik Brynjolfsson, who has commented extensively on AI's impact on productivity.
The token cost explosion
The top-level finding is that agents consume orders of magnitude more tokens than turn-by-turn, simple, prompt-based chats. Think 3,500 times as many tokens for an agent as for a round of prompts with ChatGPT. A token is the fundamental unit of information processed by an AI model. It could be a piece of a word, a whole word, or just a punctuation mark, depending on how a model chops data into pieces. While one might expect agents to cost more in tokens, the study reveals more alarming facts: Two different models can have wildly different token costs for the same task. And the same model can have different costs each time it works on the same problem, using as many as twice the number of tokens on one occasion as on another.
The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task. "Agentic tasks are uniquely expensive," they wrote. More tokens don't necessarily improve results. "Simply scaling token usage may not lead to higher execution performance," they wrote, and, "Models systematically underestimate the tokens they need." The rising cost and the uncertainty of success are in no way accounted for in today's price lists from OpenAI and others. The work suggests there is no easy fix to the matter. The best users can do is to set hard limits on agentic computer use, possibly causing agents to halt before completing tasks.
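A hard limit of the kind described above can be sketched in a few lines of Python. This is a minimal illustration, not how OpenHands or any vendor implements it: the `TokenBudget` class, the `run_agent` function, and all token counts are hypothetical, and the list of per-step costs stands in for a real agent loop that reports usage after each step.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the next step would push a run past its hard token cap."""


@dataclass
class TokenBudget:
    limit: int      # hard cap on total tokens for one task (hypothetical value)
    used: int = 0   # tokens charged so far

    def charge(self, tokens: int) -> None:
        # Refuse the step if it would take usage past the cap.
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{self.used + tokens} tokens would exceed limit {self.limit}"
            )
        self.used += tokens


def run_agent(step_costs, budget):
    """Run steps until the task finishes or the budget halts it.

    step_costs is a stand-in for a real agent loop: each item is the
    token count of one step (made-up numbers, not measurements).
    """
    completed = 0
    for cost in step_costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            # The agent stops here, possibly before the task is solved.
            return {"finished": False, "steps": completed, "tokens": budget.used}
        completed += 1
    return {"finished": True, "steps": completed, "tokens": budget.used}


result = run_agent([40_000, 55_000, 70_000, 90_000], TokenBudget(limit=200_000))
print(result)
```

With these made-up numbers, the fourth step is refused and the run halts after three steps, which is exactly the trade-off the study points to: the bill is capped, but the task may be left unfinished.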
Counting token costs
To study costs, Bai and team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and collaborating institutions. They used OpenHands to build agents, which they then tested on the open-source coding benchmark SWE-Bench, whose tasks are taken from actual GitHub issues. They first established the relative strengths of models: OpenAI's ChatGPT 5 and 5.2 achieved strong accuracy at low cost, though they were not the most accurate. Anthropic's Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google's Gemini-3-Pro was somewhere in the middle. And the Kimi-K2 model from Chinese AI lab Moonshot may have the worst relative mix: the most tokens to achieve the lowest accuracy.
The authors attributed the difference in tokens to how each model behaves rather than to the tasks themselves: "The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more expensive for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem." But the issue is not simply one of better or worse models, because even the same model can take twice as many tokens to solve the same problem from one run to the next. "The most expensive runs double the token and monetary cost of the least expensive runs," they observed, "suggesting that the agent's token consumption has large variances even when working on exactly the same problem."
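To make the run-to-run spread concrete, here is a small Python sketch that summarizes repeated runs of one model on one task. The function name `cost_spread` and the token totals are hypothetical, chosen only so that the most expensive run costs twice the cheapest, the ratio the authors describe; they are not figures from the study.

```python
from statistics import mean


def cost_spread(run_tokens):
    """Summarize how much token use varies across repeated runs of one task."""
    cheapest, priciest = min(run_tokens), max(run_tokens)
    return {
        "min": cheapest,
        "max": priciest,
        "max_over_min": priciest / cheapest,  # 2.0 means the worst run cost double
        "mean": mean(run_tokens),
    }


# Hypothetical token totals for four runs of the same model on the same issue.
runs = [1_200_000, 1_500_000, 1_900_000, 2_400_000]
spread = cost_spread(runs)
print(spread)
```

A single run gives one draw from that range, so a planning estimate built on it could be off by a factor of two in either direction.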
The lesson is that more tokens don't necessarily get you better results. In fact, the authors found that work can get worse the longer an agent spends on a task. "Accuracy often peaks at intermediate cost and saturates at higher costs," they observed. "Agent behavior becomes increasingly unstable on more complex tasks." Many models seem to search and search to solve a problem even when it's fruitless. "Models lack a reliable mechanism to recognize when a task is unsolvable and stop early," wrote Bai and team. "Instead, they continue exploring, retrying, and re-reading context, accumulating cost without progress."
Unable to predict costs
Those factors make "token usage prediction and agent pricing a fundamentally challenging task," wrote Bai and team. And, in fact, the agent itself cannot reliably predict its own token use when asked to estimate it in advance. Bai and team asked each AI agent to predict its tokens using the prompt: "I've uploaded a python code repository in the directory example repo. You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description," followed by the problem description. What they found is that agents can approximate to a small degree how many tokens will be used, but their predictions tend to be too low. "Models consistently underestimate the tokens they need," wrote Bai and team. "The bias is especially pronounced for input tokens, whose predictions stay compressed even as real values grow into the millions."
Watch those inputs
That last point, about input tokens, has a special prominence in the report. Bai and team found that input tokens, such as what the human user types and what is retrieved via tools such as database searches, dominate the cost in tokens. Output tokens, which the model generates, are far less demanding. "Strikingly, input tokens, not output tokens, dominate the overall cost in agentic coding." The reason is that "agentic workflows accumulate the information from different sources and the same context gets fed into the models repeatedly." As a result, there is a "dramatically higher input/output ratio" for agentic AI than for single-prompt or multi-prompt AI sessions with a bot. And, drilling down even further, the largest share of that input is cache reads, the prior context the agent pulls back from memory at each step. "We find that cache reads dominate both raw token volume and dollar cost," Bai and team wrote. "In every phase, cache-read input tokens are the largest category by a wide margin, reflecting the cumulative reuse of prior context."
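The accumulation effect can be illustrated with a toy simulation in Python. It assumes a simplified agent loop in which the whole conversation so far is sent to the model on every turn, and everything the model has already seen counts as a cache read. The function `simulate_run` and the per-turn sizes are hypothetical, not taken from the study or from any provider's billing rules.

```python
def simulate_run(turns, system_prompt=2_000, tool_result=1_500, model_output=300):
    """Count tokens by category for a toy agent that resends its full context.

    All sizes are made-up illustration values, in tokens.
    """
    context = system_prompt   # everything the model is shown on the next turn
    cached_prefix = 0         # portion of that context seen on an earlier turn
    cache_read = fresh_input = output = 0

    for _ in range(turns):
        cache_read += cached_prefix             # old context is read again
        fresh_input += context - cached_prefix  # only the new part is fresh
        output += model_output
        cached_prefix = context
        # The model's reply and the next tool result join the context.
        context += model_output + tool_result

    return {"cache_read": cache_read, "fresh_input": fresh_input, "output": output}


short = simulate_run(turns=3)
long = simulate_run(turns=50)
print(short)
print(long)
```

In this toy model, fresh input and output grow in step with the number of turns, while cache reads grow roughly with the square of it, so over a long run the re-read context outweighs everything else.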
What can be done?
The authors don't have many suggestions. One proposal is that even if agents can't predict the number of tokens, they can make some guesses at a high level, a "coarse-grained" estimate for token cost. "This suggests that agent-driven estimation can potentially support early budget alerts before launching expensive runs, improving cost transparency without overpromising precise token-level accuracy," they wrote. Since input tokens are the biggest cost element, users should think carefully about what can be controlled at input. The size of prompts is one factor that drives input tokens higher. The context window used with an agent, wider or narrower, affects token count at input. And the number of tools called by the agent, such as databases, will bring lots more input tokens into play.
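One way to picture an early budget alert built on a coarse estimate is the Python sketch below. It is an assumption-laden illustration rather than the authors' method: the bucket boundaries, the `preflight` function, and the `safety_factor` that pads the estimate (because the study found estimates tend to run low) are all hypothetical.

```python
# Hypothetical bucket boundaries, in tokens (upper bound, label).
BUCKETS = [
    (100_000, "small"),
    (1_000_000, "medium"),
    (10_000_000, "large"),
]


def bucket(tokens):
    """Map a token count to a coarse size label."""
    for upper, label in BUCKETS:
        if tokens < upper:
            return label
    return "very large"


def preflight(estimated_tokens, budget_tokens, safety_factor=2.0):
    """Pad an agent's own estimate and flag runs likely to exceed the budget.

    safety_factor is an assumed padding, not a value from the study.
    """
    padded = int(estimated_tokens * safety_factor)
    return {
        "bucket": bucket(padded),
        "padded_estimate": padded,
        "alert": padded > budget_tokens,
    }


check = preflight(estimated_tokens=400_000, budget_tokens=600_000)
print(check)
```

Here the raw estimate fits within the budget, but the padded one does not, so the run is flagged before it starts. The label is deliberately coarse: it signals the rough scale of a run rather than promising a precise token count.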
There's only so much you can do as a user, however. Something more will have to be done on an industry-wide basis. The problems outlined are clearly those of a young industry, and one where vendors will have to be pushed by users to change practices. Today's information about what an agent might cost to do a task is far too vague for enterprises that need to be able to plan investments in software. The burden is pushed onto the user to run agentic tasks in an experimental capacity over and over in order to get anything like an average cost to use as an estimate for planning purposes. And the lack of guarantees of success, even after the agent burns through tokens, is the most glaring problem. That means enterprises could waste vast amounts of money just burning tokens. Users collectively are going to have to push back on vendors such as OpenAI, Google, and Anthropic and demand price transparency and some form of guarantee that a task will be completed, or else the entire exercise of agentic AI may be dominated by cost overruns and failed implementations.
Such deep problems are probably already being encountered by early adopters. They may be content to pay such a high cost to be among the first to get an agentic edge. It's not a situation, however, that can lead to stable, steady use of agentic AI. The industry is at a crossroads: either vendors provide clarity and accountability, or the promise of agentic AI will be undermined by economic unpredictability and waste.
Source: ZDNET News