What Is My OpenAI Bill Actually Buying?

In the late 1800s, railroad companies were among the most valuable businesses in America. Investors couldn't get enough of them. Newspapers tracked their expansion relentlessly, and executives proudly announced new routes stretching farther west each year. Growth was easy to understand because there was a simple metric everyone could point to: miles of track laid.

More track meant more progress.

At least, that's what people thought.

The problem was that laying railroad track and running a successful railroad were two very different things. Some companies became obsessed with expansion. They raced competitors into new territories, announced ambitious construction projects, and celebrated every additional mile completed. Looking at a map, it was easy to conclude they were winning.

Yet many of those same railroads struggled financially. The tracks existed. The capital had been invested. The growth stories sounded compelling. What mattered, however, wasn't how much track had been built. What mattered was whether those tracks were carrying passengers and freight in sufficient quantities to justify the cost.

Eventually investors started asking different questions. They became less interested in how much steel was in the ground and more interested in what was moving across it. A railroad's success wasn't determined by the size of its network alone. It was determined by whether that network created value.

The measurement everyone focused on turned out to be a proxy for value, not value itself.

I've been thinking about that story recently because many companies are approaching AI in a remarkably similar way. Most organizations know exactly what they spent on AI last month. They know how many tokens were consumed, which models were used, and what the final invoice looked like. If someone asks for the AI bill, the answer is usually available within minutes.

What is much harder to answer is what the organization received in return.

The invoice tells you how much AI was consumed. It doesn't tell you which teams generated the spend, which products relied on it, which workflows benefited from it, or whether any of it produced measurable business value. In many cases, it doesn't even tell you whether the system succeeded or failed.

A year ago, that distinction wasn't especially important. AI budgets were largely experimental. Teams were exploring use cases, testing ideas, and figuring out where the technology fit inside the organization. Experiments are allowed to be inefficient because learning itself is the goal.

Today, AI is increasingly becoming operational infrastructure. Engineering teams are building internal agents. Support organizations are automating ticket resolution. Finance departments are using AI for forecasting, reporting, and analysis. Product teams are embedding models directly into customer-facing experiences. As AI moves into operating budgets, the expectations around measurement begin to change.

Consider two AI agents handling customer support requests. One resolves 100 customer issues and leaves users satisfied. The other keeps failing on the same kinds of requests, triggers retries, escalates work unnecessarily, and never produces a useful outcome.

From the perspective of the invoice, those agents can look surprisingly similar.

Both consumed tokens. Both generated requests. Both contributed to monthly spend. Nothing on the invoice clearly distinguishes successful outcomes from unsuccessful ones.

That creates one of the largest blind spots in enterprise AI today. Failed outputs cost money. Retries cost money. Hallucinations cost money. Poorly designed workflows cost money. The invoice captures all of that activity, but activity and value are not the same thing.
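To make that blind spot concrete, here is a minimal Python sketch. It assumes every model call is logged with its cost and an outcome label, which is something your own instrumentation would have to supply because the invoice does not. The record fields (`agent`, `cost_usd`, `outcome`), the outcome labels, and the sample numbers are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    agent: str       # hypothetical agent name
    cost_usd: float  # cost of this call, as recorded by your own logging
    outcome: str     # "resolved", "retry", or "failed" (assumed labels)


def spend_by_outcome(records):
    """Sum spend per agent, split by outcome label."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        totals[r.agent][r.outcome] += r.cost_usd
    return {agent: dict(by_outcome) for agent, by_outcome in totals.items()}


# Made-up sample data: both agents cost the same in total.
records = (
    [CallRecord("agent_a", 0.5, "resolved")] * 100
    + [CallRecord("agent_b", 0.5, "retry")] * 60
    + [CallRecord("agent_b", 0.5, "failed")] * 40
)

breakdown = spend_by_outcome(records)
for agent, by_outcome in breakdown.items():
    total = sum(by_outcome.values())
    useful = by_outcome.get("resolved", 0.0)
    print(f"{agent}: total ${total:.2f}, spent on resolved work ${useful:.2f}")
```

In this made-up data the two agents have identical totals, which is all an invoice would show. The difference only appears once each call carries an outcome label.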

You can see this challenge appearing in software development as well. One engineering team might use AI coding tools to dramatically accelerate feature delivery, reduce development cycles, and ship products faster than before. Another team might deploy the exact same tools, incur similar costs, and experience only modest gains.

Looking at usage data, the teams appear comparable.

Looking at business outcomes, they are not.

That's why I suspect the most important AI metric over the next few years won't be tokens consumed. Tokens are useful for understanding utilization, just as miles of railroad track were useful for understanding expansion. Neither tells you whether the investment generated meaningful returns.

What organizations increasingly need to understand is output generated per dollar spent. How many support tickets were resolved? How many features shipped? How much revenue was influenced? How much analyst time was saved? How much work was actually completed because AI was involved?
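As a rough illustration, output per dollar is just completed units of work divided by the spend attributed to that work. The sketch below assumes you can already attribute spend to a team and count what it completed, and that attribution is usually the hard part. The team names and figures are invented for the example.

```python
from typing import Optional


def output_per_dollar(units_completed: int, spend_usd: float) -> float:
    """Units of completed work per dollar of AI spend."""
    if spend_usd <= 0:
        raise ValueError("spend_usd must be positive")
    return units_completed / spend_usd


def cost_per_unit(units_completed: int, spend_usd: float) -> Optional[float]:
    """Dollars per completed unit, or None when nothing was completed."""
    if units_completed == 0:
        return None
    return spend_usd / units_completed


# Hypothetical figures: similar spend, very different output.
teams = {
    "team_a": {"spend_usd": 2000.0, "features_shipped": 20},
    "team_b": {"spend_usd": 2000.0, "features_shipped": 5},
}

report = {
    name: {
        "features_per_1000_usd": 1000
        * output_per_dollar(t["features_shipped"], t["spend_usd"]),
        "usd_per_feature": cost_per_unit(t["features_shipped"], t["spend_usd"]),
    }
    for name, t in teams.items()
}

for name, row in report.items():
    print(name, row)
```

The same two functions work for tickets resolved, reports produced, or hours saved, as long as the unit is defined consistently. Comparing across different units, such as features against tickets, still takes judgment that no formula supplies.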

Those questions are significantly harder to answer than counting tokens. They're also the questions that determine whether AI becomes a lasting advantage or simply another line item on the budget.

The next generation of AI infrastructure probably won't focus on helping companies generate more usage. Most organizations already know how to generate usage. The more difficult challenge is understanding whether that usage is creating value.

The railroad companies that ultimately succeeded weren't necessarily the ones that laid the most track. They were the ones that figured out how to move the most value across the networks they built. At some point, investors stopped rewarding expansion for its own sake and started rewarding results.

AI is heading toward the same moment.

As the technology continues moving from experimentation into operations, the conversation inevitably shifts from activity to outcomes. And when it does, every CFO ends up asking the same question:

What did we actually get for last quarter's AI bill?