The Assumption That Makes Your OpenAI Bill Worse

Most engineering teams attack their OpenAI bill by rewriting prompts. They trim tokens, compress instructions, and shorten system messages. Sometimes it works a little. More often, it costs hours of engineering time and moves the bill by three percent. The real savings are almost never in the prompts.

They are in the infrastructure around the prompts. And to understand why, it helps to look at what happened in lower Manhattan in the fall of 1882.

Edison's Invisible Problem

When Thomas Edison switched on the Pearl Street Station on September 4, 1882, he created the first commercial central power station in the United States. Within weeks, fifty-nine customers in lower Manhattan were paying for electricity. And almost immediately, Edison noticed something strange: the bills were shocking, not because the rates were high, but because nobody could explain them.

Customers knew what they were paying. They did not know which devices were drawing the power. Gas lighting had trained people to think of energy as a fixed cost tied to a simple on-off decision. Electricity was different. Devices varied wildly in consumption. A motor left running in a back room could account for more than a lamp burning all night in the front. But you could not tell just by looking.

The breakthrough did not come from changing the devices. It came from instrumentation. Once Edison's engineers started measuring consumption at the device level rather than the building level, the wasteful patterns became obvious. Machines left running overnight. Redundant motors for tasks that needed one. Equipment cycling on and off in ways that consumed more power than running continuously would have.

The bill went down when visibility went up. That sequence matters.

Your OpenAI Bill Has the Same Structure

The modern AI cost problem maps almost exactly onto Edison's 1882 problem. You can see the invoice total. You almost certainly cannot see which workflows, which features, or which team's experiments are driving the variance. That gap between total and cause is where the real savings live.

Here are the four places teams consistently find meaningful reductions, none of which require touching a prompt.

Request Caching

In many production systems, a significant share of API calls are identical to calls made minutes or hours earlier. Classification tasks, document tagging, and any workflow that runs the same input through the same model will generate redundant calls unless you cache at the request level. Semantic caching, which matches inputs by meaning rather than by exact string, can catch even more. Teams that implement caching on their highest-volume workflows routinely see 20 to 40 percent reductions in call volume with no change in output quality.
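A minimal sketch of exact-match request caching, in Python. It assumes your model call can be wrapped in a single function; `call_model`, `CachedClient`, and `request_key` are hypothetical names for illustration, not part of any SDK. The store here is an in-memory dict, where a production system would more likely use a shared cache with an expiry policy.

```python
import hashlib
import json
from typing import Callable, Dict


def request_key(model: str, prompt: str, **params: object) -> str:
    """Build a stable key from everything that affects the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Wraps a model-calling function (hypothetical) with an exact-match cache."""

    def __init__(self, call_model: Callable[..., str]) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str, **params: object) -> str:
        key = request_key(model, prompt, **params)
        if key in self._store:
            # Identical request seen before: no API call is made.
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt, **params)
        self._store[key] = result
        return result
```

The key includes the model name and sampling parameters, so a request with a different temperature is treated as a different request. Tracking `hits` and `misses` gives you the hit rate, which tells you whether a given workflow is worth caching at all.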

Model Routing

GPT-4o is not the right model for every task. It is priced for tasks that need its reasoning depth, and those tasks are a smaller share of most production workloads than teams assume. Routing classification tasks, intent detection, simple transformations, and low-stakes summarization to a cheaper model like GPT-4o mini creates a tiered cost structure that matches spend to task complexity. The engineering work is a routing layer, not a prompt rewrite.
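A routing layer can start as a lookup table. The sketch below, in Python, assumes each request already carries a task-type label; the label strings and the `route_model` function are hypothetical, and the split between tiers is illustrative rather than a recommendation for any particular workload.

```python
from typing import Dict

CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Hypothetical task labels mapped to the cheaper tier.
ROUTES: Dict[str, str] = {
    "classification": CHEAP_MODEL,
    "intent_detection": CHEAP_MODEL,
    "simple_transformation": CHEAP_MODEL,
    "low_stakes_summarization": CHEAP_MODEL,
}


def route_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the stronger model."""
    return ROUTES.get(task_type, STRONG_MODEL)
```

Unknown task types fall through to the stronger model, so a missing label costs money rather than quality. Moving a task to the cheaper tier is then a one-line change that can be reviewed and reverted on its own.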

Deduplication in Agent Loops

Agentic workflows introduce a specific failure mode: retry loops that call the API multiple times for the same task when an upstream step fails. If an agent hits an error and retries three times before escalating, you have paid for four API calls and received at most one useful output. Deduplicating retries, adding idempotency keys, and capping retry counts are infrastructure changes that reduce cost without changing what the agent does when it succeeds.
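One way to sketch these ideas in Python: store the model output under an application-level idempotency key, so a retry caused by a failing non-model step reuses the stored output instead of paying for a new call, and cap the total number of attempts. `TaskRunner`, `call_model`, and `step` are hypothetical names, and the in-memory dict stands in for whatever durable store an agent framework would use.

```python
from typing import Callable, Dict, Optional


class TaskRunner:
    """Runs a task with capped retries and one paid model call per key."""

    def __init__(self, call_model: Callable[[str], str], max_attempts: int = 3) -> None:
        self._call_model = call_model
        self._results: Dict[str, str] = {}
        self.max_attempts = max_attempts
        self.api_calls = 0

    def model_output(self, idempotency_key: str, prompt: str) -> str:
        # Only call the model if no output is stored for this key yet.
        if idempotency_key not in self._results:
            self.api_calls += 1
            self._results[idempotency_key] = self._call_model(prompt)
        return self._results[idempotency_key]

    def run(self, idempotency_key: str, prompt: str, step: Callable[[str], str]) -> str:
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                output = self.model_output(idempotency_key, prompt)
                # `step` is the non-model work that may fail and trigger a retry.
                return step(output)
            except Exception as exc:
                last_error = exc
        # Retry budget exhausted: escalate instead of looping.
        raise RuntimeError(
            f"gave up after {self.max_attempts} attempts"
        ) from last_error
```

With this shape, a step that fails twice and then succeeds costs one model call rather than three. If the model call itself raises, nothing is stored, so the next attempt does call the model again, but only until the attempt cap is reached.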

Attribution

This is the Edison lesson in its purest form. You cannot reduce what you cannot measure. Adding cost attribution by feature, by team, by user segment, and by workflow transforms the API invoice from a single opaque number into a diagnostic tool. When attribution is in place, cost spikes have causes. When it is not, they have only theories.
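As a rough illustration, attribution can begin with a tagged usage record per API call and a function that rolls cost up along any tag. The Python sketch below is built on assumptions: the record fields, the model names, and the per-million-token prices are all placeholders, not real pricing, and `cost_by` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UsageRecord:
    feature: str
    team: str
    workflow: str
    model: str
    input_tokens: int
    output_tokens: int


# Placeholder prices: (input, output) in USD per 1M tokens. Not real pricing.
PRICES: Dict[str, Tuple[float, float]] = {
    "model-small": (1.0, 2.0),
    "model-large": (10.0, 20.0),
}


def record_cost(record: UsageRecord, prices: Dict[str, Tuple[float, float]]) -> float:
    """Cost in USD of a single call, from its token counts."""
    in_price, out_price = prices[record.model]
    total = record.input_tokens * in_price + record.output_tokens * out_price
    return total / 1_000_000


def cost_by(
    records: List[UsageRecord],
    dimension: str,
    prices: Dict[str, Tuple[float, float]],
) -> Dict[str, float]:
    """Sum cost along one tag, such as 'feature', 'team', or 'workflow'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += record_cost(record, prices)
    return dict(totals)
```

Calling `cost_by(records, "feature", PRICES)` and then `cost_by(records, "team", PRICES)` on the same records gives two views of the same invoice. A user-segment tag could be added as another field in the same way.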

The Order of Operations

The teams that reduce their OpenAI costs fastest do attribution first, then routing, then caching, then deduplication. They spend the first week making the problem legible before they spend any time solving it. The teams that start with prompts usually spend several weeks on the thing that returns the least.

Edison's engineers did not walk through lower Manhattan unscrewing light bulbs to reduce electricity consumption. They built meters, found the motors running overnight, and turned those off first. The principle is the same for AI infrastructure: find the waste that is already there before you optimize the things that are working.

Oberhahn is built around this sequence. Cost attribution first, optimization second, so the savings you find are the ones that actually move the number.