The Bill Arrived. Now What?
You've seen the email. OpenAI usage this month was significantly higher than last month. Maybe it's 2x. Maybe it's 5x. Maybe it's a number that's going to be hard to explain on the next budget review. Your first instinct is to log into the OpenAI usage dashboard and look for the cause.
Here's the problem: the usage dashboard shows you totals. Tokens in, tokens out, requests, by model, by day. It does not show you which team, which workflow, which feature, or which engineering decision caused the spike. You're looking at an aggregate that hides everything that would actually be useful. It's the equivalent of trying to debug a performance problem with only a total CPU percentage — you can tell something is wrong, but you can't tell what or where.
OpenAI API costs going out of control is a real and common problem. But the failure mode isn't usually a single obvious cause. It's usually one or more of five specific patterns that compound and interact. This post names them explicitly and explains what you need to see in order to fix each one.
Cause 1: Context Window Accumulation in Agentic Workflows
This is the most common cause of unexpected cost spikes in organizations running agentic workflows, and it's the least obvious until you understand how context pricing works.
Every token in the context window costs money — not just the tokens you're generating, but every token of context you're including. In a multi-turn agentic workflow, context accumulates. The agent adds tool results, intermediate reasoning, prior conversation turns, and retrieval results to the context with every step. By the fifth step of a 10-step workflow, you might be including 8,000 tokens of accumulated context to generate 100 tokens of output. The input token cost alone dwarfs what you'd expect if you thought only about the output.
The compounding effect is severe. A workflow that costs $0.01 per run at step one might cost $0.15 per run by step ten — not because the output is longer, but because the context window is massive. At scale, this difference is the difference between a sustainable cost model and a runaway bill.
What to look for:
- Input token to output token ratios that are much higher than expected — input token cost dominating output token cost is a sign of context accumulation
- Cost-per-run increasing as workflows run longer or handle more complex tasks
- Any workflow that maintains state across multiple model calls without explicit context trimming
Cause 2: Retry Loops Without Cost Awareness
Retry logic in API systems is standard engineering practice. A call fails; you wait; you retry. The assumption baked into most retry patterns is that retries are rare and the cost of a retry is equivalent to the cost of the original call. In LLM systems, both assumptions can be wrong.
First, retries aren't always rare in LLM systems. Model timeouts, rate limit errors, and context length errors can be common at scale, especially for workflows that weren't designed with production load in mind. A retry rate of 10% on a high-volume workflow means 10% more spend right there. A retry rate of 30% on a workflow that's already expensive per call compounds quickly.
Second, the retry may not be equivalent in cost to the original call. If your retry logic doesn't reset the context, the retry call might include the full context from the failed attempt plus additional error-handling content — making it more expensive than the original call. Bad exponential backoff implementations can retry indefinitely. Infinite retry loops on expensive models are a category of billing surprise that has caused real incidents.
What to look for:
- Request counts that don't match your expected traffic — more requests than user actions or workflow invocations suggests retries
- Any workflow that doesn't have explicit retry limits and cost-based circuit breakers
- Error rate trends that correlate with cost spikes
Cause 3: Model Version Drift
OpenAI updates model aliases regularly. When you call gpt-4o, you're calling whatever the current version of that alias points to — which may have changed since you last checked, and may be priced differently than the version you tested against. More subtly, even without a price change, a model update can change the verbosity of outputs, which changes your output token count, which changes your cost.
Organizations that pin to aliases rather than specific versioned model IDs are exposed to this drift. Your cost model is based on the model behavior and pricing when you built the workflow. If either has changed — and both do change — your cost model is wrong.
The more common version drift issue is deliberate model upgrades that weren't fully costed. A team migrates a workflow from GPT-3.5-level capabilities to GPT-4o because the quality improvement is compelling. The cost per call increases by 10x to 20x. If that workflow runs at any significant volume, the budget impact is material — and if the migration happened without a formal cost review, the spike looks unexplained.
What to look for:
- Spend-per-request trends by model — cost-per-call should be relatively stable; increases suggest model version changes
- Which model versions are in production and whether any were recently changed
- Model aliases versus pinned version IDs in your codebase
Cause 4: Development and Experimentation Spend Leaking into Production Accounting
Most organizations don't have strong environment separation for their AI API usage. Developer testing, CI/CD pipeline runs, and production traffic all flow through the same API key, or at best through keys that aren't tagged differently in the billing system. When costs spike, it's impossible to distinguish whether production traffic increased, whether a developer was running expensive experiments, or whether a CI pipeline started running more evaluation tests.
This problem is more significant than it sounds. Development and experimentation spend can be substantial — a team running evaluations against a dataset of 10,000 examples at GPT-4o pricing is spending real money on every run of the eval suite. If that spend isn't separated from production spend, your production cost trends are polluted and your optimization efforts are aiming at the wrong target.
What to look for:
- Cost spikes that correlate with CI/CD pipeline activity rather than user traffic
- Large request volumes that don't correspond to any visible user-facing feature activity
- Developers running ad-hoc experiments against production API keys
| Cost Cause | Detection Signal | Primary Fix |
|---|---|---|
| Context window accumulation | High input:output token ratio | Context trimming strategy in agentic loops |
| Retry loops | Requests exceed expected workflow invocations | Retry limits + cost-aware circuit breakers |
| Model version drift | Cost-per-request increasing without traffic change | Pin to versioned model IDs; cost review for upgrades |
| Dev/prod spend mixing | Spikes uncorrelated with user traffic | Environment-separated API keys with attribution tagging |
| Unbounded agent execution | Per-run cost variance is high; long-tail expensive runs | Step limits and per-run spend caps in workflow design |
Cause 5: Unbounded Agent Execution
Agentic workflows are powerful because they can dynamically determine how many steps to take and what tools to use. This flexibility is also a cost control problem. An agent that decides it needs to run more iterations, retrieve more context, or call more tools than expected will cost more — and there's often nothing in the workflow design that prevents this from happening at extreme scale.
The unbounded execution problem is especially acute when the agent encounters edge cases that cause it to loop, reconsider, or seek additional information. A well-crafted adversarial input — or just an unusual but legitimate query — can cause an agent to run 50 steps where it usually runs 5. If each step involves a model call with accumulated context, the cost of that single run can be orders of magnitude higher than the average.
This isn't a theoretical concern. Production deployments of agentic systems regularly surface long-tail expensive runs that weren't anticipated in cost models. Without per-run spend caps or step limits, a single unusual query can generate more cost than thousands of normal queries combined.
What to look for:
- High variance in cost-per-run — average cost is manageable but there are expensive outlier runs
- No maximum step count or spend cap defined in your workflow configuration
- Agents that can recursively spawn sub-agents without depth limits
What You Actually Need to See
Each of these five causes is diagnosable, but only if you have the right data. The OpenAI usage dashboard doesn't give you what you need. What you need is spend data broken down by:
- Workflow or feature (not just model or total)
- Environment (production, staging, development)
- Model version (specific version IDs, not aliases)
- Request type (initial call vs. retry)
- Cost-per-run distribution (average plus percentiles to surface outliers)
With this data, each cause becomes identifiable within minutes of a cost spike rather than days of forensic analysis. Without it, you're guessing — and guessing wrong is expensive.
Building this instrumentation internally is feasible but non-trivial. It requires a proxy or SDK wrapper that captures request metadata, a data pipeline that aggregates it, and a dashboard that surfaces the right views. Platforms like Oberhahn provide this layer pre-built, so that the first time you need it, you have it — rather than building it under pressure while a spike is actively running.
The Harder Truth
OpenAI API costs going out of control isn't usually a single cause with a single fix. It's usually multiple patterns running simultaneously, each contributing to a total that's hard to parse because the attribution isn't there. The teams closest to the problem often have intuitions about what's causing the spike, but they can't confirm those intuitions without data.
The fix isn't to audit the codebase every time there's a spike. It's to build the visibility infrastructure so that the data is there before the spike happens. Every dollar you spend building attribution and monitoring is worth multiples in cost avoided and time saved. The spike will happen again. The question is whether you'll be able to diagnose it in five minutes or five days.