You Know What a Token Is. You Don't Know What It Costs You.

Ask any engineering leader what a token is and you'll get a reasonable answer: roughly four characters of text, or three-quarters of an English word. The conceptual definition is well understood. What is not well understood is the economic structure that sits on top of that definition — and the gap between "I understand tokens" and "I understand token economics" is where most AI budgets quietly blow up.

This post is not a primer on LLM basics. It's a primer on the cost mechanics that determine whether your AI investment is financially disciplined or a slow leak. By the end, you should be able to make model selection and architecture decisions with unit economics in mind, not just benchmarks.

Why Tokenizers Differ — and Why That Matters for Budgeting

Different models use different tokenizers. GPT-4o uses a BPE tokenizer (o200k_base; the older GPT-4 and GPT-3.5 models use cl100k_base). Claude models use a tokenizer trained on a different corpus. Llama-based models have their own. These are not interchangeable, and that has direct cost consequences.

The practical implication: the same prompt produces a different token count depending on the model you're calling. A 500-word system prompt might tokenize to 420 tokens on one model and 510 on another. At scale, across millions of calls, that roughly 20% variance is material.

More consequentially, tokenizers differ in how they handle structured data. JSON payloads, code, SQL, and XML tend to tokenize more efficiently when the tokenizer's vocabulary was trained on a code-heavy corpus. If your primary workload involves structured input, running a quick tokenization comparison before committing to a model is not pedantic. It's basic due diligence. A model that appears cheaper at list price may be consuming 15–25% more tokens on your actual workload.

The right mental model: token count is a function of both your content and your model choice. Never assume portability.
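A comparison like this takes a few lines to set up. The sketch below is a minimal harness, not a real tokenizer: the two counters are crude stand-in heuristics (assumptions for illustration), and in practice each would wrap the tokenizer that ships with the model you are evaluating.

```python
from typing import Callable, Dict, List

# A token counter maps text to a token count. In real use each counter would
# wrap a provider's tokenizer; the two below are stand-in heuristics only.
TokenCounter = Callable[[str], int]


def chars_over_four(text: str) -> int:
    """Stand-in heuristic: roughly four characters per token."""
    return max(1, round(len(text) / 4))


def words_and_punct(text: str) -> int:
    """Stand-in heuristic: each word and each punctuation mark is one token."""
    count = 0
    in_word = False
    for ch in text:
        if ch.isalnum() or ch == "_":
            if not in_word:
                count += 1
                in_word = True
        else:
            in_word = False
            if not ch.isspace():
                count += 1
    return max(1, count)


def compare_token_counts(samples: List[str],
                         counters: Dict[str, TokenCounter]) -> Dict[str, int]:
    """Total token count per counter across a workload sample."""
    return {name: sum(counter(s) for s in samples)
            for name, counter in counters.items()}


# Hypothetical structured-data samples; use real payloads from your workload.
samples = [
    '{"user_id": 42, "status": "active", "tags": ["a", "b"]}',
    "SELECT id, name FROM users WHERE status = 'active';",
]
totals = compare_token_counts(
    samples,
    {"heuristic_a": chars_over_four, "heuristic_b": words_and_punct},
)
baseline = min(totals.values())
for name, total in totals.items():
    print(f"{name}: {total} tokens ({total / baseline:.2f}x the lowest count)")
```

The point of the harness is the ratio in the last line: multiply each model's list price by its ratio before comparing prices.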

Output Tokens Cost More Than Input Tokens. This Is Not a Rounding Error.

Every major model provider prices output tokens at a premium over input tokens. The ratio varies, but three to five times is common. Claude Sonnet, for example, has list-priced output tokens at five times the input rate. GPT-4o has priced them at four times. Check the current pricing pages before you budget, because these ratios move.

The reason is partly architectural (autoregressive generation is computationally more expensive per token than prefill), partly market-driven (output is where the value is perceived to be). The reason matters less than the implication.

The implication: each output token weighs several times more on your bill than an input token, so for generation-heavy workloads the cost curve is driven less by how much context you send in than by how much output your application generates. This inverts how most teams think about optimization. They focus on trimming prompts, reducing input token count, when the higher-leverage intervention is constraining output length.

Concretely: if you're building a summarization pipeline and you tell the model to "write a comprehensive summary," you will get a longer output than if you tell it to "write a summary in 150 words or fewer." The difference in output tokens between those two instructions can be 3–5x. The difference in cost follows accordingly.

Output token discipline is the single highest-leverage optimization most teams aren't doing. Explicit length constraints, output format specifications, and structured output schemas all reduce output token count without sacrificing quality. This is engineering work, not prompt wizardry.
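To make the leverage concrete, here is a rough sketch of the arithmetic. The prices ($3 and $15 per million tokens) and the token counts are illustrative assumptions, not any provider's current rates.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Illustrative prices (assumptions): $3 per million input, $15 per million output.
IN_PRICE, OUT_PRICE = 3.00, 15.00

# Same 2,000-token document, three variants of the summarization call.
verbose = call_cost(2_000, 800, IN_PRICE, OUT_PRICE)         # "comprehensive summary"
constrained = call_cost(2_000, 200, IN_PRICE, OUT_PRICE)     # "150 words or fewer"
trimmed_prompt = call_cost(1_400, 800, IN_PRICE, OUT_PRICE)  # 30% shorter input instead

print(f"verbose output:     ${verbose:.4f}")
print(f"constrained output: ${constrained:.4f}")
print(f"trimmed prompt:     ${trimmed_prompt:.4f}")
```

Under these assumptions, capping the output halves the cost of the call, while cutting 30% of the prompt saves about 10%.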

Context Window Size and the Non-Linear Cost Curve

Context windows have expanded dramatically. 128K, 200K, and even 1M-token context windows are now available. This is genuinely useful for long-document workflows. It is also a cost trap that catches teams off guard.

The mechanics: input tokens accumulate with context length. If you're running a conversational application and you're naively appending every prior turn to each new API call, the input token count of each call grows linearly with conversation length, which means the cumulative input tokens billed over the whole conversation grow quadratically. In a 20-turn conversation where each turn adds 200 tokens, the final call resends 3,800 tokens of history (19 prior turns) that the first call didn't. Multiply that by your daily active user count and the cost profile is not a flat line. It's a ramp.
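A few lines of arithmetic show the shape of that ramp. This sketch assumes call k resends the k-1 turns before it and that every turn is the same size, which is a simplification.

```python
from typing import List


def history_tokens_per_call(turns: int, tokens_per_turn: int) -> List[int]:
    """History tokens resent on each call when every prior turn is appended."""
    return [prior_turns * tokens_per_turn for prior_turns in range(turns)]


per_call = history_tokens_per_call(turns=20, tokens_per_turn=200)

print(per_call[-1])   # history resent on the final call: 3800
print(sum(per_call))  # history tokens billed across the conversation: 38000
```

The final call looks modest on its own. The total is ten times larger, because every earlier call also paid for its share of the history.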

Context window economics also interact with caching. Anthropic's prompt caching feature (and equivalent mechanisms from other providers) allows frequently repeated context, such as system prompts, document context, and example sets, to be billed at a discounted rate on cache hits. This is not a trivial discount. Depending on the provider, cached input tokens are priced roughly 50–90% below uncached ones, though some providers charge a small premium to write the cache in the first place. For applications with stable, long system prompts, prompt caching is not optional infrastructure. It's a cost requirement.

The Caching Calculus

Prompt caching works when the prefix is stable and repeated. If your system prompt is 2,000 tokens and you make 10,000 calls per day, that is 20 million prefix tokens per day that could be billed at the cached rate instead of the full rate. If your system prompt changes per user or per request, only the portion before the first changed token can be cached, and often that is nothing. Architecture decisions, such as whether to put dynamic content inside or outside the cached prefix, directly determine whether you can capture these savings. Most teams discover this after the fact.
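Here is a back-of-the-envelope version of that calculation for the 2,000-token, 10,000-call example. Every rate in it is an assumption for illustration: a $3 per million input price, cache reads at 0.10x that price, cache writes at 1.25x, and a hypothetical 288 cache refreshes per day (one every five minutes).

```python
def daily_prefix_cost(prefix_tokens: int, calls_per_day: int,
                      input_price_per_m: float) -> float:
    """Daily cost of sending the prefix uncached on every call."""
    return prefix_tokens * calls_per_day * input_price_per_m / 1_000_000


def daily_cached_prefix_cost(prefix_tokens: int, calls_per_day: int,
                             input_price_per_m: float,
                             cache_writes_per_day: int,
                             read_multiplier: float = 0.10,
                             write_multiplier: float = 1.25) -> float:
    """Daily prefix cost when most calls hit the cache.

    The multipliers are assumptions; check your provider's pricing.
    """
    writes = min(cache_writes_per_day, calls_per_day)
    reads = calls_per_day - writes
    price_per_token = input_price_per_m / 1_000_000
    return prefix_tokens * price_per_token * (
        writes * write_multiplier + reads * read_multiplier
    )


uncached = daily_prefix_cost(2_000, 10_000, 3.00)
cached = daily_cached_prefix_cost(2_000, 10_000, 3.00, cache_writes_per_day=288)

print(f"uncached prefix: ${uncached:.2f}/day")
print(f"cached prefix:   ${cached:.2f}/day")
print(f"savings:         {1 - cached / uncached:.0%}")
```

The write count is the variable to watch. If dynamic content sits inside the prefix, every call becomes a write and the cached cost exceeds the uncached one.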

Why "Cost Per Token" Is a Misleading Metric

The industry defaults to cost-per-token as the unit for comparing models. It is the wrong unit. It measures price per atom of compute without accounting for the actual job being done.

Consider two models:

  • Model A: $3.00 per million tokens, requires 800 output tokens to complete a task accurately.
  • Model B: $5.00 per million tokens, requires 300 output tokens to complete the same task accurately.

Model A is cheaper per token and more expensive per task completion. This is not an edge case. It's a pattern. Smaller, cheaper models often require more verbose output to achieve the same quality on complex tasks. Larger models are frequently more token-efficient in their responses — they get to the point. The apparent price premium evaporates when you calculate cost per outcome rather than cost per token.

The correct unit for model comparison is cost per successful task completion — call it cost per outcome. This requires knowing your P50 and P90 output token counts for your actual workload, your retry rate (failed completions that re-run), and your quality threshold. None of this comes from the provider's pricing page. All of it requires measurement.
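As a sketch of that calculation, the function below prices the two models from the example and then folds in a retry rate. It assumes failed attempts are retried until one succeeds and that attempts are independent, so the expected number of attempts is 1 divided by the success rate. The success rates used here are hypothetical.

```python
def cost_per_outcome(output_tokens: int, price_per_m: float,
                     success_rate: float) -> float:
    """Expected output-token cost per successful completion.

    Assumes independent retries until success, so the expected number of
    attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = output_tokens * price_per_m / 1_000_000
    return cost_per_attempt / success_rate


# The two models from the example, with no retries.
model_a = cost_per_outcome(800, 3.00, success_rate=1.0)
model_b = cost_per_outcome(300, 5.00, success_rate=1.0)
print(f"Model A: ${model_a:.4f} per task")
print(f"Model B: ${model_b:.4f} per task")

# Hypothetical success rates: the cheaper model fails more often.
model_a_retry = cost_per_outcome(800, 3.00, success_rate=0.90)
model_b_retry = cost_per_outcome(300, 5.00, success_rate=0.97)
print(f"Model A with retries: ${model_a_retry:.4f} per task")
print(f"Model B with retries: ${model_b_retry:.4f} per task")
```

Model A costs $0.0024 per task against $0.0015 for Model B, and any gap in retry rate widens the difference.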

Model Selection Through a Unit Economics Lens

Most model selection decisions are made on the basis of benchmark performance and qualitative assessment. Neither is wrong, but neither is sufficient. The missing component is a unit economics analysis — and it's not difficult to construct.

The framework:

  1. Define the task type — extraction, generation, classification, reasoning, code synthesis. These have different token profiles.
  2. Measure token consumption on a representative sample — not a toy example. Run 100 real requests through each candidate model and record input tokens, output tokens, and quality ratings.
  3. Calculate cost per outcome — total token cost divided by number of acceptable completions.
  4. Model the cost curve at your projected volume — costs that look manageable at 10,000 requests per day look different at 1,000,000.
  5. Account for architectural leverage — caching eligibility, batching options (async batch APIs typically offer 50% discounts), streaming versus synchronous.
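Steps 2 through 4 can be sketched as a small aggregation over recorded samples. The record shape, the prices, and the ten sample rows below are all assumptions for illustration. A real evaluation would load the 100 or more measured requests per candidate model.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    input_tokens: int
    output_tokens: int
    acceptable: bool  # did the completion meet your quality threshold?


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: List[Sample], in_price_per_m: float,
              out_price_per_m: float) -> Dict[str, float]:
    """Token profile and cost per outcome for one candidate model."""
    total_cost = sum(
        s.input_tokens * in_price_per_m + s.output_tokens * out_price_per_m
        for s in samples
    ) / 1_000_000
    accepted = sum(1 for s in samples if s.acceptable)
    outputs = [s.output_tokens for s in samples]
    return {
        "p50_output": percentile(outputs, 50),
        "p90_output": percentile(outputs, 90),
        "cost_per_request": total_cost / len(samples),
        "cost_per_outcome": total_cost / accepted if accepted else float("inf"),
    }


# Hypothetical measurements: 10 requests, 8 of them acceptable.
samples = [Sample(1_000, 100 * i, acceptable=(i not in (3, 7)))
           for i in range(1, 11)]
summary = summarize(samples, in_price_per_m=3.00, out_price_per_m=15.00)
projected_daily_cost = summary["cost_per_request"] * 1_000_000  # at 1M requests/day

print(summary)
print(f"projected daily cost at 1M requests: ${projected_daily_cost:,.0f}")
```

Run the same summary for each candidate model and compare the cost_per_outcome values, not the list prices.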

This is not a one-time analysis. Model prices change. New models release. Your workload evolves. Teams that build this evaluation into their deployment process — rather than treating model selection as a one-time architectural decision — maintain better cost trajectories over time.

Architecture Patterns That Compound Token Costs

Beyond individual model calls, the architecture of your AI application determines your token economics at a structural level. Several patterns are disproportionately expensive and often go unexamined.

Naive RAG Pipelines

Retrieval-augmented generation is the standard pattern for grounding LLMs in organizational data. A poorly designed RAG pipeline stuffs the top-N retrieved chunks into context regardless of relevance score. Top-5 chunks of 500 tokens each is 2,500 tokens of context on every call, whether you needed all five or only one. Reranking, relevance thresholding, and dynamic context sizing are all techniques that reduce average input token consumption in RAG architectures — often by 40–60% without quality loss.
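Relevance thresholding and dynamic context sizing can be as simple as the sketch below. The Chunk shape, the 0.6 score cutoff, and the 2,000-token budget are assumptions for illustration. The right cutoff depends on your retriever or reranker and has to be tuned against quality.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever or reranker; higher is better
    tokens: int


def select_chunks(chunks: List[Chunk], min_score: float,
                  token_budget: int) -> List[Chunk]:
    """Keep the highest-scoring chunks above a threshold, within a token budget."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # everything after this is less relevant still
        if used + chunk.tokens > token_budget:
            continue  # too large; a smaller relevant chunk may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected


# Hypothetical retrieval result: five 500-token chunks, two clearly relevant.
retrieved = [
    Chunk("chunk-1", 0.91, 500),
    Chunk("chunk-2", 0.84, 500),
    Chunk("chunk-3", 0.52, 500),
    Chunk("chunk-4", 0.40, 500),
    Chunk("chunk-5", 0.31, 500),
]
context = select_chunks(retrieved, min_score=0.6, token_budget=2_000)
naive_tokens = sum(c.tokens for c in retrieved)
selected_tokens = sum(c.tokens for c in context)
print(f"naive top-5: {naive_tokens} tokens, thresholded: {selected_tokens} tokens")
```

In this made-up case the call sends 1,000 tokens of context instead of 2,500. The saving on real traffic depends on how often the lower-ranked chunks are actually needed.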

Multi-Step Agent Chains

Agentic workflows compound token costs in a way that single-call applications do not. Each step in a chain carries the conversation history from prior steps. A five-step agent loop where each step adds 500 tokens of tool results means the final call is processing 2,000 tokens of accumulated context the first call didn't pay for. Agent architectures require explicit context management strategies — summarization of prior steps, selective history inclusion, or handoff protocols that reset context — to prevent cost escalation on long-running tasks.
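One of the simpler strategies, selective history inclusion, might look like the sketch below. It keeps the original task plus as many of the most recent steps as fit a token budget. The step representation and the 1,500-token budget are assumptions, and a production version would likely summarize dropped steps rather than discard them.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (content, token_count)


def trim_history(task: Step, steps: List[Step], token_budget: int) -> List[Step]:
    """Keep the task plus the most recent steps that fit within the budget."""
    kept: List[Step] = []
    used = task[1]
    for step in reversed(steps):  # walk from newest to oldest
        if used + step[1] > token_budget:
            break
        kept.append(step)
        used += step[1]
    return [task] + kept[::-1]  # restore chronological order


# Hypothetical agent run: a 300-token task and four 500-token tool results.
task = ("task description", 300)
steps = [(f"tool result {i}", 500) for i in range(1, 5)]

context = trim_history(task, steps, token_budget=1_500)
print([content for content, _ in context])
print(sum(tokens for _, tokens in context), "tokens sent instead of",
      task[1] + sum(tokens for _, tokens in steps))
```

With these numbers the fifth call sends 1,300 tokens instead of 2,300, and the ceiling holds no matter how long the loop runs.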

Redundant System Prompts

In applications where multiple teams or features share an API integration, it's common to find system prompts that have grown organically — multiple authors appending instructions over time. System prompts of 3,000–5,000 tokens are not unusual in mature applications. Every single API call pays for those tokens. A periodic audit of system prompt token count, combined with aggressive consolidation, is a maintenance task most engineering teams skip but shouldn't.

What This Means Operationally

Token economics is not a finance problem. It's an engineering problem with financial consequences. The teams that manage it well treat token consumption as a first-class engineering metric alongside latency, error rate, and availability.

This means instrumenting every AI call with token counts — input, output, and cached. It means tracking cost per task type, not just aggregate spend. It means setting output length budgets in your prompts as a default, not an afterthought. And it means building model evaluation into your deployment workflow so you're always running the right model for the right task at the right price.
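A minimal version of that instrumentation is a usage record per call and a rollup by task type, as sketched below. The record fields and the per-million prices are assumptions. Most provider responses include usage figures you can map onto a structure like this.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UsageRecord:
    task_type: str
    input_tokens: int   # uncached input tokens
    cached_tokens: int  # input tokens served from the prompt cache
    output_tokens: int


# Illustrative prices per million tokens (assumptions, not real rates).
PRICES_PER_M = {"input": 3.00, "cached": 0.30, "output": 15.00}


def record_cost(record: UsageRecord) -> float:
    """Dollar cost of a single call."""
    return (
        record.input_tokens * PRICES_PER_M["input"]
        + record.cached_tokens * PRICES_PER_M["cached"]
        + record.output_tokens * PRICES_PER_M["output"]
    ) / 1_000_000


def cost_by_task_type(records: List[UsageRecord]) -> Dict[str, float]:
    """Total spend per task type."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.task_type] += record_cost(record)
    return dict(totals)


# Hypothetical log of three calls.
log = [
    UsageRecord("summarize", 1_000, 2_000, 200),
    UsageRecord("classify", 500, 2_000, 10),
    UsageRecord("summarize", 1_000, 2_000, 400),
]
totals = cost_by_task_type(log)
for task_type, cost in sorted(totals.items()):
    print(f"{task_type}: ${cost:.5f}")
```

Once spend is attributed per task type, the expensive workloads stop hiding inside the aggregate bill.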

Platforms like Oberhahn surface exactly this kind of token-level attribution across models, teams, and workloads — making the economics visible so engineering leaders can act on them rather than discover problems in the next billing cycle.

The token economy rewards the teams that study it. The ones that don't pay a premium that compounds quietly, month over month, until someone finally asks why the AI line item looks the way it does.