The spreadsheet is always the same. Someone pulls the published pricing pages for the models under consideration, builds a table with input and output token costs, and uses those numbers to make the cost case for one model over another. It looks rigorous. It is not.

Token-level price comparison is a trap for three reasons. First, different models use different tokenizers — a given piece of text tokenizes to different token counts depending on the model processing it, sometimes by 15-30%. Second, different models produce different output lengths for the same task — a more verbose model costs more even if its per-token price is lower. Third, different models have different context efficiency, meaning you may need more context to get the same quality output from one model than another, which affects your input token counts.

If you're making model selection decisions based on published token prices without accounting for these factors, you are not doing cost benchmarking. You're reading a menu and calling it a grocery bill.

This post explains how to benchmark LLM costs properly, at the task level, and gives you a framework you can run on your own workloads.

Why Token-Level Comparison Misleads

The Tokenizer Problem

Every major LLM uses its own tokenizer, which means the same input text produces different token counts across providers. OpenAI models use the encodings implemented in its tiktoken library. Anthropic's Claude models use a different tokenizer. Google's Gemini models use yet another. The differences aren't random (they reflect different training decisions and vocabulary choices), but from a cost perspective, what matters is that a 500-word prompt does not produce the same token count on every model. It might produce 380 tokens in one tokenizer and 420 in another.

For a single call, this is a rounding error. At a million calls per day, it's a material cost difference that has nothing to do with model quality. If you're comparing models on token price without measuring actual token counts on your real inputs, your comparison is wrong by construction.
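One way to make the tokenizer effect visible is to normalize price by text volume rather than by token. The sketch below reuses the 380 and 420 token counts from the 500-word example; the function name and both per-million prices are hypothetical, chosen only to show that a nominally cheaper model can cost more per word.

```python
def effective_price_per_1k_words(tokens: int, words: int, price_per_million: float) -> float:
    """Dollar cost of 1,000 words of input, given a measured tokens-per-word ratio."""
    tokens_per_word = tokens / words
    return tokens_per_word * 1_000 * price_per_million / 1_000_000


# Token counts come from the 500-word example; both prices are hypothetical.
model_x = effective_price_per_1k_words(tokens=380, words=500, price_per_million=3.00)
model_y = effective_price_per_1k_words(tokens=420, words=500, price_per_million=2.80)

print(f"model_x: ${model_x:.6f} per 1k words")
print(f"model_y: ${model_y:.6f} per 1k words")
```

Under these assumed prices, model_y has the lower sticker price but the higher cost per 1,000 words, because it spends more tokens on the same text.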

The Output Length Problem

Models vary significantly in output verbosity for the same task. Ask GPT-4o and Claude 3.5 Sonnet to answer the same question and the responses will differ not just in quality but in length. This isn't inherently good or bad — sometimes a more verbose answer is better, sometimes it isn't — but it's directly relevant to cost comparison because output tokens are almost always priced higher than input tokens.

A model that produces 40% more output tokens on average costs 40% more on the output side, even if its per-token price is identical. For output-heavy workloads, this can easily flip the cost comparison. A model with a nominally higher per-token output price can end up cheaper in practice if it's more concise.
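The flip can be checked with a few lines of arithmetic. This is a minimal sketch: the 1.4 verbosity ratio is the 40% figure above, while the token counts, the prices, and the function names are illustrative assumptions.

```python
def output_cost(output_tokens: float, price_per_million: float) -> float:
    """Dollar cost of the output side of one call."""
    return output_tokens * price_per_million / 1_000_000


def break_even_price(baseline_price: float, verbosity_ratio: float) -> float:
    """Per-million price at which a model emitting `verbosity_ratio` times as
    many output tokens costs the same as the baseline model."""
    return baseline_price / verbosity_ratio


# Hypothetical: the concise model averages 300 output tokens at $15.00/M;
# the verbose model emits 40% more (420 tokens) at a lower $12.00/M.
concise = output_cost(300, 15.00)
verbose = output_cost(420, 12.00)
threshold = break_even_price(15.00, 1.4)

print(f"concise: ${concise:.5f}  verbose: ${verbose:.5f}")
print(f"verbose model must be priced below ${threshold:.2f}/M to win")
```

In this assumed scenario the verbose model's 20% price discount is not enough; it would need to be priced below roughly $10.71 per million output tokens to break even.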

The Context Efficiency Problem

Some tasks require you to provide context to get good results — examples, instructions, retrieved documents. Different models extract information from context at different efficiency levels. A model that needs a longer, more explicit system prompt to perform reliably will cost more on every call, even before generation, because the system prompt counts as input tokens.

This effect is particularly pronounced in enterprise use cases where system prompts carry detailed instructions, persona information, or policy constraints. A system prompt that needs to be 2,000 tokens to reliably constrain one model's behavior might only need to be 800 tokens for another. At high call volumes, that input token difference is a significant cost factor.
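To put a rough number on that difference, the sketch below multiplies the extra system-prompt tokens by call volume. The 2,000 and 800 token figures are the ones above; the call volume and the input price are assumptions for illustration.

```python
def daily_prompt_overhead(extra_tokens: int, calls_per_day: int, price_per_million: float) -> float:
    """Dollars per day spent on the additional system-prompt tokens alone."""
    return extra_tokens * calls_per_day * price_per_million / 1_000_000


# Assumed: one million calls per day at a hypothetical $3.00 per million input tokens.
overhead = daily_prompt_overhead(
    extra_tokens=2_000 - 800,
    calls_per_day=1_000_000,
    price_per_million=3.00,
)
print(f"extra input spend: ${overhead:,.2f} per day")
```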

The Right Framework: Task-Level Benchmarking

The solution is to benchmark at the task level, not the token level. A task-level benchmark measures the total cost to complete a specific unit of work using a specific model, measured on real or representative inputs from your actual workload.

Here's the framework:

Step 1: Define Your Task Types

Start by categorizing your LLM workload into distinct task types. Common examples include:

  • Question answering over a corpus
  • Document summarization
  • Classification (intent, sentiment, category)
  • Code generation or explanation
  • Structured data extraction
  • Long-form text generation
  • Conversational response

Each task type has different prompt structures, different typical input lengths, different output length patterns, and potentially different quality requirements. A benchmarking framework that aggregates across task types will mislead you — you need per-task comparisons.

Step 2: Sample Real Inputs

For each task type, pull a sample of real inputs from your production logs — at least 100, ideally 500 or more. These should be representative of the actual distribution of inputs your system processes, not cherry-picked examples. If your distribution is bimodal (you have both short and long inputs), sample from both modes.
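A minimal sketch of sampling from both modes, assuming inputs are plain strings and that character length is an adequate proxy for splitting short from long inputs. The function name, the threshold, and the synthetic stand-in logs are all hypothetical; in practice the list would come from your own log store.

```python
import random


def stratified_sample(inputs, length_threshold, sample_size, seed=0):
    """Sample short and long inputs in proportion to their share of the logs."""
    rng = random.Random(seed)  # fixed seed so the benchmark set is reproducible
    strata = [
        [x for x in inputs if len(x) <= length_threshold],
        [x for x in inputs if len(x) > length_threshold],
    ]
    sample = []
    for stratum in strata:
        if not stratum:
            continue
        # Proportional allocation, but never fewer than one item per non-empty mode.
        k = max(1, round(sample_size * len(stratum) / len(inputs)))
        sample.extend(rng.sample(stratum, min(k, len(stratum))))
    return sample


# Stand-in data: 900 short queries and 100 long documents.
logs = ["q" * 20] * 900 + ["d" * 5_000] * 100
sample = stratified_sample(logs, length_threshold=1_000, sample_size=100)
print(len(sample), sum(len(x) > 1_000 for x in sample))
```

Proportional allocation keeps the sample's mix close to production; if the long mode is rare but expensive, you may prefer to oversample it and reweight when averaging.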

Synthetic or hand-crafted test cases are a common shortcut that produces misleading benchmarks. Real inputs have the noise, variation, and edge cases that matter for both quality and cost measurement.

Step 3: Measure Total Cost Per Task Across Models

For each model in your comparison set, run the sampled inputs through your actual prompt templates (not simplified versions) and measure:

  • Input token count (as counted by the model's tokenizer)
  • Output token count
  • Total cost at published pricing
  • Wall clock latency

Compute the per-task cost as input tokens × input price + output tokens × output price, then average across your sample. This gives you the average cost per task for each model on your real workload.
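The calculation can be sketched as follows, assuming prices are quoted in dollars per million tokens and that you have already recorded token counts for each call. The `CallRecord` type, the sample records, and the prices are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallRecord:
    input_tokens: int   # as counted by the model's own tokenizer
    output_tokens: int


def task_cost(rec: CallRecord, input_price: float, output_price: float) -> float:
    """Cost of one call; prices are dollars per million tokens."""
    return (rec.input_tokens * input_price + rec.output_tokens * output_price) / 1_000_000


def average_cost_per_task(records, input_price: float, output_price: float) -> float:
    """Mean per-task cost across the sampled inputs."""
    return mean(task_cost(r, input_price, output_price) for r in records)


# Two hypothetical calls at assumed prices of $3.00/M input and $15.00/M output.
records = [CallRecord(2_000, 300), CallRecord(3_000, 500)]
avg = average_cost_per_task(records, input_price=3.00, output_price=15.00)
print(f"average cost per task: ${avg:.4f}")
```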

Step 4: Measure Task Quality

Cost comparison without quality measurement is just cost measurement. You need to know whether a cheaper model produces acceptable output for the task in question.

Quality measurement approaches vary by task type:

| Task Type            | Quality Metric                     | Measurement Method                   |
|----------------------|------------------------------------|--------------------------------------|
| Classification       | Accuracy, F1                       | Labeled test set comparison          |
| Extraction           | Precision, recall on fields        | Labeled test set comparison          |
| Summarization        | Faithfulness, coverage             | LLM-as-judge or human eval           |
| Code generation      | Test pass rate, correctness        | Unit test execution                  |
| Q&A                  | Answer accuracy, hallucination rate | LLM-as-judge or human eval          |
| Long-form generation | Coherence, instruction following   | Human eval or rubric-based LLM judge |

For each model, compute a quality score and plot it against cost-per-task. The goal is to find the cost-quality frontier — the set of models where no option is both cheaper and better. Any model inside the frontier is dominated and should be ruled out.
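Finding the frontier is a small dominance check. The sketch below assumes each model is summarized as a (cost per task, quality score) pair; the model names and numbers are hypothetical.

```python
def pareto_frontier(models):
    """Return the names of models that no other model dominates.

    `models` maps name -> (cost_per_task, quality). A model is dominated when
    another is at least as cheap and at least as good, and strictly better on
    one of the two.
    """
    frontier = []
    for name, (cost, quality) in models.items():
        dominated = any(
            other != name
            and o_cost <= cost
            and o_quality >= quality
            and (o_cost < cost or o_quality > quality)
            for other, (o_cost, o_quality) in models.items()
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical benchmark results: (cost per task in dollars, quality score).
results = {
    "model_a": (0.010, 0.92),
    "model_b": (0.007, 0.71),
    "model_c": (0.012, 0.90),  # costs more than model_a and scores lower
    "model_d": (0.003, 0.60),
}
print(pareto_frontier(results))
```

Here model_c drops out because model_a is both cheaper and better; the remaining three each represent a different cost-quality trade-off.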

Step 5: Compute Cost-Per-Acceptable-Task

Here's the metric that actually matters: cost per successfully completed task, where success is defined by your quality threshold.

If Model A costs $0.01 per task and produces acceptable output 92% of the time, and Model B costs $0.007 per task and produces acceptable output 71% of the time, Model B's 30% price advantage shrinks considerably once you adjust for quality:

  • Model A: $0.01 / 0.92 = $0.0109 per acceptable task
  • Model B: $0.007 / 0.71 = $0.0099 per acceptable task

In this example, Model B is still slightly cheaper on a quality-adjusted basis — but the gap is much narrower than the raw per-task comparison suggests, and depending on the consequences of a failed task (a user gets a bad answer, a downstream system receives corrupted data), the difference may not justify switching.
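The same arithmetic, plus one possible way to fold in the consequences of a failed task, can be sketched as below. The Model A and Model B figures are the ones above; the per-failure cost of $0.02 and the second function are assumptions added for illustration, not part of the metric as defined here.

```python
def cost_per_acceptable_task(cost_per_task: float, acceptance_rate: float) -> float:
    """Raw cost divided by the share of outputs that meet the quality bar."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return cost_per_task / acceptance_rate


def expected_cost_with_failures(cost_per_task: float, acceptance_rate: float,
                                failure_cost: float) -> float:
    """Raw cost plus an assumed downstream cost for each unacceptable output."""
    return cost_per_task + (1 - acceptance_rate) * failure_cost


cpa_a = cost_per_acceptable_task(0.01, 0.92)
cpa_b = cost_per_acceptable_task(0.007, 0.71)
print(f"A: ${cpa_a:.4f}  B: ${cpa_b:.4f} per acceptable task")

# Hypothetical: each failed task costs $0.02 in retries, support, or cleanup.
exp_a = expected_cost_with_failures(0.01, 0.92, failure_cost=0.02)
exp_b = expected_cost_with_failures(0.007, 0.71, failure_cost=0.02)
print(f"A: ${exp_a:.4f}  B: ${exp_b:.4f} with failure cost")
```

With that assumed failure cost, the ordering reverses and Model A becomes the cheaper option, which is why the failure consequences belong in the comparison.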

Sample Comparison Methodology

To make this concrete, here's what a task-level benchmark for a document Q&A task might look like:

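| Model             | Avg Input Tokens | Avg Output Tokens | Cost/Task | Quality Score | Cost/Acceptable Task |
|-------------------|------------------|-------------------|-----------|---------------|----------------------|
| GPT-4o            | 2,847            | 312               | $0.0157   | 91%           | $0.0173              |
| GPT-4o mini       | 2,891            | 287               | $0.0023   | 74%           | $0.0031              |
| Claude 3.5 Sonnet | 2,634            | 298               | $0.0121   | 93%           | $0.0130              |
| Claude 3 Haiku    | 2,701            | 241               | $0.0009   | 61%           | $0.0015              |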

Notice a few things in this hypothetical comparison. GPT-4o and Claude 3.5 Sonnet tokenize the same input differently: the Claude model produces 213 fewer input tokens on average. Both are competitive on quality. On a per-acceptable-task basis, Claude 3.5 Sonnet is cheaper ($0.0130 versus $0.0173) despite higher per-token pricing on some tiers. GPT-4o mini and Claude 3 Haiku are dramatically cheaper, even after the quality adjustment, but cost per acceptable task does not price in the failures themselves. With 26% and 39% of answers falling below the bar, neither is appropriate for a task where answer accuracy matters to the user experience.

This is what benchmarking actually looks like. The right model choice depends on your quality threshold, which depends on your application context — not on which provider has the lowest number on a pricing page.

Common Mistakes in LLM Cost Benchmarking

Benchmarking on Simplified Prompts

Your production system prompt is not “Answer the following question.” It has instructions, constraints, persona, format requirements, and safety guardrails. Running benchmarks on stripped-down prompts produces token counts that bear no resemblance to production reality. Always benchmark on your actual prompt templates.

Ignoring Prompt Caching Effects

Several providers offer prompt caching — if you send the same prefix (typically a system prompt) across many calls, subsequent calls with that cached prefix are cheaper. If your workload has a long, fixed system prompt, caching can change the cost calculation significantly. Make sure your benchmark accounts for whether caching is active and what hit rate you'd expect in production.
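A rough way to model the effect is to blend the cached and uncached price of the fixed prefix by the expected hit rate. This is a simplified sketch: the discounted cached price, the hit rate, and the token counts are all assumed values, and it ignores any cache-write surcharge or expiry rules a provider may apply, so check the provider's actual pricing terms before relying on it.

```python
def effective_input_cost(prefix_tokens: int, variable_tokens: int,
                         input_price: float, cached_price: float,
                         hit_rate: float) -> float:
    """Expected input cost per call, with prices in dollars per million tokens.

    Only the fixed prefix (for example, the system prompt) is assumed cacheable.
    """
    blended_prefix_price = hit_rate * cached_price + (1 - hit_rate) * input_price
    return (prefix_tokens * blended_prefix_price
            + variable_tokens * input_price) / 1_000_000


# Hypothetical: 2,000-token system prompt, 500 variable tokens, $3.00/M uncached,
# $0.30/M cached, and a 90% cache hit rate.
with_cache = effective_input_cost(2_000, 500, 3.00, 0.30, hit_rate=0.9)
no_cache = effective_input_cost(2_000, 500, 3.00, 0.30, hit_rate=0.0)
print(f"with caching: ${with_cache:.5f}  without: ${no_cache:.5f}")
```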

Benchmarking Once and Treating It as Permanent

Models get updated, prices change, and new models launch regularly. A benchmark from six months ago may not reflect current prices or current model behavior. High-spend use cases warrant re-benchmarking whenever a major model update or price change occurs.

Forgetting Latency

Cost and latency are related but not identical constraints. A model that's 30% cheaper but 2x slower may be appropriate for an async processing pipeline and completely inappropriate for a real-time user-facing application. Include latency in your benchmarks and apply it as a constraint before running the cost comparison.
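Applying latency as a gate before the cost comparison might look like the sketch below. The choice of p95 latency as the constraint, the field names, and the candidate numbers are assumptions made for illustration.

```python
def cheapest_within_latency(candidates, max_p95_latency_s):
    """Return the cheapest model whose p95 latency meets the limit, or None."""
    eligible = {
        name: stats
        for name, stats in candidates.items()
        if stats["p95_latency_s"] <= max_p95_latency_s
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["cost_per_task"])


# Hypothetical benchmark results: the slower model is 30% cheaper but 2x slower.
candidates = {
    "fast_model": {"cost_per_task": 0.010, "p95_latency_s": 1.2},
    "slow_model": {"cost_per_task": 0.007, "p95_latency_s": 2.4},
}
print(cheapest_within_latency(candidates, max_p95_latency_s=1.5))   # real-time limit
print(cheapest_within_latency(candidates, max_p95_latency_s=30.0))  # async pipeline
```

The same candidates yield different winners depending on the limit, which is the point of filtering first and comparing cost second.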

Building a Benchmarking Practice

Ad hoc benchmarks done once at project kickoff decay quickly. The teams that manage AI costs effectively treat benchmarking as an ongoing practice:

  • Maintain a benchmark suite for each major task type in production
  • Re-run benchmarks when significant model updates are announced or new models become available
  • Track cost-per-task trends over time to detect model drift or prompt efficiency degradation
  • Gate significant prompt changes on benchmark results, not just anecdotal quality assessment

This requires infrastructure — a way to run benchmarks automatically, store results, and surface regressions. It's not a major engineering investment, but it needs to be prioritized as a first-class concern, not an occasional exercise. Platforms like Oberhahn can surface cost-per-task trends from production data, which gives you a continuous signal rather than a periodic snapshot.
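Surfacing a regression does not need to be elaborate. As a minimal sketch, assuming you store one average cost-per-task figure per benchmark run, a check like the following could flag a run that exceeds the recent baseline. The 10% tolerance, the function name, and the sample history are hypothetical.

```python
from statistics import mean


def cost_regression(history, latest, tolerance=0.10):
    """True if the latest cost-per-task exceeds the historical mean by more
    than `tolerance` (expressed as a fraction of the baseline)."""
    baseline = mean(history)
    return latest > baseline * (1 + tolerance)


# Hypothetical cost-per-task results from three earlier benchmark runs.
history = [0.0120, 0.0121, 0.0119]
print(cost_regression(history, latest=0.0135))  # above the 10% band
print(cost_regression(history, latest=0.0125))  # within the band
```

A check like this can run after each benchmark and block a prompt change or model swap until someone reviews the increase.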

The Bottom Line

Comparing models on published token prices is not cost benchmarking. It's pricing research, which is a starting point, not a conclusion. The actual cost of a model for your workload depends on your tokenizer behavior, your output length patterns, your prompt structure, and your quality requirements — none of which are visible on a pricing page.

Task-level benchmarking is more work. It's also the only method that produces numbers you can make real decisions with. If you're making model selection decisions without it, you're guessing — and at the scale where these costs matter, guessing is expensive.