The pitch for RAG is compelling and the economics look good on the surface. Instead of fine-tuning a model on your proprietary data — expensive, slow, and brittle as that data changes — you retrieve relevant context at inference time and inject it into the prompt. The model stays general. Your data stays current. Costs seem modest on a per-call basis.
Then you do the math. Not just the generation step math, but all of it. What does it cost to embed your corpus? What happens to that cost when your content changes? How much does the retrieval operation itself cost? What does stuffing retrieved context into the prompt do to your token counts — and your latency? What's the infrastructure cost for the vector database you're running?
The real cost of a RAG pipeline isn't the generation step. It's everything before it. Most teams running RAG in production have never fully accounted for these costs, which means they're making architecture and product decisions on incomplete financial data. This post breaks down where RAG costs actually live and how to model them honestly.
The RAG Cost Stack
A production RAG pipeline typically has five cost-generating layers. Most cost analyses only capture layer five.
Layer 1: Initial Corpus Embedding
Before you can retrieve anything, you need to embed your corpus — every document, chunk, or record that you want to make retrievable. This is an embedding API call for each chunk. For a corpus of any meaningful size, this is not trivial.
Let's put real numbers on it. If you're using OpenAI's text-embedding-3-small at $0.02 per million tokens, embedding a corpus of 500,000 chunks at an average of 400 tokens per chunk (200 million tokens) costs about $4. That sounds cheap. Scale it up: a 50-million-chunk corpus (20 billion tokens) costs roughly $400. The cost scales linearly with total token count, so larger chunks, overlapping chunks, or long-form documents push the figure up proportionally.
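The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the price and chunk sizes are the figures quoted in the text, which you should swap for your own.

```python
def embedding_cost_usd(num_chunks: int,
                       avg_tokens_per_chunk: int,
                       price_per_million_tokens: float) -> float:
    """Estimate the one-off cost of embedding a corpus.

    Assumes flat per-token pricing with no batch discounts or free tier.
    """
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Figures taken from the text: $0.02 per million tokens, 400 tokens per chunk.
    for chunks in (500_000, 50_000_000):
        cost = embedding_cost_usd(chunks, 400, 0.02)
        print(f"{chunks:>12,} chunks: ${cost:,.2f}")
```

Because the model is linear, doubling either the chunk count or the average chunk length doubles the bill.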
More importantly, this is not a one-time cost. You pay it again every time you change embedding models, and embedding models do change. Vectors produced by different models are not comparable, so switching models means re-embedding the entire corpus. When OpenAI released its third-generation embedding models as successors to text-embedding-ada-002, every organization that had built on ada-002 faced that choice. If you made the migration, you paid the initial embedding cost again. If you didn't, you were running an older model.
Layer 2: Ongoing Re-embedding as Content Changes
This is the cost that catches teams most off guard. Your corpus is not static. Documents get updated, new content gets added, old content gets removed or revised. Every chunk that changes needs to be re-embedded. Every new chunk needs to be embedded.
The frequency of this re-embedding depends on your content change rate. For a knowledge base that updates weekly, the cost may be modest. For a corpus that ingests new content daily — a news feed, a ticketing system, a live documentation set — re-embedding costs can be significant and ongoing.
Most teams don't track this cost separately. It runs as a background task, its API calls are often logged under a different service or account than the inference pipeline, and it never makes it into the per-query cost calculation. But it's real, and on a high-churn corpus it can rival the generation cost.
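One way to make this cost visible is to estimate it from the churn rate. The sketch below assumes a hypothetical `daily_churn_fraction` (the share of chunks added or changed per day) and flat per-token pricing; both the function name and the 2% example figure are illustrative assumptions, not measurements.

```python
def monthly_reembedding_cost_usd(num_chunks: int,
                                 avg_tokens_per_chunk: int,
                                 daily_churn_fraction: float,
                                 price_per_million_tokens: float,
                                 days: int = 30) -> float:
    """Estimate the recurring cost of re-embedding changed or new chunks."""
    changed_tokens_per_day = num_chunks * daily_churn_fraction * avg_tokens_per_chunk
    return changed_tokens_per_day * days / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Hypothetical scenario: 50M chunks, 2% of them change or arrive each day.
    cost = monthly_reembedding_cost_usd(50_000_000, 400, 0.02, 0.02)
    print(f"Estimated re-embedding cost: ${cost:,.2f} per month")
```

Note that under these assumptions the monthly re-embedding bill is already more than half the cost of the initial embedding run, and it repeats every month. Pipelines that re-embed whole documents when a single paragraph changes will see a higher effective churn fraction than the raw edit rate suggests.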
Layer 3: Retrieval Infrastructure
You need somewhere to store your embeddings and run similarity searches. That means a vector database or vector-capable search service. The cost structure here depends on your deployment model:
- Managed vector database (Pinecone, Weaviate Cloud, Qdrant Cloud): Typically priced by index size, query volume, and storage. A mid-size production corpus with meaningful query volume can cost several hundred to several thousand dollars per month.
- Self-hosted vector database (Qdrant, Milvus, Chroma): Infrastructure cost instead of SaaS cost. You're paying for the compute and memory to run the service. Vector search is memory-intensive: a corpus of 10 million 1536-dimension float32 vectors requires roughly 60GB of memory just for the raw vectors, before any index overhead, which means non-trivial instance sizing.
- Approximate nearest neighbor search on existing infrastructure (pgvector, OpenSearch): Often lower dedicated cost but may affect the performance and cost of shared infrastructure.
Retrieval infrastructure is largely a fixed cost: it is driven by corpus size rather than query volume, which means the per-query cost looks better at high volume and worse at low volume. Either way, it's real and it needs to be in the model.
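Both points, the memory footprint and the way a fixed bill amortizes over query volume, are easy to check numerically. The following is a rough sketch that assumes uncompressed 4-byte floats and ignores index overhead, replication, and metadata; the $1,000 monthly bill is a hypothetical placeholder.

```python
def raw_vector_memory_gb(num_vectors: int,
                         dimensions: int,
                         bytes_per_value: int = 4) -> float:
    """Memory for the raw vectors only (float32 by default), in decimal GB."""
    return num_vectors * dimensions * bytes_per_value / 1e9


def infra_cost_per_query(monthly_infra_usd: float, queries_per_month: int) -> float:
    """Spread a fixed monthly infrastructure bill across query volume."""
    return monthly_infra_usd / queries_per_month


if __name__ == "__main__":
    print(f"{raw_vector_memory_gb(10_000_000, 1536):.2f} GB of raw vectors")

    # Hypothetical $1,000/month bill at two different query volumes.
    for monthly_queries in (30_000, 3_000_000):
        per_query = infra_cost_per_query(1_000, monthly_queries)
        print(f"{monthly_queries:>10,} queries/month: ${per_query:.5f} per query")
```

Quantization or reduced-dimension embeddings would lower the memory figure, at some cost in recall that needs to be measured for your workload.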
Layer 4: The Retrieval Call Itself
Each query to your RAG pipeline requires an embedding call — you need to embed the query in the same vector space as your corpus before you can retrieve. For text-embedding-3-small, a single query embedding costs a fraction of a cent. But at production scale, this adds up.
If you're running 100,000 queries per day, you're making 100,000 embedding API calls per day just for query embedding, before you've generated a single token of response. At current embedding prices this is still modest, but it's a real cost that doesn't exist in a non-RAG architecture, and it adds to the other layers.
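To see just how modest, here is a small estimate. The 30-token average query length is an assumption made for illustration (the text does not give one), and the function name is hypothetical.

```python
def daily_query_embedding_cost_usd(queries_per_day: int,
                                   avg_tokens_per_query: int,
                                   price_per_million_tokens: float) -> float:
    """Estimate the daily spend on embedding incoming queries."""
    total_tokens = queries_per_day * avg_tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Assumed: 30 tokens per query; price from the text ($0.02 per million tokens).
    daily = daily_query_embedding_cost_usd(100_000, 30, 0.02)
    print(f"Query embedding: ${daily:.2f}/day, ${daily * 30:.2f}/month")
```

In dollar terms this layer is usually the smallest of the five. Its larger effect tends to be the extra network round trip it adds to every request.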
Layer 5: Generation with Retrieved Context
This is the step most teams model. The retrieved context gets injected into the prompt, and the model generates a response. The cost is input tokens (system prompt + retrieved context + query) plus output tokens (the response).
Here's where the hidden cost of retrieval becomes visible in your generation bill: context stuffing. When you retrieve three to five chunks of 400 tokens each to inject into the prompt, you're adding 1,200 to 2,000 tokens to every single input. On a model that charges $3 per million input tokens, adding 1,500 tokens per query costs $0.0045 per query. At 100,000 queries per day, that's $450 per day — $13,500 per month — in additional input tokens from context injection alone.
This cost scales directly with how much context you retrieve. Teams that retrieve aggressively — pulling ten chunks instead of three because recall matters — pay proportionally more. And because this appears in your generation bill rather than your retrieval infrastructure bill, it often gets attributed to model cost rather than to the RAG architecture decision.
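The scaling with retrieval count can be sketched as follows, using the text's figures of 400-token chunks, $3 per million input tokens, and 100,000 queries per day. The function name is hypothetical and the model assumes every query retrieves the same number of chunks.

```python
def context_cost_per_query(chunks_retrieved: int,
                           tokens_per_chunk: int,
                           price_per_million_input_tokens: float) -> float:
    """Input-token cost added to each query by injected context."""
    context_tokens = chunks_retrieved * tokens_per_chunk
    return context_tokens / 1_000_000 * price_per_million_input_tokens


if __name__ == "__main__":
    queries_per_day = 100_000
    for k in (3, 5, 10):
        per_query = context_cost_per_query(k, 400, 3.00)
        per_month = per_query * queries_per_day * 30
        print(f"top-{k:<2}: ${per_query:.4f}/query, ${per_month:,.0f}/month")
```

Going from three chunks to ten more than triples the context bill, which is why retrieval count is worth treating as a budgeted parameter rather than a default.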
How to Model RAG Costs Honestly
A complete cost model for a RAG pipeline needs to capture all five layers. Here's a practical framework:
| Cost Component | Driver | Billing Model | Often Missed? |
|---|---|---|---|
| Initial corpus embedding | Corpus size (tokens) | Per-token API fee | No |
| Re-embedding on content change | Corpus churn rate | Per-token API fee | Yes |
| Re-embedding on model migration | Model lifecycle | Per-token API fee (one-time) | Yes |
| Vector database / index storage | Corpus size, dimensions | Monthly SaaS or infra | Sometimes |
| Query embedding | Query volume | Per-token API fee | Yes |
| Retrieval search | Query volume | Included in DB cost or compute | Sometimes |
| Context injection tokens | Chunks retrieved × chunk size × query volume | Per-token LLM input fee | Sometimes |
| Generation output | Response length × query volume | Per-token LLM output fee | No |
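To get your true cost-per-query, express all eight components over the same period (for example, monthly, with one-time costs amortized over their expected lifetime), sum them, and divide by query volume for that period. For most teams, this number is two to four times higher than the generation-only cost-per-query they've been reporting.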
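A minimal sketch of that calculation is shown below. The class and field names are hypothetical and simply mirror the eight rows of the table; every dollar figure in the example is a placeholder to be replaced with your own billing data.

```python
from dataclasses import dataclass, fields


@dataclass
class MonthlyRagCosts:
    """All eight components, each expressed in USD per month."""
    initial_embedding_amortized: float
    reembedding_content_change: float
    reembedding_model_migration_amortized: float
    vector_db: float
    query_embedding: float
    retrieval_search: float  # 0 if already included in the vector DB bill
    context_injection: float
    generation_output: float

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))


def cost_per_query(costs: MonthlyRagCosts, queries_per_month: int) -> float:
    return costs.total() / queries_per_month


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    costs = MonthlyRagCosts(
        initial_embedding_amortized=35.0,
        reembedding_content_change=240.0,
        reembedding_model_migration_amortized=15.0,
        vector_db=2_000.0,
        query_embedding=2.0,
        retrieval_search=0.0,
        context_injection=13_500.0,
        generation_output=9_000.0,
    )
    print(f"Total: ${costs.total():,.2f}/month")
    print(f"Per query: ${cost_per_query(costs, 3_000_000):.5f}")
```

Keeping each component as a separate field makes it easy to see which layer dominates before deciding where to optimize.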
Where the Real Optimization Opportunities Are
Chunk Size and Retrieval Count
The number of chunks you retrieve and the size of each chunk directly determine your context injection cost. This is tunable. Running a retrieval precision analysis — how often does the relevant context appear in the top-1 vs. top-3 vs. top-5 results — gives you the data to make a cost-quality tradeoff. For many workloads, retrieving three smaller chunks outperforms retrieving five larger ones on both quality and cost.
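The analysis described here is often reported as hit rate (or recall) at k. A minimal sketch follows, assuming you have a labeled evaluation set in which each query has exactly one known relevant chunk ID. The function name and the sample data are hypothetical.

```python
from typing import Sequence


def hit_rate_at_k(ranked_results: Sequence[Sequence[str]],
                  relevant_ids: Sequence[str],
                  k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results.

    ranked_results[i] is the retriever's ranked chunk IDs for query i;
    relevant_ids[i] is the single chunk ID labeled as relevant for query i.
    """
    if len(ranked_results) != len(relevant_ids):
        raise ValueError("ranked_results and relevant_ids must align")
    hits = sum(
        1 for ranked, relevant in zip(ranked_results, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)


if __name__ == "__main__":
    # Tiny made-up evaluation set: three queries, five results each.
    results = [
        ["c1", "c7", "c3", "c9", "c2"],
        ["c4", "c5", "c6", "c8", "c0"],
        ["c2", "c3", "c1", "c5", "c4"],
    ]
    relevant = ["c1", "c6", "c9"]
    for k in (1, 3, 5):
        print(f"hit rate @ {k}: {hit_rate_at_k(results, relevant, k):.2f}")
```

If the hit rate barely improves between top-3 and top-5, the extra two chunks are mostly adding input tokens without adding answers.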
Tiered Retrieval
Not every query needs full corpus retrieval. If you can classify queries at low cost — is this query likely to be answerable from a specific subset of the corpus? — you can route simple queries to a smaller, cheaper index and reserve full retrieval for complex ones. This is operationally more complex but can reduce both retrieval infrastructure costs and context injection costs significantly.
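As a rough illustration of the routing step, the sketch below uses simple keyword matching as the low-cost classifier. The index names and topic keywords are hypothetical; a real deployment might use a small classifier model or metadata filters instead.

```python
import re
from typing import Mapping

# Hypothetical mapping from topic keywords to small, specialized indexes.
TOPIC_TO_INDEX = {
    "billing": "billing_index",
    "invoice": "billing_index",
    "password": "account_index",
}


def choose_index(query: str,
                 topic_to_index: Mapping[str, str] = TOPIC_TO_INDEX,
                 default_index: str = "full_corpus_index") -> str:
    """Route a query to a small index when a topic keyword matches.

    Falls back to the full corpus index when nothing matches.
    """
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    for topic, index_name in topic_to_index.items():
        if topic in words:
            return index_name
    return default_index


if __name__ == "__main__":
    for q in ("How do I reset my password?", "Compare our vendor contracts"):
        print(f"{q!r} -> {choose_index(q)}")
```

The fallback matters: a misrouted query that finds nothing in the small index costs more than one that went to the full index in the first place, so routing accuracy should be measured alongside the savings.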
Embedding Model Selection
Embedding models have their own cost-performance frontier. At list prices, text-embedding-3-small is roughly 6.5x cheaper than text-embedding-3-large ($0.02 vs. $0.13 per million tokens) and performs comparably on many retrieval tasks. Testing whether the cheaper model meets your recall requirements before defaulting to the larger one is basic cost hygiene that many teams skip.
Corpus Hygiene
Embedding and re-indexing stale, duplicate, or irrelevant content wastes money and degrades retrieval quality. A corpus that's been cleaned and deduplicated before embedding costs less to index and produces better retrieval results. This sounds obvious, but most RAG implementations embed the corpus as-is and never revisit it.
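Exact-duplicate removal is the simplest place to start. The sketch below hashes whitespace- and case-normalized text and keeps the first occurrence of each chunk. The function names are hypothetical, and near-duplicates (boilerplate with small edits, for instance) would need a fuzzier technique that this example does not attempt.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate_chunks(chunks: Iterable[str]) -> List[str]:
    """Return chunks with exact duplicates (after normalization) removed.

    Order is preserved and the first occurrence of each chunk is kept.
    """
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


if __name__ == "__main__":
    corpus = ["Reset your password here.", "reset your  password here.", "Contact support."]
    cleaned = deduplicate_chunks(corpus)
    print(f"{len(corpus)} chunks in, {len(cleaned)} chunks out")
```

Storing the digest alongside each embedded chunk also gives you a cheap way to skip re-embedding content that has not actually changed between ingestion runs.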
Caching Query Embeddings
If your query distribution has meaningful repetition — common questions, predictable patterns — caching query embeddings means you don't pay for embedding the same query multiple times. This is a low-complexity optimization with direct per-query cost impact.
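A minimal in-process version of this can be built on the standard library's `functools.lru_cache`. In the sketch below, `embed_via_api` is a stand-in for whatever embedding client you use (it returns a placeholder vector and counts calls), and lowercasing the query before lookup is an assumption that trades a small change in the embedded text for a higher hit rate.

```python
from functools import lru_cache
from typing import Tuple

# Counts how many times the (stubbed) paid API is actually called.
call_counter = {"api_calls": 0}


def embed_via_api(text: str) -> Tuple[float, ...]:
    """Placeholder for a real embedding API call; returns a dummy vector."""
    call_counter["api_calls"] += 1
    return (float(len(text)),)


@lru_cache(maxsize=10_000)
def _cached_embedding(normalized_query: str) -> Tuple[float, ...]:
    # Tuples are immutable, so cached values cannot be modified by callers.
    return embed_via_api(normalized_query)


def embed_query(query: str) -> Tuple[float, ...]:
    """Embed a query, reusing the cached vector for repeated queries."""
    normalized = " ".join(query.lower().split())
    return _cached_embedding(normalized)


if __name__ == "__main__":
    for q in ("How do I reset my password?", "how do I reset my  password?"):
        embed_query(q)
    print(f"API calls made: {call_counter['api_calls']}")
    print(_cached_embedding.cache_info())
```

An in-process cache only helps within a single worker. With multiple replicas, a shared cache keyed on the normalized query would be the natural next step, and `cache_info()` gives you the hit rate needed to decide whether that is worth building.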
What This Means for Architecture Decisions
When you're evaluating whether to use RAG for a new use case, you need to model the full cost stack before committing to the architecture. RAG is not always the cheapest option. For some use cases, the overhead of embedding infrastructure, re-indexing, and context injection makes it more expensive than alternatives — particularly when the corpus is large and high-churn, when retrieval precision is low, or when query volume is high.
The alternatives worth modeling include fine-tuning (higher upfront cost, potentially lower inference cost at scale), context window stuffing for smaller corpora (no retrieval infrastructure needed), and structured retrieval from a conventional database with deterministic lookup rather than semantic search (much cheaper when the retrieval pattern is precise enough to support it).
None of this means RAG is the wrong choice. For the right use cases — large, dynamic corpora with diverse query patterns — it's often the best architecture. But it needs to be chosen with a complete cost model, not just a generation-step estimate.
If you're trying to get this instrumentation in place across your RAG workloads, Oberhahn gives you the attribution layer to track embedding, retrieval, and generation costs as separate components so you're not flying blind when these optimization decisions come up.
The Bottom Line
RAG pipelines are not cheap. They're often cost-effective for what they deliver, but the cost model is substantially more complex than a per-generation-token estimate suggests. Teams that have only been tracking the generation step are underreporting their RAG costs by a significant margin — and making architectural and optimization decisions accordingly.
Do the full math. You might be surprised by what you find.