If you have spent any time reading about AI applications beyond basic chatbots, you have probably run into the term RAG. It gets thrown around a lot in developer circles and AI product discussions, sometimes explained well and sometimes used as if everyone already knows what it means. This article explains what it actually is, why it matters for building useful AI applications, and something that most RAG explainers skip entirely: what it does to your token usage and costs.
The problem RAG solves
AI models are trained on data up to a certain point in time, and they have no knowledge of anything that happened after that cutoff. They also have no knowledge of your specific data, your company's documents, your product catalog, your internal policies, or anything else that exists only in your systems.
This creates a real limitation for a lot of practical applications. A customer support bot that cannot answer questions about your specific products is not very useful. A research assistant that does not know about documents published after its training cutoff is missing a lot of relevant information. An internal tool that cannot access your company's knowledge base is solving a much smaller problem than it could be.
The traditional solution was to fine-tune the model on your specific data, which is expensive, time-consuming, and requires retraining every time your data changes. RAG is a different approach that solves the same problem much more practically for most use cases.
What RAG actually stands for and what it does
RAG stands for Retrieval-Augmented Generation. The name describes exactly what it does. Before the model generates a response, a retrieval step finds the most relevant pieces of information from your data and adds them to the prompt as context. The model then generates its response based on both its training and the retrieved information.
The practical flow looks like this. A user asks a question. Your system takes that question, searches your document database for the most relevant passages or chunks, pulls those passages out, and adds them to the prompt along with the user's question. The model reads all of that context and generates an answer based on what it found.
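As a rough illustration of that flow, here is a minimal Python sketch. It assumes a toy word-overlap score (cosine similarity over word counts) in place of the embedding search a real system would typically use. The function names `retrieve` and `build_prompt`, the sample chunks, and the prompt wording are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    # Rank every chunk against the question and keep the best top_k.
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(q_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    # Put the retrieved passages in front of the user's question.
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(retrieved))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five to seven business days.",
]
question = "How long does shipping to Europe take?"

top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what you would send to the model
```

The model call itself is left out on purpose, since it depends on which provider and client library you use.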
The result is a model that can answer questions about your specific data, give responses grounded in your actual documents, and stay current as long as you keep your document database updated, all without needing to retrain anything.
Why this is better than just putting all your documents in the prompt
The obvious alternative to retrieval is to just include all your documents in the prompt every time. If you have a knowledge base the model needs to reference, why not put the whole thing in the context window and let the model find what it needs?
For small knowledge bases this actually works fine. But it breaks down quickly as the amount of information grows. A large knowledge base might be millions of tokens, which exceeds the context window of most current models. Even if your knowledge base fits in the context window, sending the entire thing with every request is extremely expensive because you are paying for all those input tokens on every single call.
Retrieval solves this by being selective. Instead of sending everything, you send only the pieces that are most likely to be relevant to the specific question being asked. A well-designed retrieval step might pull three to ten chunks of text that together are a few thousand tokens, rather than an entire knowledge base that might be millions of tokens. The model gets the information it needs without you paying for everything it does not need.
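To put rough numbers on that difference, here is a small Python sketch. Every figure in it is an assumption chosen for illustration: the price per million input tokens, the knowledge base size, the retrieved context size, and the request volume are not taken from any real provider or application.

```python
# Hypothetical price; check your provider's current pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def daily_input_cost(tokens_per_request, requests_per_day,
                     price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    # Input tokens are billed on every request, so cost scales with both.
    return tokens_per_request * requests_per_day * price_per_million / 1_000_000


knowledge_base_tokens = 200_000   # assumed: small enough to fit in a large context window
retrieved_tokens = 4_000          # assumed: a handful of retrieved chunks
requests_per_day = 10_000         # assumed traffic

send_everything = daily_input_cost(knowledge_base_tokens, requests_per_day)
send_retrieved = daily_input_cost(retrieved_tokens, requests_per_day)

print(f"Whole knowledge base each time: ${send_everything:,.2f} per day")
print(f"Retrieved chunks only:          ${send_retrieved:,.2f} per day")
```

Under these assumptions the retrieved context is one fiftieth the size of the knowledge base, so the input cost for that part of the prompt drops by the same factor.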
According to MIT Technology Review's coverage of enterprise AI applications, RAG has become the dominant architecture for knowledge-intensive AI applications in production, largely because it balances retrieval accuracy with cost efficiency in a way that alternatives like full fine-tuning do not.
What RAG does to your token counts
This is the part most RAG explainers skip, and it matters a lot if you are trying to build a cost model for a RAG-based application.
Every chunk of retrieved text that you add to your prompt is additional input tokens that you pay for on every request. If your retrieval step pulls three chunks of five hundred tokens each, that is fifteen hundred tokens of retrieved context added to every prompt, on top of your system prompt, conversation history, and the user's message.
In a well-designed RAG system, this is still much cheaper than sending your entire knowledge base every time. But it is meaningfully more expensive than a simple prompt without retrieval, and the cost scales with how much context your retrieval step pulls in.
The chunk size you use in your retrieval system directly affects your token costs. Larger chunks mean more context per retrieved piece, which is sometimes useful for giving the model enough surrounding information to understand a passage, but it also means more tokens per request. Smaller chunks are cheaper but might not give the model enough context to use the information effectively. Finding the right chunk size for your specific content is one of the key tuning decisions in any RAG implementation.
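The effect of chunk size on tokens sent can be sketched in a few lines of Python. This is a simplification under stated assumptions: it counts whitespace-separated words as a stand-in for tokens (real tokenizers count differently), it splits at fixed word counts rather than at sentence or section boundaries, and the helper name `chunk_words` is made up for this example.

```python
def chunk_words(text, chunk_size):
    # Split text into consecutive chunks of at most chunk_size words.
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A stand-in document of 3,000 words.
document = " ".join(f"word{i}" for i in range(3000))
top_k = 3  # assumed number of chunks retrieved per request

context_words_by_size = {}
for size in (100, 300, 600):
    chunks = chunk_words(document, size)
    # Words added to the prompt if top_k chunks of this size are retrieved.
    context_words_by_size[size] = sum(len(c.split()) for c in chunks[:top_k])
    print(f"chunk size {size}: {len(chunks)} chunks, "
          f"{context_words_by_size[size]} words of context per request")
```

With the number of retrieved chunks held fixed, the context added per request grows in direct proportion to the chunk size.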
The number of chunks you retrieve also matters. Retrieving ten chunks gives the model more to work with than retrieving three, but it also adds more than three times as many retrieved tokens. Your total input cost grows by somewhat less than that, because the overhead of the rest of your prompt is the same either way. Many production RAG systems start with three to five chunks and tune from there based on answer quality versus cost.
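To see how the fixed part of the prompt interacts with the number of chunks, here is a short Python sketch. The five-hundred-token chunk size comes from the example above. The one thousand tokens of fixed overhead (system prompt, history, and user message combined) is an assumption picked for illustration.

```python
CHUNK_TOKENS = 500        # chunk size from the example above
FIXED_OVERHEAD = 1_000    # assumed: system prompt + history + user message


def input_tokens(num_chunks, chunk_tokens=CHUNK_TOKENS,
                 fixed_overhead=FIXED_OVERHEAD):
    # Total input tokens for one request.
    return fixed_overhead + num_chunks * chunk_tokens


three = input_tokens(3)    # 1,000 + 1,500 = 2,500
ten = input_tokens(10)     # 1,000 + 5,000 = 6,000

retrieved_ratio = (10 * CHUNK_TOKENS) / (3 * CHUNK_TOKENS)
total_ratio = ten / three

print(f"3 chunks: {three} input tokens, 10 chunks: {ten} input tokens")
print(f"retrieved context grows {retrieved_ratio:.2f}x, "
      f"total input grows {total_ratio:.2f}x")
```

In this sketch the retrieved context grows by about 3.3 times while the total input grows by 2.4 times. The larger the fixed part of your prompt, the smaller that second number becomes.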
How to think about context window usage in RAG applications
The Context Window Visualizer on Prompt Toolbox is particularly useful for RAG applications because you are managing several different sources of tokens simultaneously: your system prompt, the retrieved chunks, the conversation history if you maintain one, the user's message, and the model's response.
Each of those components competes for space in the context window, and understanding how much space each one typically uses helps you make better decisions about chunk sizes, how much history to maintain, and how detailed your system prompt needs to be.
A common mistake in early RAG implementations is designing the system without accounting for how all these components add up. You might have a system prompt that uses eight hundred tokens, retrieve five chunks that together use three thousand tokens, maintain a conversation history that grows over time, and then wonder why you are hitting context limits or paying more than expected.
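One way to catch this early is to write the budget down before building anything. The Python sketch below uses the eight hundred token system prompt and three thousand tokens of retrieved chunks from the example above. Everything else is an assumption for illustration: the context limit, the user message size, the space reserved for the response, and the tokens each conversation turn adds to history.

```python
CONTEXT_LIMIT = 8_000      # hypothetical model limit
SYSTEM_PROMPT = 800        # from the example above
RETRIEVED_CHUNKS = 3_000   # from the example above
USER_MESSAGE = 100         # assumed
RESPONSE_RESERVE = 500     # assumed space left for the model's answer
TOKENS_PER_TURN = 400      # assumed history growth per completed turn


def tokens_used(turns_of_history):
    # Everything that must fit in the context window for one request.
    return (SYSTEM_PROMPT + RETRIEVED_CHUNKS + USER_MESSAGE
            + RESPONSE_RESERVE + turns_of_history * TOKENS_PER_TURN)


def max_history_turns(limit=CONTEXT_LIMIT):
    # Largest number of past turns that still fits under the limit.
    fixed = tokens_used(0)
    return max(0, (limit - fixed) // TOKENS_PER_TURN)


print(f"Fixed cost per request: {tokens_used(0)} tokens")
print(f"History turns that fit: {max_history_turns()}")
```

With these numbers, 4,400 tokens are spent before any history is included, and the conversation hits the limit after nine turns unless older turns are trimmed or summarized.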
What retrieval quality means for token efficiency
An easily overlooked driver of token cost is how well your retrieval step finds relevant information. A retrieval system that consistently finds the right chunks means you can get good answers with fewer retrieved chunks, which means fewer tokens per request. A retrieval system that is less precise needs to pull more chunks to have a good chance of including the right information, which means more tokens and higher costs for the same quality of answers.
Investing in retrieval quality is not just an accuracy decision; it is also a cost decision. Better embeddings, better chunking strategies, and better similarity search all reduce the number of chunks you need to retrieve to get good answers, and that reduction compounds across every request your application handles.
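One way to make that link concrete is to measure how many chunks each retriever needs before the right one is usually included. The Python sketch below assumes you have a small set of test questions and know, for each, the position at which the correct chunk appeared in the ranked results. The rank lists, the ninety percent target, and the helper name `smallest_k_for_recall` are all invented for this example.

```python
def smallest_k_for_recall(ranks, target=0.9, max_k=10):
    # ranks: 1-based position of the correct chunk for each test question,
    # or None if it was never returned. Returns the smallest k at which
    # the correct chunk is in the top k for at least `target` of questions.
    for k in range(1, max_k + 1):
        hits = sum(1 for r in ranks if r is not None and r <= k)
        if hits / len(ranks) >= target:
            return k
    return None


# Made-up evaluation results for ten test questions.
precise_retriever = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2]
noisy_retriever = [3, 1, 6, 2, 8, 4, 1, 7, 5, 2]

CHUNK_TOKENS = 500  # assumed chunk size

for name, ranks in (("precise", precise_retriever), ("noisy", noisy_retriever)):
    k = smallest_k_for_recall(ranks)
    print(f"{name}: retrieve {k} chunks, about {k * CHUNK_TOKENS} tokens per request")
```

With this invented data, the precise retriever reaches the target with two chunks while the noisy one needs seven, which is the kind of gap that shows up directly in per-request input tokens.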
Where RAG fits in the broader picture
RAG is not the right solution for every problem. If your application does not need access to external documents or specific private data, you do not need it and the added complexity is not worth it. If your knowledge base is small enough to fit in a context window and your request volume is low enough that the cost of including it every time is manageable, the simpler approach might be fine.
But for applications that need to be grounded in specific, current, or proprietary information, RAG is the most practical architecture available right now. Understanding how it affects your token usage lets you implement it in a way that is both effective and financially sustainable, rather than discovering the cost implications after you have already built everything around it.
