If you have ever built anything that involves feeding documents into an AI model, you have run into the chunking problem. Your document is too long to fit in the context window, or too expensive to send in full every time, so you need to break it into smaller pieces. Simple enough in theory. In practice, how you do that splitting has a surprisingly large impact on the quality of the answers your application produces.
Most tutorials on this topic show you how to split text mechanically: every five hundred words, every thousand characters, or wherever a paragraph ends. What they do not explain is why that approach breaks down in real documents and what to do instead. That is what this article is about.
Why naive chunking produces bad answers
The most common chunking approach is fixed-size splitting, where you divide a document into pieces of a set length, maybe five hundred tokens each, with some overlap between adjacent chunks so context is not completely lost at the boundaries.
This works reasonably well for some types of content, but it fails in predictable ways for others. The core problem is that meaning is not distributed evenly through a text. A document about a complex topic might have three sentences that contain the most important information and pages of supporting detail. A fixed-size splitter has no way of knowing that: it treats every stretch of text as equally important and cuts wherever the count runs out.
The more specific failure mode is when a piece of information spans a chunk boundary. If a sentence starts at the end of one chunk and finishes at the beginning of the next, neither chunk contains the complete thought. A retrieval system that finds the first chunk will give the model an incomplete sentence. A retrieval system that finds the second chunk will give the model a conclusion without the setup. Either way, the model is working with incomplete information and the answer quality suffers.
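To make the boundary problem concrete, here is a minimal sketch of a fixed-size splitter with overlap. It assumes whitespace-separated words are a good enough stand-in for model tokens, which is not true of real tokenizers, and the function name and the tiny chunk size are made up for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = "The refund window is thirty days. After that, only store credit is offered."
for chunk in fixed_size_chunks(sample, chunk_size=6, overlap=2):
    print(repr(chunk))
```

With these toy settings the second chunk begins with "thirty days." and ends partway through the next sentence, which is exactly the kind of fragment a retrieval system can hand to the model.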
Semantic chunking is usually worth the extra effort
Instead of splitting by size, semantic chunking splits by meaning. The idea is to keep related ideas together in the same chunk rather than splitting them based on arbitrary length thresholds. This can be as simple as splitting on paragraph boundaries, which at least keeps individual paragraphs intact, or as sophisticated as using an embedding model to detect when the topic shifts and splitting at those transition points.
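The embedding-based variant can be sketched in a few lines. The version below is only an illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold of 0.1 is an arbitrary value chosen for the example. In a real system you would pass your own embedding function and tune the threshold on your documents.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_on_topic_shift(sentences: list[str], threshold: float = 0.1,
                         embed=toy_embed) -> list[list[str]]:
    """Start a new chunk wherever adjacent sentences fall below `threshold` similarity."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])    # topic shift (or first sentence): new chunk
    return chunks


sentences = [
    "Refunds are issued within five days.",
    "Refunds over fifty dollars need approval.",
    "Shipping takes a week.",
    "Express shipping takes two days.",
]
for group in split_on_topic_shift(sentences):
    print(group)
```

On this sample the two refund sentences end up in one chunk and the two shipping sentences in another, because the only adjacent pair with no shared vocabulary is the one that crosses the topic change.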
For most practical applications, paragraph-based splitting is a good starting point. Paragraphs are usually written to contain a complete thought, which makes them reasonable natural units for retrieval. A paragraph-based chunker that also respects section headings, keeping the heading with the first paragraph of each section, produces chunks that are much more self-contained than fixed-size chunks.
The tradeoff is that paragraph-based chunks have variable sizes, which means some chunks will be much larger than others. You can manage this by setting a maximum chunk size and splitting large paragraphs further when needed, while keeping small paragraphs together if they are part of the same thought.
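A paragraph-based chunker along these lines might look like the sketch below. It rests on a few assumptions that are not from any particular library: paragraphs are separated by blank lines, headings are lines starting with "#", and word counts approximate token counts. A heading is allowed to push its chunk slightly over the limit so that it never gets separated from its first paragraph.

```python
import re


def split_long(words: list[str], max_tokens: int) -> list[str]:
    """Fallback for an oversized paragraph: cut it into max_tokens-word pieces."""
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def paragraph_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words or fewer."""
    chunks: list[str] = []
    current: list[str] = []  # paragraphs collected for the chunk being built
    current_len = 0          # word count of `current`

    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        is_heading = para.startswith("#")  # assumption: Markdown-style headings
        pieces = [para] if is_heading else split_long(para.split(), max_tokens)
        for piece in pieces:
            size = len(piece.split())
            heading_only = len(current) == 1 and current[0].startswith("#")
            too_big = current_len + size > max_tokens and not heading_only
            # Close the current chunk at every heading, or when the piece will not
            # fit, unless the chunk so far is just a heading awaiting its paragraph.
            if current and (is_heading or too_big):
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Small paragraphs under the same heading are merged until the limit is reached, and a new heading always starts a new chunk, so each chunk carries the heading it belongs to.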
According to Stanford HAI's research on information retrieval for language models, the quality of chunking strategy has a measurable impact on retrieval accuracy in RAG systems, with semantically coherent chunks consistently outperforming fixed-size chunks on tasks that require synthesizing information from specific parts of a document.
The overlap question
Most chunking implementations include some overlap between adjacent chunks, where the last hundred tokens of one chunk are repeated at the start of the next. The idea is that if a relevant piece of information sits near a chunk boundary, at least one of the adjacent chunks will contain it in full.
Overlap helps with the boundary problem but it comes at a cost. Every overlapping token is stored and embedded twice, and you pay for it again whenever a chunk that carries it gets retrieved. If you have a ten thousand token document split into twenty chunks with a hundred token overlap, that is about nineteen hundred repeated tokens across the index, and roughly a hundred extra tokens for every chunk from that document that ends up in a prompt.
The right amount of overlap depends on your content. Documents with dense, highly interconnected information benefit from more overlap. Documents with clear section breaks and self-contained paragraphs need less. A hundred to two hundred tokens of overlap is a reasonable starting point for most documents, and you can tune it based on whether you are seeing boundary-related errors in your answers.
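To put numbers on the cost of a given overlap setting, a back-of-the-envelope helper is enough. The function name and the figure of four retrieved chunks per request are assumptions for the example, and the per-request number is a worst case in which none of the retrieved chunks is the first chunk of its document.

```python
def overlap_overhead(num_chunks: int, overlap: int, retrieved: int) -> dict[str, int]:
    """Count the repeated tokens that overlap adds to the index and to one request."""
    return {
        # every chunk except the first begins with `overlap` repeated tokens
        "extra_tokens_stored": max(num_chunks - 1, 0) * overlap,
        # worst case: each retrieved chunk carries a full overlap prefix
        "extra_tokens_per_request": retrieved * overlap,
    }


# Twenty chunks with a hundred tokens of overlap, four chunks retrieved per request
print(overlap_overhead(num_chunks=20, overlap=100, retrieved=4))
```

Doubling the overlap doubles both numbers, which is why it is worth checking for boundary-related errors before raising it.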
Chunk size affects more than just retrieval quality
The size of your chunks affects several things at once, and optimizing for one often comes at the expense of the others. Smaller chunks are cheaper per retrieval because each retrieved chunk uses fewer tokens, and they allow more precise retrieval because a small chunk is more likely to be uniformly relevant to a query. But smaller chunks lose context, and a model that receives a very small chunk might not have enough surrounding information to use it effectively.
Larger chunks give the model more context and are more likely to contain complete thoughts, but they cost more tokens per retrieval and reduce the precision of your retrieval step because a large chunk might be only partially relevant to the query.
For most text documents, chunks in the three hundred to seven hundred token range tend to work well as a starting point. For technical documentation, code, or highly structured content, you might want larger chunks because the content is denser and context matters more. For simple factual content like product descriptions or FAQ entries, smaller chunks often work fine.
The Context Window Visualizer on Prompt Toolbox helps you understand how your chunk sizes relate to the overall context window of the model you are using, which is useful for planning how many chunks you can retrieve per request before you start running into space constraints.
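The same planning can be done with simple arithmetic. This sketch assumes you know your model's context window and have rough token budgets for the prompt and the answer. All four numbers in the example are placeholders, not properties of any specific model.

```python
def max_chunks_per_request(context_window: int, prompt_tokens: int,
                           answer_tokens: int, chunk_tokens: int) -> int:
    """How many chunks of `chunk_tokens` fit after reserving prompt and answer space."""
    available = context_window - prompt_tokens - answer_tokens
    return max(available // chunk_tokens, 0)


# Hypothetical budget: 8,000-token window, 500 for instructions and the question,
# 1,000 reserved for the answer, 700-token chunks (metadata header included)
print(max_chunks_per_request(8000, 500, 1000, 700))
```

Count the metadata header and any overlap as part of `chunk_tokens`, since those tokens take up space in the window like any others.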
Metadata is part of chunking too
One thing that makes a big difference in practice but gets left out of most chunking tutorials is metadata. When you store a chunk in your vector database, you can store additional information alongside the text: the document title, the section heading, the page number, the date the document was last updated, or any other context that helps identify where the chunk came from.
When the model receives a retrieved chunk, it only sees what you give it. If you give it raw text with no indication of where it came from or what document it belongs to, the model has no way to distinguish between information from different sources or to caveat its answer based on the age or authority of the source. Including a brief metadata header with each chunk, something as simple as the document title and section name, gives the model useful context for interpreting the retrieved information.
This adds tokens to each retrieved chunk but it usually improves answer quality enough to be worth it, especially in applications where users might ask questions that span multiple documents and the model needs to synthesize information from different sources coherently.
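A metadata header can be as plain as one bracketed line in front of the chunk text. The field names and the header layout below are illustrative choices, not a standard. Use whatever fields your documents actually have.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document title
    section: str  # heading the chunk sits under
    updated: str  # last-updated date, kept as a plain string here


def render_chunk(chunk: Chunk) -> str:
    """Prefix the chunk text with a one-line header describing its source."""
    header = f"[Source: {chunk.title} | Section: {chunk.section} | Updated: {chunk.updated}]"
    return f"{header}\n{chunk.text}"


example = Chunk(
    text="Refunds are issued within five days of approval.",
    title="Returns Policy",
    section="Refunds",
    updated="2024-01-15",
)
print(render_chunk(example))
```

Rendering the header at retrieval time, rather than storing it inside the chunk text, keeps the metadata out of the embedding and lets you change the format later without re-indexing.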
Testing your chunking strategy
The only reliable way to know if your chunking strategy is working well is to test it with questions that your application will actually be asked. Build a small set of test questions that cover the range of things users might ask, run them through your retrieval system, look at what chunks get retrieved, and ask whether a model with access to only those chunks could give a good answer to the question.
If you consistently find that the right information is split across multiple chunks or that retrieved chunks are only partially relevant, your chunking strategy needs adjustment. If the retrieved chunks consistently contain the information needed to answer the question, you are in good shape.
This kind of manual inspection takes time but it surfaces problems that automated metrics often miss, and it gives you a much clearer intuition for how your retrieval system is actually behaving than any benchmark score will.
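A small script can narrow down which questions deserve that manual look. The sketch below assumes you can write, for each test question, a phrase that must appear in the retrieved text for a good answer to be possible. `toy_retrieve` is a keyword-overlap stand-in for your real retrieval system, and the chunks and questions are invented for the example.

```python
from typing import Callable


def find_misses(test_cases: list[tuple[str, str]],
                retrieve: Callable[[str], list[str]]) -> list[str]:
    """Return the questions whose expected phrase is absent from every retrieved chunk."""
    misses = []
    for question, expected_phrase in test_cases:
        chunks = retrieve(question)
        if not any(expected_phrase.lower() in chunk.lower() for chunk in chunks):
            misses.append(question)
    return misses


CHUNKS = [
    "Refunds are issued within five days of approval.",
    "Standard shipping takes a week.",
]


def toy_retrieve(question: str) -> list[str]:
    """Demonstration only: return the chunk sharing the most words with the question."""
    q_words = set(question.lower().split())
    return [max(CHUNKS, key=lambda c: len(q_words & set(c.lower().split())))]


cases = [
    ("How long do refunds take?", "five days"),
    ("How long does standard shipping take?", "a week"),
    ("Do you ship abroad?", "international"),
]
print(find_misses(cases, toy_retrieve))
```

A substring check like this only tells you the information was present, not that the chunk was usable on its own, so treat the list of misses as a starting point for reading the retrieved chunks yourself.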
