Retrieval-Augmented Generation (RAG) has become one of the dominant patterns for building LLM-powered applications that need to answer questions from a specific knowledge base. The basic idea is appealing in its simplicity: embed your documents, retrieve the relevant chunks at query time, and pass them to an LLM as context. But between the simple concept and a system that performs reliably in production, there are dozens of decisions that meaningfully affect quality.
This post covers the architectural decisions we've found most impactful across the RAG systems we've built — including several that aren't obvious until you've been burned by them.
1. Chunking Strategy Is Not an Afterthought
Most tutorials suggest splitting documents into fixed-size chunks with some overlap. This works well enough for demos but degrades significantly in production when your knowledge base contains structured content — policy documents, technical specs, FAQs, or anything with meaningful section boundaries.
We've had better results with a hierarchical chunking approach: parse documents into their natural sections first (using headers, paragraph boundaries, or semantic breaks), then chunk within sections. This preserves context that fixed-size chunking destroys — a chunk that contains half of one section and half of the next is often worse than either section alone.
For highly structured content (tables, lists, step-by-step procedures), treat each structure as its own chunk and index it separately. In our experience, retrieval accuracy for tabular data is markedly better when the table is embedded as a unit rather than split mid-row.
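The section-first approach described above can be sketched roughly as follows. This is a minimal illustration, not our production parser: it assumes Markdown-style `#` headings mark section boundaries, uses whitespace word counts as a stand-in for token counts, and the names `Chunk`, `split_sections`, and `chunk_document` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: sections are introduced by Markdown-style headings ("# Title").
HEADING = re.compile(r"#{1,6}\s+\S")


@dataclass
class Chunk:
    section: str  # heading of the section this chunk came from
    text: str


def split_sections(doc: str) -> list[tuple[str, str]]:
    """Split a document into (heading, body) pairs at heading lines."""
    sections: list[tuple[str, str]] = []
    heading, lines = "", []
    for line in doc.splitlines():
        if HEADING.match(line):
            body = "\n".join(lines).strip()
            if heading or body:
                sections.append((heading, body))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if heading or body:
        sections.append((heading, body))
    return sections


def chunk_document(doc: str, max_words: int = 120) -> list[Chunk]:
    """Pack paragraphs into chunks of up to max_words, never crossing a section."""
    chunks: list[Chunk] = []
    for heading, body in split_sections(doc):
        current, count = [], 0
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            n = len(para.split())  # rough proxy for token count
            if current and count + n > max_words:
                chunks.append(Chunk(heading, "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append(Chunk(heading, "\n\n".join(current)))
    return chunks
```

Because each chunk carries its section heading, the heading can be prepended to the text before embedding or stored as metadata. Tables and lists would need their own detection step, which this sketch leaves out.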
2. Use Metadata Filtering Before Semantic Search
Pure semantic search across a large corpus is slower and less precise than it needs to be. In most real applications, the query and the context around it carry implicit filters: a question about "refund policy" from a user on the Enterprise plan should retrieve Enterprise-tier policy documents, not the generic policy page.
We build metadata filtering as a first pass before semantic retrieval. Document type, product tier, date range, department, language — any structured attribute of the document that can narrow the search space should be extracted at index time and used to pre-filter at query time. This both improves precision and reduces latency.
An LLM-based query parser can extract these filter parameters from natural language queries before they hit the vector search — a lightweight but high-leverage step.
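To make the ordering of the two passes concrete, here is a small in-memory sketch: filter on metadata first, then rank only the survivors by cosine similarity. The `Doc` class, the `filtered_search` function, and the two-dimensional toy embeddings are all hypothetical. Many vector stores accept a metadata filter alongside the query vector, in which case that built-in filter would replace the list comprehension below.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_embedding: list[float],
    docs: list[Doc],
    filters: dict[str, str],
    top_k: int = 5,
) -> list[Doc]:
    """Pre-filter on exact metadata matches, then rank survivors by similarity."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return candidates[:top_k]


# Toy data: the filters dict stands in for what a query parser would extract.
docs = [
    Doc("Generic refund policy", [1.0, 0.0], {"tier": "free", "type": "policy"}),
    Doc("Enterprise refund policy", [0.9, 0.1], {"tier": "enterprise", "type": "policy"}),
    Doc("Enterprise onboarding guide", [0.0, 1.0], {"tier": "enterprise", "type": "guide"}),
]
hits = filtered_search([1.0, 0.0], docs, {"tier": "enterprise"}, top_k=2)
```

Without the filter, the generic policy page would rank first for this query vector. With it, the generic page never enters the candidate set.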
3. Reranking Is Almost Always Worth It
Embedding-based retrieval is good at finding semantically related content but poor at ranking by relevance for a specific query. The top result from a vector search is often not the most useful chunk for answering the question — it's just the most similar in embedding space.
Adding a reranking step consistently improves answer quality in our testing. A cross-encoder model reads the query and each retrieved chunk together and scores the pair directly, rather than comparing embeddings that were computed independently. We typically retrieve a larger candidate set (top 20–30) from vector search, then rerank and pass only the top 5–8 chunks to the LLM. The additional latency (typically 100–300 ms) is worth the quality improvement in almost every production use case we've encountered.
Off-the-shelf reranking models like Cohere Rerank or open-source cross-encoders from the sentence-transformers library are a good starting point. Fine-tuned domain-specific rerankers yield additional gains for specialized corpora.
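The retrieve-then-rerank flow can be sketched independently of any particular model. In the example below, `retrieve_then_rerank` is a hypothetical helper that takes any scoring function. `token_overlap` is a deliberately crude stand-in so the sketch runs without dependencies. In a real pipeline the scorer would wrap a cross-encoder or a hosted reranking API.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair with score_fn and keep the top_n chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query tokens found in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# candidates would normally be the top 20-30 hits from vector search.
candidates = [
    "Shipping times vary by region",
    "Refund requests must be filed within 30 days",
    "Enterprise refund policy allows 60 days",
]
top = retrieve_then_rerank("enterprise refund policy", candidates, token_overlap, top_n=2)
```

Keeping the scorer behind a plain function signature makes it easy to swap an off-the-shelf reranker for a fine-tuned one later without touching the rest of the pipeline.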
4. Retrieval Evaluation Is Non-Negotiable
One of the most common failure modes we see in RAG projects is teams evaluating only the final LLM output — "does the answer sound right?" — without separately evaluating retrieval quality. This makes it nearly impossible to debug failures, because you can't tell whether a bad answer came from a retrieval failure (wrong chunks surfaced) or a generation failure (right chunks, wrong synthesis).
We build separate evaluation pipelines for retrieval and generation. For retrieval, we maintain a test set of queries paired with ground-truth relevant documents, and track two metrics as the retrieval pipeline evolves: recall@k (the share of the relevant documents that appear in the top k results) and precision@k (the share of the top k results that are relevant). For generation, we use LLM-as-judge evaluation with rubrics for factual accuracy, groundedness, and completeness.
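The two retrieval metrics are simple enough to compute without a framework. The sketch below assumes documents are identified by string IDs and that `retrieve` is whatever function returns a ranked list of IDs for a query. Both are assumptions for illustration, as is the `evaluate` helper itself.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def evaluate(
    test_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over (query, relevant_ids) pairs."""
    if not test_set:
        return {}
    recalls, precisions = [], []
    for query, relevant in test_set:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        precisions.append(precision_at_k(retrieved, relevant, k))
    n = len(test_set)
    return {"recall@k": sum(recalls) / n, "precision@k": sum(precisions) / n}
```

Running `evaluate` on every change to chunking, filtering, or reranking gives a retrieval-only signal that is independent of how the LLM phrases its answer.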
Frameworks like RAGAS provide useful off-the-shelf metrics for both retrieval and generation evaluation and integrate well into CI/CD pipelines.
5. Keep the Context Window Clean
More retrieved context is not always better. LLMs are susceptible to the "lost in the middle" problem: relevant information buried in the middle of a long context tends to be used less reliably than content at the beginning and end of the prompt. Padding the context window with loosely relevant chunks degrades answer quality even when the truly relevant chunk is present.
We apply a strict context budget: calculate the expected token count of retrieved chunks before passing them to the LLM, and trim aggressively if it exceeds the target. In practice, 4–6 high-quality chunks usually outperform 15–20 loosely filtered ones.
Key Takeaways
- Chunk documents by semantic structure, not fixed token counts
- Use metadata filtering to pre-narrow the retrieval space before vector search
- Add a reranking step — the latency cost is almost always worth it
- Evaluate retrieval and generation separately to isolate failure modes
- Favor fewer, higher-quality chunks over large, unfocused context windows
