Building a RAG system that doesn't hallucinate
Retrieval-augmented generation is easy to demo and hard to ship. The retrieval, grounding and evaluation choices that separate a toy from production.
Retrieval-augmented generation (RAG) is the default pattern for putting an LLM on top of your own data. The demo is trivial: embed some documents, stuff the top matches into the prompt, done. Production is where it gets hard, because a confident wrong answer is worse than no answer.
Retrieval is the whole game
Most "the AI hallucinated" complaints are actually retrieval failures. If the right chunk never makes it into the context, the model fills the gap with plausible fiction. We spend the majority of our effort on retrieval quality, not prompt wording.
- Chunking: split on semantic boundaries, not arbitrary character counts.
- Hybrid search: combine vector similarity with keyword search to catch exact terms.
- Re-ranking: a cross-encoder reorders candidates so the best chunk lands first.
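To make the first two items concrete, here is a minimal sketch rather than our production code. It assumes paragraphs (blank-line separated) are a good enough stand-in for "semantic boundaries", and it merges the vector and keyword result lists with reciprocal rank fusion, which is one common way to combine rankings and not the only one. The function names, the `max_chars` limit and the constant `k = 60` are illustrative assumptions.

```python
from collections import defaultdict


def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph.

    Assumption: blank lines mark the semantic boundaries. A single paragraph
    longer than max_chars is kept whole instead of being cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores 1 / (k + rank) in every list it appears in, so a chunk
    ranked well by both vector and keyword search rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

The fused list is what you would hand to the re-ranker: fusion gets the right candidates into the pool cheaply, and the cross-encoder then spends its more expensive scoring on that short list.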
Grounding and refusal
We instruct the model to answer only from the supplied context and to say "I don't know" when the context does not contain the answer. We then show citations back to the user so every claim is traceable to a source.
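A rough sketch of what that instruction and the citation check might look like follows. The prompt wording, the `[n]` citation format and both function names are assumptions made for illustration; the right phrasing depends on the model you use.

```python
import re


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that restricts the model to numbered sources."""
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered sources below.\n"
        "Cite each claim with its source number, for example [1].\n"
        'If the sources do not contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def cited_sources(answer: str, num_chunks: int) -> list[int]:
    """Return the valid source numbers cited in an answer, in first-seen order.

    Numbers outside 1..num_chunks are dropped: a citation to a source that was
    never supplied should not be shown to the user as evidence.
    """
    seen: list[int] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        number = int(match)
        if 1 <= number <= num_chunks and number not in seen:
            seen.append(number)
    return seen
```

An answer that makes claims but yields an empty list from `cited_sources` is a useful signal in its own right: you can flag it or fall back to a refusal instead of showing it as grounded.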
Finally, none of this is trustworthy without evaluation. We maintain a test set of real questions with known answers and score every change. A RAG system you can't measure is a RAG system you can't improve.
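As a minimal sketch of such a harness, assume the test set is a list of (question, expected answer) pairs and the system under test is any callable that maps a question to an answer string. The scoring rule here, a case-insensitive substring match, is a deliberately crude assumption; real scoring often needs something more forgiving, such as a human or model grader.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def score_system(
    test_set: list[tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Return the fraction of questions whose answer contains the expected text.

    answer_fn is a hypothetical stand-in for the full RAG pipeline.
    """
    if not test_set:
        return 0.0
    correct = sum(
        1
        for question, expected in test_set
        if normalize(expected) in normalize(answer_fn(question))
    )
    return correct / len(test_set)
```

Run this before and after each change to chunking, retrieval or prompting, and keep the change only if the score holds or improves.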
