AI · May 22, 2026 · 4 min read

Building a RAG system that doesn't hallucinate

Retrieval-augmented generation is easy to demo and hard to ship. The retrieval, grounding and evaluation choices that separate a toy from production.

Chitt Bhavsar

Retrieval-augmented generation (RAG) is the default pattern for putting an LLM on top of your own data. The demo is trivial: embed some documents, stuff the top matches into the prompt, done. Production is where it gets hard, because a confident wrong answer is worse than no answer.

Retrieval is the whole game

Most "the AI hallucinated" complaints are actually retrieval failures. If the right chunk never makes it into the context, the model fills the gap with plausible fiction. We spend the majority of our effort on retrieval quality, not prompt wording.

RAG pipeline
  • Chunking: split on semantic boundaries, not arbitrary character counts.
  • Hybrid search: combine vector similarity with keyword search to catch exact terms.
  • Re-ranking: a cross-encoder reorders candidates so the best chunk lands first.
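To make the first two steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the pipeline described in this article: blank lines stand in for "semantic boundaries", and reciprocal rank fusion (RRF) is one common way to combine a vector ranking with a keyword ranking. The function names, the `max_chars` budget and the constant `k = 60` are illustrative choices. Cross-encoder re-ranking is left out because it needs a trained model.

```python
import re
from collections import defaultdict


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks.

    Paragraph breaks are used here as a simple stand-in for semantic
    boundaries. A single paragraph longer than max_chars is kept whole
    rather than cut mid-sentence, so chunks can exceed the budget.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # +2 accounts for the blank line that joins two paragraphs.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so an
    id that ranks well in both vector and keyword search rises to the top.
    Ties are broken by id to keep the output deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["d1", "d2", "d3"]   # hypothetical ids from vector search
    keyword_hits = ["d3", "d1", "d4"]  # hypothetical ids from keyword search
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Because RRF works on ranks rather than raw scores, it avoids having to normalise cosine similarities against keyword scores, which live on different scales. The fused list would then be the candidate set handed to a re-ranker.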

Grounding and refusal

We instruct the model to answer only from the supplied context and to say "I don't know" when the context is thin. Then we show citations back to the user so every claim is traceable to a source.

Answer with citations
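The grounding and refusal rule can be sketched as a prompt builder plus a citation check. This is a hypothetical illustration: the prompt wording, the `[n]` citation format and the helper names are assumptions, not the exact prompt used in the system described here, and the call to the model itself is left out.

```python
import re
from typing import Optional

REFUSAL = "I don't know"

# Each chunk is a (source_id, text) pair; the source ids are made up here.
Chunk = tuple[str, str]


def build_grounded_prompt(question: str, chunks: list[Chunk]) -> Optional[str]:
    """Build a prompt that restricts the model to the supplied context.

    Returns None when there is no context at all, so the caller can
    refuse immediately instead of asking the model to guess.
    """
    if not chunks:
        return None
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered context below. "
        "Cite the passages you used as [n]. "
        f'If the context does not contain the answer, reply exactly "{REFUSAL}".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def cited_sources(answer: str, chunks: list[Chunk]) -> list[str]:
    """Map [n] markers in the answer back to source ids.

    Markers that point outside the supplied context are ignored.
    """
    sources: list[str] = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if 1 <= index <= len(chunks):
            source = chunks[index - 1][0]
            if source not in sources:
                sources.append(source)
    return sources


def is_grounded(answer: str, chunks: list[Chunk]) -> bool:
    """Accept an answer only if it refuses or cites at least one real chunk."""
    return answer.strip() == REFUSAL or bool(cited_sources(answer, chunks))
```

A check like `is_grounded` only confirms that a citation is present and points at a real chunk. It does not verify that the cited passage actually supports the claim, which is part of what evaluation has to catch.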

Finally, none of this is trustworthy without evaluation. We maintain a test set of real questions with known answers and score every change. A RAG system you can't measure is a RAG system you can't improve.
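A test set like this can start very small. The sketch below is one possible shape, with assumptions labelled: `EvalCase`, `accuracy` and the substring-match scoring are illustrative choices, and `answer_fn` stands in for whatever function wraps the full RAG pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # a phrase the correct answer must contain


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def accuracy(answer_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose answer contains the expected phrase.

    Substring matching is a deliberately crude metric chosen for this
    sketch; it is an assumption, not the scoring used in the article.
    """
    if not cases:
        raise ValueError("the test set is empty")
    hits = sum(
        normalize(case.expected) in normalize(answer_fn(case.question))
        for case in cases
    )
    return hits / len(cases)


if __name__ == "__main__":
    # A stub standing in for the real pipeline, for illustration only.
    def stub_answer(question: str) -> str:
        return "Refunds are issued within 14 days [1]." if "refund" in question.lower() else "I don't know"

    cases = [
        EvalCase("How long do refunds take?", "14 days"),
        EvalCase("How long does shipping take?", "3 days"),
    ]
    print(f"accuracy: {accuracy(stub_answer, cases):.2f}")
```

Running the same cases before and after every change to chunking, retrieval or the prompt turns "it feels better" into a number that can be compared.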