How to Evaluate a RAG System — Chain of Thought

How do you evaluate a RAG system?

Evaluate retrieval and generation separately, because they fail differently. For retrieval, ask whether the right documents came back — measure context relevance and recall. For generation, ask whether the answer is grounded in what was retrieved and actually answers the question — measure faithfulness (no claims beyond the sources) and answer relevance. A RAG system can retrieve perfectly and still hallucinate, or generate beautifully from the wrong documents, so a single end-to-end score hides which half is broken.

Jun 16, 2026 · Chain of Thought

Split the system before you score it

RAG has two stages — retrieve, then generate — and they break for different reasons. If you only measure the final answer, you can’t tell whether a bad response came from fetching the wrong documents or from a model that ignored good ones. Evaluate the two halves separately so you know which one to fix.

Measuring retrieval

The question for retrieval is simple: did the right context come back? Context relevance checks whether the retrieved chunks actually bear on the query; recall checks what share of the documents you needed were found at all. Weak retrieval poisons everything downstream, because even the best model can’t answer from context it never received.
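As a rough illustration, both retrieval metrics can be computed from a labeled test case. This is a minimal sketch, assuming you have hand-labeled the IDs of the documents each query needs. The function names and document IDs are hypothetical, and context relevance is approximated here as precision over those labels; in practice it is often scored by a human or an LLM judge instead.

```python
from typing import Iterable


def context_recall(retrieved_ids: Iterable[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of the needed documents that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def context_relevance(retrieved_ids: Iterable[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of retrieved chunks labeled relevant (precision as a proxy)."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


# Hypothetical test case: four chunks came back, three documents were needed.
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant = ["doc_a", "doc_c", "doc_e"]

recall = context_recall(retrieved, relevant)        # 2 of 3 needed docs found
relevance = context_relevance(retrieved, relevant)  # 2 of 4 retrieved chunks relevant
print(f"recall={recall:.2f} relevance={relevance:.2f}")
```

Keeping the two numbers separate matters: a retriever that returns many chunks can score high recall while burying the useful ones in noise, which shows up only in the relevance score.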

Measuring generation

Given the retrieved context, two things matter. Faithfulness asks whether every claim in the answer is supported by the sources, with nothing invented; this is the anti-hallucination check. Answer relevance asks whether the response actually addresses the user’s question rather than reciting related but off-topic material. An answer can be faithful to its sources and still miss the question.
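One way to make faithfulness concrete is to split the answer into claims and count how many the sources support. The sketch below assumes a naive one-claim-per-sentence splitter and a pluggable judge. The `overlap_judge` shown is a toy token-overlap stand-in, not a real entailment check; a production setup would more likely pass in an LLM or NLI-based judge. All names here are hypothetical.

```python
import re
from typing import Callable, List


def split_claims(answer: str) -> List[str]:
    """Naive claim splitter: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(claim: str, contexts: List[str], threshold: float = 0.8) -> bool:
    """Toy stand-in for a real judge: a claim counts as supported when
    most of its tokens appear somewhere in the retrieved contexts."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return False
    context_tokens = set().union(*(_tokens(c) for c in contexts))
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def faithfulness(
    answer: str,
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool] = overlap_judge,
) -> float:
    """Fraction of claims in the answer that the judge marks as supported."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c, contexts)) / len(claims)


contexts = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens."
score = faithfulness(answer, contexts)
print(f"faithfulness={score:.2f}")  # one of the two claims is unsupported
```

Answer relevance can follow the same shape, with a judge that compares the answer to the question instead of to the sources. Scoring the two separately is what surfaces an answer that is well grounded but off the question.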

Putting it together

Track retrieval and generation metrics side by side, on your own data, and re-run them on every change. When quality drops, the split tells you immediately whether to tune the retriever or the generation step (the prompt or the model), instead of guessing at a single number that went down.
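A regression check along these lines might look like the following. It is a sketch under stated assumptions: the `RagScores` container, the `diagnose` helper, the tolerance, and the example numbers are all illustrative, not measurements from a real system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RagScores:
    context_relevance: float
    context_recall: float
    faithfulness: float
    answer_relevance: float


RETRIEVAL = ("context_relevance", "context_recall")
GENERATION = ("faithfulness", "answer_relevance")


def diagnose(baseline: RagScores, current: RagScores, tolerance: float = 0.02) -> Dict[str, List[str]]:
    """List the metrics in each stage that dropped by more than `tolerance`."""

    def regressed(names):
        return [n for n in names if getattr(baseline, n) - getattr(current, n) > tolerance]

    return {"retrieval": regressed(RETRIEVAL), "generation": regressed(GENERATION)}


# Illustrative numbers only: recall fell sharply, everything else held steady.
baseline = RagScores(context_relevance=0.82, context_recall=0.90,
                     faithfulness=0.95, answer_relevance=0.88)
current = RagScores(context_relevance=0.81, context_recall=0.71,
                    faithfulness=0.94, answer_relevance=0.87)

report = diagnose(baseline, current)
print(report)
```

In this example only the retrieval list is non-empty, which points at the retriever rather than the prompt. Running the same comparison on every change keeps that signal available when a regression appears.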

