Faithfulness is a binary property of an LLM-generated answer: is every claim in the answer supported by the context that was passed in? A faithful answer doesn’t have to be globally true; it has to be derivable from the retrieved documents. An unfaithful answer asserts something the context doesn’t support — usually because the model drew on parametric memory or hallucinated outright.
For RAG systems shipping into legal, medical, financial, or other regulated workflows, faithfulness is the load-bearing reliability metric. Accuracy is what you wish you could verify; faithfulness is what you actually can. Every claim either traces to a span in the retrieved context or it doesn’t, and that property is locally testable, which is exactly what regulated workflows need.
Faithfulness vs adjacent concepts
Three terms get conflated in the literature:
Faithfulness (also “groundedness”) — claim is supported by the retrieved context.
Accuracy — claim is true in the world.
Relevance — retrieved documents are relevant to the question.
A pretrained model can answer “What’s the capital of France?” accurately without retrieval, but the answer is unfaithful if no retrieved doc contained it. Conversely, you can write a perfectly faithful summary of a document that’s itself wrong — the answer reflects the source, even when the source is bad. Most production systems care about faithfulness and accuracy, but only faithfulness is locally testable from the (context, answer) pair.
How faithfulness is measured
The standard recipe (RAGAS, TruLens, DeepEval all implement variants):
Decompose the answer into atomic claims. “The drug reduced LDL by 15% in a 6-month trial” → claim 1 (LDL reduction of 15%) + claim 2 (6-month trial). Done by an LLM extractor.
For each claim, check the retrieved context. Three-way classification: entailed / not entailed / contradicted. An LLM judge handles this — tell it to find a supporting span or return “no support”.
Aggregate. Per-answer faithfulness = fraction of claims entailed. Per-corpus faithfulness = mean across answers.
A claim that doesn’t appear in the context but matches the model’s parametric knowledge is unfaithful even if it’s true — the system can’t cite it.
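The aggregation step can be sketched in a few lines. This is a minimal illustration, assuming the judge has already produced one verdict per claim; the names `Verdict`, `answer_faithfulness`, and `corpus_faithfulness` are hypothetical and not taken from RAGAS, TruLens, or DeepEval.

```python
from enum import Enum
from statistics import mean
from typing import Optional, Sequence


class Verdict(Enum):
    """Three-way label assigned to each atomic claim by the judge."""
    ENTAILED = "entailed"
    NOT_ENTAILED = "not_entailed"
    CONTRADICTED = "contradicted"


def answer_faithfulness(verdicts: Sequence[Verdict]) -> Optional[float]:
    """Fraction of claims judged entailed; None when the answer has no claims."""
    if not verdicts:
        return None
    entailed = sum(1 for v in verdicts if v is Verdict.ENTAILED)
    return entailed / len(verdicts)


def corpus_faithfulness(per_answer: Sequence[Sequence[Verdict]]) -> Optional[float]:
    """Mean of per-answer scores, skipping answers that produced no claims."""
    scores = [s for s in map(answer_faithfulness, per_answer) if s is not None]
    return mean(scores) if scores else None
```

Returning `None` for an answer with no extracted claims is a design choice made here for illustration: it keeps refusals and empty extractions from being silently scored as either fully faithful or fully unfaithful.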
Atomic-claim extraction is itself an LLM task, and it has two failure modes that bias the metric. Over-decomposition splits a single fact into several derivative claims (“the trial ran for 6 months” + “the trial measured LDL” + “LDL was reduced”) that all redundantly check the same source span; the metric then over-counts faithfulness on this answer because one good span supports three claims. Under-decomposition leaves compound claims merged (“the drug reduced LDL by 15% in a 6-month trial”) so a partial-support span (mentions LDL but not duration) gets marked entailed when it shouldn’t. Robust graders pin a target granularity (“subject-predicate-object triples” or “single quantitative assertions”) and validate the extractor on a small gold set before trusting per-corpus numbers. RAGAS exposes the decomposition prompts so you can audit them; bespoke harnesses should do the same.
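One crude way to validate an extractor against a gold set is to compare claim counts per answer. The sketch below assumes a hand-labelled gold decomposition exists for each answer; `granularity_report` and the 25% `tolerance` are illustrative assumptions, and a count ratio is only a proxy for granularity, not a check that the claims themselves match.

```python
from typing import Dict, List, Sequence


def granularity_report(
    extracted: Sequence[Sequence[str]],
    gold: Sequence[Sequence[str]],
    tolerance: float = 0.25,
) -> Dict[str, List[int]]:
    """Bucket answer indices by how the extracted claim count compares to gold."""
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must cover the same answers")
    report: Dict[str, List[int]] = {"over": [], "under": [], "ok": []}
    for i, (ext, ref) in enumerate(zip(extracted, gold)):
        if not ref:
            raise ValueError(f"gold decomposition for answer {i} is empty")
        ratio = len(ext) / len(ref)
        if ratio > 1 + tolerance:
            report["over"].append(i)    # likely over-decomposition
        elif ratio < 1 - tolerance:
            report["under"].append(i)   # likely under-decomposition
        else:
            report["ok"].append(i)
    return report
```

If most answers land in `over` or `under`, the extractor prompt probably needs a tighter granularity target before per-corpus numbers can be trusted.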
Faithfulness can be checked without a frontier LLM, and increasingly is. A frontier-LLM judge run per claim is expensive ($0.001-0.01 per claim) and slow (200-1000 ms), which makes online faithfulness checking economically painful for high-throughput products. The fix is the same as everywhere else in production AI: distill a small specialized model. A fine-tuned 1-3B parameter encoder trained on (claim, context, label) triples runs at single-digit milliseconds and matches frontier-LLM judges on entailment within 1-2 F1 points. The judge LLM stays for offline eval and harder distributions.
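To see why per-claim judging gets expensive, the per-claim figures above can be scaled to a traffic level. The traffic numbers in the usage note (100,000 answers per day, 5 claims per answer) are made-up assumptions for illustration, and both function names are hypothetical.

```python
def daily_judge_cost_usd(
    answers_per_day: int, claims_per_answer: float, cost_per_claim_usd: float
) -> float:
    """Total judge spend per day if every claim of every answer is checked."""
    return answers_per_day * claims_per_answer * cost_per_claim_usd


def sequential_latency_ms(claims_per_answer: int, latency_per_claim_ms: float) -> float:
    """Added latency per answer if claims are judged one after another."""
    return claims_per_answer * latency_per_claim_ms
```

With those assumed traffic numbers, `daily_judge_cost_usd(100_000, 5, 0.001)` and `daily_judge_cost_usd(100_000, 5, 0.01)` bracket the daily spend at roughly $500 to $5,000, and `sequential_latency_ms(5, 200)` adds about a second per answer unless the claims are judged in parallel.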
Why faithfulness depends on the whole stack
When faithfulness drops, work through the pipeline in diagnostic order:
Check Recall@K of first-pass. If the supporting passage isn’t in the candidate set, no downstream component can save you. Fix with better embeddings, hybrid search, or query rewriting.
Check NDCG@K of the reranker. If the supporting passage is in the top-100 but at position 80, the prompt likely truncated it. A cross-encoder reranker over the candidate set fixes this.
Check the prompt. “Answer using only the provided context. If the context doesn’t contain the answer, say so.” Cuts parametric leakage substantially.
This is why faithfulness is the pipeline-level metric — it tells you the full chain worked, not just one link.
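The two retrieval metrics in the first two checks can be computed as follows. This is a minimal sketch with binary relevance (a passage either supports the answer or it does not); graded relevance would need per-passage gains, and the function names are illustrative.

```python
import math
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Share of the relevant passages that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Binary-relevance NDCG: rewards placing relevant passages near the top."""
    relevant = set(relevant_ids)
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

High Recall@K with low NDCG@K is the signature of the second check: the supporting passage was retrieved but ranked too low to survive prompt truncation.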
The role of citations
A faithful answer paired with explicit per-claim citations gives you something stronger: an auditable answer. Each claim points back to a span in the retrieved documents, and a reviewer (human or model) can verify the trace independently of trusting the LLM. See citation extraction for how that gets implemented as a distinct task.
For high-stakes RAG, faithfulness without citations is necessary but not sufficient — you also need to surface which span supports each claim. Together, they’re what makes RAG defensible.
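A per-claim citation can be represented as a document ID plus character offsets, which makes the trace mechanically checkable. The sketch below is one possible shape, not a standard format; `CitedClaim` and `misaligned_citations` are hypothetical names. It only confirms that the cited span exists where the answer says it does. Whether that span entails the claim is still the judge's job.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CitedClaim:
    claim: str
    doc_id: str
    start: int   # character offset into the document, inclusive
    end: int     # character offset into the document, exclusive
    quote: str   # the text the answer claims to be citing


def misaligned_citations(
    claims: List[CitedClaim], docs: Dict[str, str]
) -> List[CitedClaim]:
    """Return claims whose cited span is missing or does not match the quote."""
    bad = []
    for c in claims:
        text = docs.get(c.doc_id)
        if text is None or text[c.start:c.end] != c.quote:
            bad.append(c)
    return bad
```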
Common unfaithfulness failure modes
Numeric drift: context says “15.2%”, answer says “about 15%” — usually fine, but the entailment grader has to be calibrated for it
Aggregation that goes beyond context: “the three trials all showed reduction” when only two are in the retrieved context
Parametric leakage: an entity is mentioned in the context, but a specific fact about it (founding year, headquarters) comes from training data
Citation misalignment: claim is supported, but cited to the wrong span — passes a coarse faithfulness check, fails an audit
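Numeric drift is the easiest of these to pre-screen without an LLM. The sketch below assumes a relative tolerance of 2% is acceptable rounding, which is an illustrative threshold rather than a recommended one. It also ignores units, signs, and thousands separators, so it is a cheap filter in front of the entailment grader, not a replacement for it.

```python
import math
import re
from typing import List

# Matches plain decimals such as "15" or "15.2"; no signs or thousands separators.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Pull every plain decimal number out of a string."""
    return [float(m) for m in _NUMBER.findall(text)]


def numbers_supported(claim: str, context: str, rel_tol: float = 0.02) -> bool:
    """True if every number in the claim is close to some number in the context."""
    context_numbers = extract_numbers(context)
    return all(
        any(math.isclose(n, c, rel_tol=rel_tol) for c in context_numbers)
        for n in extract_numbers(claim)
    )
```

Under this tolerance, "about 15%" passes against a context containing "15.2%", while "25%" against the same context is flagged for the grader.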
Go further
Faithfulness vs accuracy — what's the difference?
Accuracy asks 'is the answer true?'. Faithfulness asks 'is the answer supported by the documents you retrieved?'. A faithful answer can still be inaccurate (the documents themselves were wrong). An accurate answer can be unfaithful (the model knew it from pretraining, the documents didn't say it). For RAG you usually want both — but only faithfulness is auditable.
To measure faithfulness, decompose the answer into atomic claims, then for each claim, check whether the retrieved context entails it. The check is a three-way classification (entailment / no-entailment / contradiction) typically run by an LLM judge. RAGAS, TruLens, and DeepEval all implement variants of this. Per-answer faithfulness is the fraction of claims that are entailed.
Unfaithful answers have three main causes. The retrieval missed the supporting passage (a Recall@K problem; fix with better first-pass retrieval). The relevant passage made it through but got buried under noise (a chunking and reranker problem). Or the model leaned on parametric knowledge instead of the context (a prompt-engineering problem; fix with explicit instructions to answer from the context only).