Also known as: iterative RAG, agentic retrieval, self-querying RAG
TL;DR
Agentic RAG is RAG where the model decides what to retrieve, reformulates queries, and iterates — instead of a single pre-baked query going to the index.
Agentic RAG is retrieval-augmented generation where the LLM is in charge of what and when to retrieve, instead of being handed a fixed retrieval result. The model can issue multiple queries, refine them based on what came back, decompose a complex question into sub-questions, and stop only when it judges the evidence sufficient.
The contrast with one-shot RAG
Classic RAG is a fixed pipeline: take user query → embed → retrieve top-K → put in prompt → generate. The LLM never gets to influence retrieval. If the user’s phrasing happens to match the index well, this works. If it doesn’t, retrieval returns the wrong documents and the model hallucinates an answer on top of them.
Agentic RAG exposes retrieval as a tool. The model decides:
Whether to retrieve at all
What query to issue (often rewritten from the user’s input)
Whether the result is enough to answer, or whether to retrieve again with a refined query
How to combine multiple rounds of evidence into one answer
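The contrast between the fixed pipeline and retrieval-as-a-tool can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `CORPUS`, `retrieve`, `Step`, and `scripted_policy` are hypothetical names, the retriever is plain keyword overlap rather than embedding search, and the scripted policy stands in for the decisions an LLM would make.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy corpus standing in for a real index.
CORPUS = {
    "p95 response time optimization": "Cache hot queries and batch embedding calls.",
    "pricing change history": "Pricing changes are recorded in the billing log.",
}

def retrieve(query: str) -> list[str]:
    """Keyword-overlap retrieval; a stand-in for embedding search."""
    terms = set(query.lower().split())
    return [text for key, text in CORPUS.items() if terms & set(key.split())]

@dataclass
class Step:
    action: str       # "retrieve" or "answer"
    query: str = ""   # only used when action == "retrieve"

def one_shot_rag(user_query: str) -> list[str]:
    """Fixed pipeline: the user's wording goes straight to the index."""
    return retrieve(user_query)

def agentic_rag(user_query: str,
                decide: Callable[[str, list[str]], Step],
                max_rounds: int = 3) -> tuple[list[str], list[str]]:
    """Retrieval as a tool: `decide` picks the next action each round."""
    evidence: list[str] = []
    issued: list[str] = []
    for _ in range(max_rounds):          # hard cap on rounds
        step = decide(user_query, evidence)
        if step.action == "answer":      # policy judges evidence sufficient
            break
        issued.append(step.query)
        evidence.extend(d for d in retrieve(step.query) if d not in evidence)
    return evidence, issued

def scripted_policy(user_query: str, evidence: list[str]) -> Step:
    """Assumption: a real system would ask an LLM here."""
    if evidence:
        return Step("answer")
    if "latency" in user_query.lower():
        return Step("retrieve", "p95 response time optimization")
    return Step("retrieve", user_query)
```

With the query "fixing latency", `one_shot_rag` finds nothing because no word matches the corpus keys, while `agentic_rag` rewrites the query once, retrieves the caching document, and stops on the second round.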
What this buys you
Multi-hop questions. “Which of our customers in the EU signed before our pricing change?” needs a customer-list lookup and a pricing-history lookup, then a join. One-shot retrieval can’t do this; an agent that can call retrieval twice can.
Vague queries. When the user’s wording doesn’t match the corpus, the model can rephrase (“the user said ‘fixing latency’ — let me search ‘p95 response time optimization’”) instead of returning irrelevant top-K and hallucinating.
Self-correction. First retrieval came back empty or off-topic? Try a different query. One-shot pipelines just plow forward with bad evidence.
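The multi-hop example above reduces to two lookups followed by a join. The sketch below shows that shape with invented data: the customer records, the pricing date, and the function names `retrieve_customers` and `retrieve_pricing_change` are all hypothetical, and in a real agent each lookup would be a separate retrieval tool call.

```python
from datetime import date

# Hypothetical records; a real system would pull these from two indexes.
CUSTOMERS = [
    {"name": "Acme GmbH", "region": "EU", "signed": date(2023, 1, 10)},
    {"name": "Borealis AB", "region": "EU", "signed": date(2023, 9, 2)},
    {"name": "Cascade Inc", "region": "US", "signed": date(2022, 11, 5)},
]
PRICING_HISTORY = [{"event": "pricing change", "effective": date(2023, 6, 1)}]

def retrieve_customers(region: str) -> list[dict]:
    """First hop: customer-list lookup."""
    return [c for c in CUSTOMERS if c["region"] == region]

def retrieve_pricing_change() -> date:
    """Second hop: pricing-history lookup."""
    return PRICING_HISTORY[0]["effective"]

def eu_customers_before_pricing_change() -> list[str]:
    """Join the two hops: keep EU customers signed before the change."""
    customers = retrieve_customers("EU")
    cutoff = retrieve_pricing_change()
    return [c["name"] for c in customers if c["signed"] < cutoff]
```

Neither lookup alone answers the question; the answer only exists after the second result is used to filter the first, which is why a single top-K retrieval cannot produce it.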
What it costs
Latency. Every question now costs multiple retrieval rounds plus multiple model calls. A one-shot pipeline that answers in 200 ms can easily take 2–5 seconds once it iterates.
Token spend. Each round adds intermediate reasoning and tool-call overhead.
Loop discipline. Without bounds, the model can keep retrieving forever in pursuit of marginal certainty. Production systems cap rounds (often 3–5) and have explicit “good enough” criteria.
Why retrieval quality matters more here, not less
Naive intuition says “if the model can iterate, retrieval doesn’t have to be great on the first shot.” Wrong direction. Each round in an agentic loop reads the previous round’s results to decide its next query. If those results are noisy, the model’s next query is informed by noise. Errors compound.
Each iteration of an agentic loop reads the previous iteration’s noise. Bad retrieval doesn’t average out — it amplifies.
A calibrated reranker with stable thresholds is more, not less, important in agentic RAG. It gives the model a trustworthy “is this actually relevant?” signal it can use to decide whether to keep iterating or stop. A calibrated, instruction-following reranker that runs in sub-100ms is the right shape for this slot.
Agentic RAG sits at the intersection of three patterns that get conflated. Query rewriting reformulates the user’s input into a better retrieval query — synonym expansion, removing chit-chat, adding domain vocabulary. Query decomposition breaks a complex question into sub-questions (“Which EU customers signed before our pricing change?” → “list EU customers” + “list pricing change date” + “filter by signed date”). And the iterative-refinement pattern reissues a query after seeing partial results.
Production-grade agentic RAG composes all three. The first round usually rewrites and decomposes; later rounds refine based on what came back. The agent’s prompt has to specify when to use which — otherwise the model conflates them, decomposing when it should be refining or vice versa.
The cost surface differs sharply too. Decomposition multiplies retrieval cost (N sub-queries instead of one). Refinement multiplies model calls (N iterations instead of one). Rewriting is essentially free. Production systems that survive measure each one’s contribution to end-to-end quality and prune ruthlessly — most queries don’t need decomposition, and unnecessary refinement just inflates latency.
The naive stopping rule, “stop when the model says it’s done”, fails for the same reason it fails in any agent loop. The model can confidently declare itself done when it shouldn’t be, or keep retrieving long after the evidence is already sufficient.
Three external signals work.
Confidence on a calibrated relevance score. If the top-K reranker scores from this round are all above a threshold, the agent has high-quality evidence; if they’re all below, more retrieval probably won’t help (the corpus doesn’t contain the answer).
Coverage on decomposed sub-questions. Track which sub-questions the agent enumerated and stop when each has at least one supporting document.
Marginal-gain check. Compare this round’s top-K against the prior round’s; if the new candidates add no information that isn’t already present, the iteration has saturated.
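The three signals can be sketched as small, independent checks. Everything here is an assumption made for illustration: the `Candidate` type, the function names, and the thresholds 0.7 and 0.2 are placeholders rather than recommended values, and the marginal-gain check approximates "new information" by document ID, which is cruder than comparing content.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    score: float  # assumed: calibrated relevance in [0, 1] from a reranker

def confidence_signal(cands: list[Candidate],
                      high: float = 0.7, low: float = 0.2) -> str:
    """All scores high -> sufficient; all low -> corpus likely lacks the answer."""
    if not cands:
        return "continue"          # empty round: try a different query
    if all(c.score >= high for c in cands):
        return "sufficient"
    if all(c.score < low for c in cands):
        return "unanswerable"
    return "continue"

def covered(support: dict[str, list[str]]) -> bool:
    """True when every enumerated sub-question has a supporting document."""
    return bool(support) and all(support.values())

def saturated(seen_ids: set[str], cands: list[Candidate]) -> bool:
    """True when this round surfaced only documents seen in prior rounds."""
    return bool(cands) and all(c.doc_id in seen_ids for c in cands)

def should_stop(cands: list[Candidate],
                support: dict[str, list[str]],
                seen_ids: set[str]) -> tuple[bool, str]:
    """Combine the three signals; returns (stop?, reason)."""
    verdict = confidence_signal(cands)
    if verdict != "continue":
        return True, verdict
    if covered(support):
        return True, "covered"
    if saturated(seen_ids, cands):
        return True, "saturated"
    return False, "continue"
```

Returning the reason alongside the decision lets the caller distinguish "answer now" from "tell the user the corpus does not contain this", which are both stops but lead to different responses.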
The harder problem is the cost ceiling. Agentic RAG without a hard step budget will sometimes spend ten model calls and a dollar on a query that an LLM with one well-formed retrieval round would have answered for two cents. A budget cap (at most 3–5 retrieval rounds, a maximum number of output tokens, a wall-clock limit) is the brute-force backstop and should always be wired in, regardless of the smarter signals above.
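A budget cap of this kind can be a small object the loop consults before each round. The sketch below is one possible shape, not a prescribed design: the `Budget` class, its field names, and the default limits are hypothetical, and the clock is injectable only so the behaviour can be tested without sleeping.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Budget:
    max_rounds: int = 4                 # assumed default, within the 3-5 range
    max_output_tokens: int = 2000       # placeholder limit
    max_seconds: float = 10.0           # placeholder wall-clock limit
    clock: Callable[[], float] = time.monotonic
    rounds: int = 0
    output_tokens: int = 0
    started: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.started = self.clock()     # start the wall-clock timer

    def charge(self, output_tokens: int) -> None:
        """Record one completed round and the tokens it generated."""
        self.rounds += 1
        self.output_tokens += output_tokens

    def exhausted(self) -> Optional[str]:
        """Return which limit was hit, or None if the loop may continue."""
        if self.rounds >= self.max_rounds:
            return "rounds"
        if self.output_tokens >= self.max_output_tokens:
            return "tokens"
        if self.clock() - self.started >= self.max_seconds:
            return "wall_clock"
        return None
```

A loop would call `budget.exhausted()` at the top of every iteration and `budget.charge(...)` after each model call, so the cap applies even when the smarter stopping signals never fire.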
Go further
When is one-shot RAG enough?
Short, well-formed questions over a tight corpus where one retrieval surfaces the answer. Most FAQ bots, single-document Q&A, and template-driven lookups don't need iteration. Reach for agentic RAG when queries are vague, span multiple sub-questions, or require comparing across sources.
How do I keep an iterative loop from spinning forever?
Use step budgets, retrieval-cost ceilings, and a “good enough” check after each round. Combine these with reflection: the model evaluates whether the current evidence is sufficient before deciding to retrieve again. Without bounds, agentic RAG happily burns tokens chasing diminishing returns.
Each retrieval round produces candidates that need reordering before the model sees them; otherwise iteration just compounds first-pass noise. A calibrated reranker also gives you a confidence score the agent can use to decide whether to retrieve again or call it done.