Agentic RAG

Also known as: iterative RAG, agentic retrieval, self-querying RAG

TL;DR

Agentic RAG is RAG where the model decides what to retrieve, reformulates queries, and iterates — instead of a single pre-baked query going to the index.

Agentic RAG is where the LLM is in charge of what and when to retrieve, instead of being handed a fixed retrieval result. The model can issue multiple queries, refine them based on what came back, decompose a complex question into sub-questions, and stop only when it judges the evidence sufficient.

[Figure: Agentic RAG. Retrieval becomes a tool the model calls until the evidence is enough.
One-shot RAG (fixed pipeline · one retrieval · no inspection): Query (one phrasing) → Retrieve (top-K once) → Model (splice + answer) → Answer (whatever came back).
Agentic RAG (model picks query · iterates until calibrated score crosses): Query (vague) → Retrieve (this round) → Model (inspect hits) → Reflect (enough?) → Answer (cited + final), with a rewrite · re-query loop.]

The contrast with one-shot RAG

Classic RAG is a fixed pipeline: take user query → embed → retrieve top-K → put in prompt → generate. The LLM never gets to influence retrieval. If the user’s phrasing happens to match the index well, this works. If not — wrong retrieval and an answer hallucinated on top of it.
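As a rough illustration of that fixed pipeline, here is a minimal sketch. The corpus, the word-overlap retriever (a stand-in for embedding search), and the echoing "model" are all hypothetical toys, not any real library's API. The point is only that the query reaches the index unchanged and nothing inspects what came back.

```python
from typing import Callable, List

# Hypothetical three-document corpus, for illustration only.
CORPUS = [
    "quarterly pricing change history",
    "p95 response time optimization guide",
    "EU customer onboarding checklist",
]

def keyword_retrieve(query: str, k: int) -> List[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def echo_generate(prompt: str) -> str:
    """Toy 'model': returns the prompt so the spliced context stays visible."""
    return prompt

def one_shot_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 1,
) -> str:
    passages = retrieve(query, k)  # the only retrieval; the query is never revised
    prompt = "Context:\n" + "\n".join(passages) + "\n\nQuestion: " + query
    return generate(prompt)        # the model answers from whatever came back
```

In this toy, "p95 response time" pulls the optimization guide, while "fixing latency" shares no words with any document, so the pipeline splices in an unrelated pricing document and carries on.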

Agentic RAG exposes retrieval as a tool. The model decides:

  • Whether to retrieve at all
  • What query to issue (often rewritten from the user’s input)
  • Whether the result is enough to answer, or whether to retrieve again with a refined query
  • How to combine multiple rounds of evidence into one answer
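The decisions listed above can be sketched as a small loop in which retrieval is a callable the model chooses to invoke. Everything here is an assumption made for illustration: `decide` stands in for an LLM call, the action dictionaries are a made-up format, and `scripted_decide` hard-codes the rephrasing a real model would have to come up with itself.

```python
from typing import Callable, Dict, List, Tuple

# Assumed action format: {"type": "retrieve", "query": ...} or {"type": "answer", "text": ...}
Action = Dict[str, str]

def agentic_rag(
    question: str,
    decide: Callable[[str, List[str], List[str]], Action],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 4,
) -> Tuple[str, List[str]]:
    """Let `decide` choose between retrieving again and answering, up to max_steps."""
    evidence: List[str] = []
    queries: List[str] = []
    for _ in range(max_steps):
        action = decide(question, evidence, queries)
        if action["type"] == "answer":
            return action["text"], queries
        queries.append(action["query"])
        for passage in retrieve(action["query"]):
            if passage not in evidence:  # combine rounds without duplicates
                evidence.append(passage)
    return "No answer within the step budget.", queries

# Hypothetical index keyed by exact query string, for illustration only.
INDEX = {"p95 response time optimization": ["Cache hot keys to cut p95 response time."]}

def toy_retrieve(query: str) -> List[str]:
    return INDEX.get(query, [])

def scripted_decide(question: str, evidence: List[str], queries: List[str]) -> Action:
    """Stand-in for the model: try the user's wording first, then a rephrasing."""
    if evidence:
        return {"type": "answer", "text": "Based on: " + evidence[0]}
    if not queries:
        return {"type": "retrieve", "query": question}
    return {"type": "retrieve", "query": "p95 response time optimization"}
```

Calling `agentic_rag("fixing latency", scripted_decide, toy_retrieve)` issues the user's wording, gets nothing back, retries with the rephrased query, and then answers from the retrieved passage.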

What this buys you

  • Multi-hop questions. “Which of our customers in the EU signed before our pricing change?” needs a customer-list lookup and a pricing-history lookup, then a join. One-shot retrieval can’t do this; an agent that can call retrieval twice can.
  • Vague queries. When the user’s wording doesn’t match the corpus, the model can rephrase (“the user said ‘fixing latency’ — let me search ‘p95 response time optimization’”) instead of returning irrelevant top-K and hallucinating.
  • Self-correction. First retrieval came back empty or off-topic? Try a different query. One-shot pipelines just plow forward with bad evidence.

What it costs

  • Latency. Multiple retrieval rounds + multiple model calls per question. A 200ms one-shot RAG becomes 2–5 seconds of agentic RAG easily.
  • Token spend. Each round adds intermediate reasoning and tool-call overhead.
  • Loop discipline. Without bounds, the model can keep retrieving forever in pursuit of marginal certainty. Production systems cap rounds (often 3–5) and have explicit “good enough” criteria.

Why retrieval quality matters more here, not less

Naive intuition says “if the model can iterate, retrieval doesn’t have to be great on the first shot.” Wrong direction. Each round in an agentic loop reads the previous round’s results to decide its next query. If those results are noisy, the model’s next query is informed by noise. Errors compound.

Each iteration of an agentic loop reads the previous iteration’s noise. Bad retrieval doesn’t average out — it amplifies.

A calibrated reranker with stable thresholds is more, not less, important in agentic RAG. It gives the model a trustworthy “is this actually relevant?” signal it can use to decide whether to keep iterating or stop. A calibrated, instruction-following reranker that runs in sub-100ms is the right shape for this slot.
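One way to picture that slot is a per-round step that scores, filters, and reorders candidates before the model reads them. This is a sketch under loud assumptions: `overlap_score` is a toy word-overlap function standing in for a real reranker and is not calibrated, and the `min_score` and `top_n` values are arbitrary.

```python
from typing import Callable, List, Tuple

def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the passage.
    A real reranker would be a model; this score is NOT calibrated."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def rerank_round(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    min_score: float = 0.5,
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Score every candidate, drop those under min_score, return the best top_n."""
    scored = [(score(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_n]
```

The returned scores travel with the passages, so the agent can look at them when deciding whether the round produced usable evidence.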

Agentic RAG sits at the intersection of three patterns that get conflated. Query rewriting reformulates the user’s input into a better retrieval query: synonym expansion, removing chit-chat, adding domain vocabulary. Query decomposition breaks a complex question into sub-questions (“Which EU customers signed before our pricing change?” → “list EU customers” + “list pricing change date” + “filter by signed date”). And the iterative-refinement pattern reissues a query after seeing partial results.
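The decomposition example can be made concrete with a toy version of the three sub-steps. The customer records and the pricing-change date below are invented for illustration, and each function stands in for what would be a separate retrieval call in a real system.

```python
from datetime import date
from typing import Dict, List

# Made-up records, for illustration only.
CUSTOMERS: List[Dict] = [
    {"name": "Acme GmbH", "region": "EU", "signed": date(2023, 3, 1)},
    {"name": "Borealis AB", "region": "EU", "signed": date(2024, 6, 15)},
    {"name": "Cascade Inc", "region": "US", "signed": date(2022, 11, 20)},
]
PRICING_CHANGE = date(2024, 1, 1)

def list_eu_customers() -> List[Dict]:
    """Sub-question 1: list EU customers (stands in for a customer-list lookup)."""
    return [c for c in CUSTOMERS if c["region"] == "EU"]

def pricing_change_date() -> date:
    """Sub-question 2: find the pricing change date (stands in for a history lookup)."""
    return PRICING_CHANGE

def eu_customers_signed_before_change() -> List[str]:
    """Sub-question 3: join the two results by filtering on signed date."""
    cutoff = pricing_change_date()
    return [c["name"] for c in list_eu_customers() if c["signed"] < cutoff]
```

Neither lookup alone answers the question; the answer only appears once both results are joined, which is why a single retrieval round struggles with this shape of query.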

Production-grade agentic RAG composes all three. The first round usually rewrites and decomposes; later rounds refine based on what came back. The agent’s prompt has to specify when to use which — otherwise the model conflates them, decomposing when it should be refining or vice versa.

The cost surface differs sharply too. Decomposition multiplies retrieval cost (N sub-queries instead of one). Refinement multiplies model calls (N iterations instead of one). Rewriting is essentially free. Production systems that survive measure each one’s contribution to end-to-end quality and prune ruthlessly — most queries don’t need decomposition, and unnecessary refinement just inflates latency.
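A back-of-the-envelope way to see how those costs multiply is to count calls. The model below is a deliberate simplification and an assumption of this sketch: every round retrieves once per sub-query and costs one model call, and rewriting is treated as free. The unit prices are placeholders you would replace with your own measurements.

```python
from typing import Dict

def call_counts(sub_queries: int = 1, rounds: int = 1) -> Dict[str, int]:
    """Count calls assuming one retrieval per sub-query per round
    and one model call per round (a simplification)."""
    if sub_queries < 1 or rounds < 1:
        raise ValueError("sub_queries and rounds must be at least 1")
    return {"retrieval_calls": sub_queries * rounds, "model_calls": rounds}

def estimate_cost(counts: Dict[str, int], retrieval_unit: float, model_unit: float) -> float:
    """Total cost given placeholder per-call prices."""
    return counts["retrieval_calls"] * retrieval_unit + counts["model_calls"] * model_unit
```

Under these assumptions `call_counts(3, 1)` triples retrieval calls while leaving model calls at one, and `call_counts(1, 4)` quadruples both, which is one way to decide where pruning pays off first.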

The naive stopping rule, “stop when the model says it’s done”, fails for the same reason it fails in any agent loop. The model can be confidently done when it shouldn’t be, or keep retrieving in pursuit of marginal certainty when the evidence is already sufficient.

Three external signals work:

  • Confidence on a calibrated relevance score. If the top-K reranker scores from this round are all above a threshold, the agent has high-quality evidence; if they’re all below, more retrieval probably won’t help (the corpus doesn’t contain the answer).
  • Coverage on decomposed sub-questions. Track which sub-questions the agent enumerated and stop when each has at least one supporting document.
  • Marginal-gain check. Compare this round’s top-K against the prior round’s; if the new candidates don’t add information not already present, the iteration has saturated.
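The three signals can be sketched as small, independent checks. The function names, the separate `high` and `low` thresholds, and the use of document IDs as a proxy for "new information" are assumptions of this sketch rather than anything prescribed above; a real marginal-gain check might compare content instead of IDs.

```python
from typing import Dict, List, Set

def confidence_signal(scores: List[float], high: float, low: float) -> str:
    """Classify a round by its top-K reranker scores.
    'sufficient' if all are at or above `high`, 'hopeless' if all are below `low`,
    otherwise 'unclear'. An empty round is 'unclear' (try another query)."""
    if not scores:
        return "unclear"
    if all(s >= high for s in scores):
        return "sufficient"
    if all(s < low for s in scores):
        return "hopeless"
    return "unclear"

def coverage_complete(support: Dict[str, List[str]]) -> bool:
    """True when every enumerated sub-question has at least one supporting document."""
    return bool(support) and all(len(docs) > 0 for docs in support.values())

def saturated(previous_ids: Set[str], current_ids: Set[str]) -> bool:
    """True when this round surfaced no document ID that was not already seen."""
    return len(current_ids - previous_ids) == 0
```

A loop might stop on "sufficient", "hopeless", complete coverage, or saturation, and keep retrieving only while the result is "unclear" and new documents are still arriving.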

The harder problem is the cost ceiling. Agentic RAG without a hard step budget will sometimes spend ten model calls and a dollar on a query that an LLM with one well-formed retrieval round would have answered for two cents. A budget cap (max 3–5 retrieval rounds, max N output tokens, max wall-clock time) is the brute-force backstop and should always be wired in, regardless of the smarter signals above.
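A budget cap of that kind can be a few lines of bookkeeping checked before every round. The class name, field names, and default limits below are hypothetical, chosen only to mirror the three caps named above.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Budget:
    """Hard backstop: rounds, output tokens, and wall-clock time (defaults are arbitrary)."""
    max_rounds: int = 4
    max_output_tokens: int = 2000
    max_seconds: float = 10.0
    rounds: int = 0
    output_tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, output_tokens: int) -> None:
        """Record one completed round and the tokens it produced."""
        self.rounds += 1
        self.output_tokens += output_tokens

    def exhausted(self) -> bool:
        """True once any single limit has been reached."""
        elapsed = time.monotonic() - self.started
        return (
            self.rounds >= self.max_rounds
            or self.output_tokens >= self.max_output_tokens
            or elapsed >= self.max_seconds
        )
```

In a loop this would typically read `while not budget.exhausted(): ...` with `budget.charge(...)` after each round, so the cap holds even when the smarter stopping signals never fire.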

Go further

When is one-shot RAG enough?

Short, well-formed questions over a tight corpus where one retrieval surfaces the answer. Most FAQ bots, single-document Q&A, and template-driven lookups don't need iteration. Reach for agentic RAG when queries are vague, span multiple sub-questions, or require comparing across sources.

How do I keep an iterative loop from spinning forever?

Step budgets, retrieval-cost ceilings, and a 'good enough' check after each round. Combine with reflection — the model evaluates whether the current evidence is sufficient before deciding to retrieve again. Without bounds, agentic RAG happily burns tokens chasing diminishing returns.

Where does reranking fit in agentic RAG?

Each retrieval round produces candidates that need reordering before the model sees them — otherwise iteration just compounds first-pass noise. A calibrated reranker also gives you a 'confidence score' the agent can use to decide whether to retrieve again or call it done.
