Here is a scenario that arises again and again on AI teams.

The team builds a RAG-powered support bot. It retrieves documents from the knowledge base, passes them to the model, and the model answers.

In testing, it looks solid. Users start asking questions, and something unexpected happens: ambiguous questions get overconfident answers.

Multi-part queries get responses that only address one part.

When the knowledge base doesn't contain the answer, the model doesn't say so. It gives something plausible-sounding from whatever it retrieved.

Classic RAG: one retrieval pass, then generate. Clean in theory; brittle in practice.

Everyone assumed retrieval was the hard part.

But it turns out that a single retrieval pass isn't enough.

This gap, between what classic RAG promises and what it actually delivers in production, is why agentic RAG exists.

Let’s dig in!

What Classic RAG Actually Does

If you read the 2020 RAG paper from Meta, the concept is elegant. An LLM has parametric memory, knowledge baked into its weights during training.

RAG adds non-parametric memory, a live, searchable knowledge base that the model can consult at inference time. The pipeline looks like this:

A user asks a question, the system embeds that question into a vector, searches a vector database for similar chunks of text, retrieves the top-k results, puts them into the prompt, and asks the LLM to generate an answer using that context.

One retrieval pass and a generation step. That's it.
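The whole pipeline fits in a few lines. Here is a minimal sketch, where `embed`, `vector_db`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and LLM client:

```python
# Minimal sketch of the classic RAG pipeline: embed, retrieve once, generate once.
# `embed`, `vector_db.search`, and `llm.generate` are hypothetical stand-ins,
# not a specific library's API.

def classic_rag(question, embed, vector_db, llm, top_k=5):
    query_vector = embed(question)                    # embed the question
    chunks = vector_db.search(query_vector, k=top_k)  # one retrieval pass
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.generate(prompt)                       # one generation step
```

Note what is missing: nothing checks whether `chunks` are relevant, and nothing runs twice.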

It mitigates hallucinations caused by stale training data. Instead of guessing from what it learned months ago, the model grounds its answer in actual documents you control.

But the pipeline is linear and static. The model asks, retrieves, and generates once. There is no check to see if what it retrieved was actually relevant.

No agent to notice when the question was too vague to retrieve anything useful. No way to go back and look harder if the first pass came up short.

For simple, well-scoped questions against a clean knowledge base, this works fine. For anything more complex, the seams start to show.

Why One Pass Isn't Enough

Think about how a good researcher handles an ambiguous question.

You ask: "What are the risks of our current deployment strategy?" A junior researcher searches "deployment risks," pulls the first five documents, and writes a summary.

But a senior researcher pauses. They notice the question is incomplete.

They search for deployment risks in general, your infrastructure, and recent incidents that might be relevant. They check if what they found actually answers the question.

If something is missing, they search again.

Classic RAG is the junior researcher. It doesn't pause, check, or try again.

The failure modes that surface in production are predictable:

  • Single-hop retrieval fails multi-step questions. A question like "How did the decision to change our pricing model affect churn last quarter?" requires retrieving from multiple locations, like pricing history and churn data. Classic RAG pulls one batch of documents and tries to make it work.

  • No quality check on what was retrieved. The retrieval step can return irrelevant, outdated, or contradictory information. Classic RAG passes it all to the model regardless, and the model generates an answer even when the context doesn't actually support one.

  • Query-document mismatch. The way users phrase questions often doesn't match how documents are written. A user might ask, "Why did my order fail?" when the relevant document talks about "payment processing errors." Embedding similarity helps, but when it fails, there's no recovery path.

  • Context window ceiling. Classic RAG retrieves a fixed number of chunks. For complex topics that span multiple documents, you are forced to choose either retrieving more (and risk overwhelming the model with noise) or retrieving less (and risk missing what matters).

Agentic RAG: The Same Idea, But With a Loop

From linear chains to state machines: how agentic RAG introduces decision points and loops into retrieval.

Agentic RAG doesn't throw out the core idea of RAG.

It wraps an agent around the retrieval process so that retrieval becomes a reasoned, iterative activity rather than a fixed pipeline step.

Instead of the system deciding upfront what to retrieve and how much, the agent decides, and keeps deciding, as the task unfolds.

We can see the difference in what questions get asked at each stage. Classic RAG asks, "What documents are most similar to this query?"

But agentic RAG also asks:

  • "Is this query specific enough to retrieve well, or should I rewrite it first?"

  • "Did what I retrieved actually answer the question?"

  • "Is the answer I'm about to generate supported by my sources?"

  • "Do I need to retrieve again from a different angle?"

That shift, from retrieval as a pipeline to retrieval as a decision, is what makes agentic RAG significantly more robust on complex tasks.

Four Patterns Worth Knowing

Researchers and engineers have been building specific versions of agentic RAG, each targeting a different failure mode. Here are the four most practically relevant patterns:

Pattern 1: Iterative RAG is the simplest upgrade.

After retrieving and generating a first-pass answer, the agent checks whether the answer is complete and well-supported.

If not, it retrieves again, with a reformulated query, and refines. The loop runs until a quality threshold is met or a maximum iteration count is hit.
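That loop can be sketched in a few lines. The four callables here are hypothetical; in practice `answer_is_supported` and `rewrite_query` would typically be LLM-based checks:

```python
# Sketch of the iterative RAG loop: retrieve, generate, check, and
# reformulate until the answer is supported or the iteration cap is hit.
# `retrieve`, `generate`, `answer_is_supported`, and `rewrite_query`
# are hypothetical callables, not a specific framework's API.

def iterative_rag(question, retrieve, generate, answer_is_supported,
                  rewrite_query, max_iterations=3):
    query = question
    answer = None
    for _ in range(max_iterations):
        docs = retrieve(query)
        answer = generate(question, docs)
        if answer_is_supported(answer, docs):    # quality threshold met
            break
        query = rewrite_query(question, answer)  # reformulate and retry
    return answer
```

The `max_iterations` cap matters: without it, a query the knowledge base simply cannot answer would loop forever.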

Pattern 2: Self-RAG takes this further by training the model itself to generate special "reflection tokens" as it works.

These tokens are self-annotations: "should I retrieve for this?", "is this retrieved document relevant?", "is my output supported by what I retrieved?".

Self-RAG's decision graph: the model checks whether to retrieve, whether retrieved content is relevant, and whether the output is actually supported at every step.

The model learns to interrogate its own process in real time and can skip retrieval when it isn't needed, which makes it more accurate on hard queries and cheaper on easy ones.
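The control flow those reflection tokens produce looks roughly like this. This is a simplified approximation with explicit checks; in the actual Self-RAG paper, the trained model emits the reflection tokens itself, and all five callables here are hypothetical:

```python
# Approximate control flow of Self-RAG's reflection decisions, written as
# explicit checks instead of trained reflection tokens. In the paper, the
# model itself emits tokens like [Retrieve], [IsRel], and [IsSup] to make
# these calls inline. All callables are hypothetical stand-ins.

def self_rag(question, needs_retrieval, retrieve, is_relevant,
             generate, is_supported):
    if not needs_retrieval(question):        # "should I retrieve for this?"
        return generate(question, [])        # skip retrieval entirely
    docs = [d for d in retrieve(question)
            if is_relevant(question, d)]     # "is this document relevant?"
    answer = generate(question, docs)
    if not is_supported(answer, docs):       # "is my output supported?"
        answer = generate(question, docs)    # simplification: one retry
    return answer
```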

Pattern 3: Corrective RAG (CRAG) tackles retrieval quality head-on. A lightweight evaluator scores each retrieved document against the query as correct, ambiguous, or incorrect.

Documents scored "correct" go through a knowledge refinement step to remove noise. Documents scored "incorrect" or "ambiguous" trigger a web search.

The system then reformulates the query and pulls fresh external content before the LLM ever sees it. It never generates from context that it has already flagged as bad.

Corrective RAG (CRAG): retrieved documents are scored before generation. Poor-quality retrieval triggers a web search fallback.
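The triage step is the heart of CRAG. A rough sketch, where `score_document` stands in for the lightweight evaluator and the other callables (`refine`, `rewrite_for_web`, `web_search`, `generate`) are hypothetical:

```python
# Sketch of Corrective RAG's triage: score every retrieved document,
# refine the good ones, and fall back to web search when retrieval
# looks unreliable. All callables are hypothetical stand-ins.

def corrective_rag(question, retrieve, score_document, refine,
                   rewrite_for_web, web_search, generate):
    docs = retrieve(question)
    scores = [score_document(question, d) for d in docs]
    # keep only documents flagged "correct", stripped of noise
    context = [refine(d) for d, s in zip(docs, scores) if s == "correct"]
    if any(s in ("incorrect", "ambiguous") for s in scores):
        # reformulate the query and pull fresh external content
        context += web_search(rewrite_for_web(question))
    return generate(question, context)
```

The key property: flagged-bad context never reaches the generation step.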

Pattern 4: Adaptive RAG takes a different approach. Rather than always running the full agentic loop, it first classifies each query by complexity.

Simple factual questions skip retrieval entirely. Moderately complex queries get a single retrieval pass. Only difficult, multi-step questions get the full iterative treatment.

A small classifier model makes the routing decision.

The practical benefit is that it provides lower latency and cost for the majority of queries that don't require the full treatment, while still getting strong results on the hard ones.
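The routing logic itself is simple. In this sketch, `classify_complexity` stands in for the small classifier model, and the three handlers are hypothetical:

```python
# Sketch of Adaptive RAG's routing step. `classify_complexity` stands in
# for the small classifier model; the three handlers are hypothetical
# implementations of the no-retrieval, single-pass, and agentic paths.

def adaptive_rag(question, classify_complexity, answer_directly,
                 single_pass_rag, full_agentic_rag):
    complexity = classify_complexity(question)  # "simple" | "moderate" | "complex"
    if complexity == "simple":
        return answer_directly(question)        # no retrieval at all
    if complexity == "moderate":
        return single_pass_rag(question)        # one classic RAG pass
    return full_agentic_rag(question)           # full iterative loop
```

All of the cost savings live in that one classification call: most traffic never touches the expensive path.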

The Tradeoffs Are Real

Agentic RAG generates better answers for complex questions. That's not in dispute. But it comes with tradeoffs that matter for product decisions.

  • Latency. Every additional retrieval pass takes time. For a document Q&A tool where users are willing to wait, an agentic loop is fine. For a real-time customer support bot, added seconds may not be acceptable.

  • Cost. More retrieval passes lead to more embedding computations, vector database calls, and tokens through the LLM. An agentic loop averaging three retrieval rounds roughly triples your retrieval costs compared to classic RAG.

  • Complexity and observability. Agentic systems have more moving parts (evaluators, routers, reflection steps), and each is a new failure mode. When a classic RAG answer is wrong, you can trace it. When an agentic RAG answer is wrong, it is harder to reconstruct. Multiple retrieval rounds and intermediate evaluations must be logged and interpretable.

These aren't reasons to avoid agentic RAG.

They are reasons to be deliberate about when to use it.

When It Makes Sense

The practical decision for engineering teams comes down to query complexity.

For simple, single-hop questions against a well-maintained knowledge base, classic RAG is the right choice. It's fast, cheap, and debuggable.

For questions that require reasoning across multiple sources, that are often underspecified, or where the cost of a wrong answer is high, agentic patterns win.

Iterative RAG wins for a legal research assistant, a code assistant pulling context from multiple files, and a financial analysis tool synthesising across data sources.

Adaptive RAG is the default for systems that serve both query types. The classifier routes simple queries through classic RAG and sends hard ones through the full loop.

The more important point is that agentic RAG is not a single thing you switch on.

It is a set of composable decisions: should you add a retrieval quality check? Should you rewrite bad queries? Should you loop? Should you route to web search when internal sources fail?

Each can be added incrementally, driven by the actual failure modes you see in production. Nobody ships full Self-RAG on day one.

Teams typically start with classic RAG, add a relevance check when they see retrieval quality problems, add query rewriting when they see query-document mismatch, and add adaptive routing when they want to optimise for cost.

The architecture grows from the failures, not from a blueprint.

The question teams are asking now isn't "should we use RAG?"

It's "which retrieval decisions should the system make autonomously, and where does that autonomy introduce unacceptable risk?"

That's a different kind of product question than the ones RAG raised originally. And it's a much more interesting one.
