Is RAG Really Dead?

What is RAG?

Much of this blog post came from these sources:

  1. YouTube video
  2. X Thread
  3. Google Slides

RAG, or Retrieval-Augmented Generation, is a way for an LLM to retrieve data from an external source, combine it with its own trained knowledge, and better answer a user's questions. It is useful when the model lacks strong domain knowledge, or when the knowledge in question changes regularly (e.g., an internal knowledge base, or a fast-moving industry).

This Nvidia blog post uses the courtroom as an analogy. Imagine a judge who hears cases based on a general understanding of the law. But in this case, the relevant law is not part of that general understanding, so he or she sends a clerk to do some research in a law library. The clerk finds precedents and specific cases that the judge can use.

Like a good judge, LLMs can respond to a wide variety of human queries. However, to deliver authoritative answers and cite specific sources, the model may need some research assistance. Tying this back to the courtroom example, RAG is the court clerk.

RAG works by adding another data source to the LLM: the model uses what it "knows" from training, plus the documents retrieved from that source, to answer specific questions and provide citations. The citations work much like footnotes, letting the end user find additional information and validate the LLM's answers.
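To make that concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. The keyword-overlap retrieval, the call_llm stub, and the sample policy snippets are illustrative placeholders only (a real system would use embeddings or a search index and a real LLM API), not any specific product's implementation.

```python
# Minimal RAG sketch: retrieve relevant passages, then ask the model to answer
# using only those passages and to cite the ones it used.

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM client call (e.g., via an OpenAI or Anthropic SDK).
    return f"(model response to a {len(prompt)}-character prompt)"

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query.
    A production system would use embeddings or a search index instead."""
    query_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer_with_rag(query: str, documents: list[str]) -> str:
    """Combine the retrieved passages with the user's question into one prompt."""
    passages = retrieve(query, documents)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the numbered sources below, and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

# Hypothetical policy snippets standing in for a company knowledge base.
policy_docs = [
    "Travel policy: lounge day passes are reimbursable if a flight is delayed over 3 hours.",
    "Expense policy: meals over $75 require a manager's approval.",
]
print(answer_with_rag("Can I expense an airline lounge day pass?", policy_docs))
```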

Why is it important?

There are many use cases where simply answering the user's query isn't sufficient. The user may want specific information that changes regularly, or citations they can use to validate the answer and do further research. For example, say you work at a corporation. You have a question about the travel policy and reimbursements. Specifically, you want to know whether you can submit a receipt for a day pass to the airline lounge, since your flight was delayed. You recall reading something about this during your onboarding, but that was years ago, and the policy has likely changed since then. You could, of course, download the PDF that contains the policy, but that's not very fun, and so 2000s.

Instead, you use the company's employee AI and pose the question. Not only does it give you the answer, but also a link to the PDF and the chapter and page it's on. That AI was trained on general human language, like most LLMs, and your company wisely uses RAG to augment the model's base understanding with knowledge specific to your company. Questions get reliable answers, along with citations that can be used to validate and confirm them.

So why is it "dying"?

RAG is "dying" because of the context windows that are becoming prevalent. Such as Claude 3's or Google Gemini's ~1M token context window. This allows that travel policy to be embedded in the context window, and provide the same sort of answer that the LLM + RAG solution would provide. Or at least one would assume. But it's not quite that simple.

One thing LLMs are both good and bad at is the "needle in a haystack" scenario. These tests are designed to measure the retrieval and reasoning abilities of LLMs with large context windows. One version of the test was designed by Greg Kamradt, using pizza toppings embedded in a large document.

In one specific case, the "needles" were figs, prosciutto, and goat cheese. These phrases were embedded in a long Paul Graham essay (the "haystack"), and the LLM was then asked about the best toppings for pizza. Some LLMs performed better than others, and both the number of needles and their placement correlated directly with performance. Interestingly, LLMs seemed to do better when the needles were placed near the end of the haystack and worse when they were near the beginning, and performance decreased as the number of needles grew.
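Below is a rough sketch of how such a test can be set up. The filler text, needle phrases, depth parameter, and scoring are illustrative stand-ins, not Greg Kamradt's actual harness or prompts.

```python
# Needle-in-a-haystack sketch: bury short "needle" facts at a chosen depth in a
# long filler text, ask the model to recall them, and score how many come back.

def call_llm(prompt: str) -> str:
    # Placeholder for a real long-context model call.
    return "The toppings mentioned are figs, prosciutto, and goat cheese."

NEEDLES = [
    "The first secret pizza topping is figs.",
    "The second secret pizza topping is prosciutto.",
    "The third secret pizza topping is goat cheese.",
]

def build_haystack(filler: str, needles: list[str], depth: float) -> str:
    """Insert the needles at roughly `depth` through the filler (0.0 = start, 1.0 = end)."""
    words = filler.split()
    position = int(len(words) * depth)
    for needle in reversed(needles):
        words.insert(position, needle)
    return " ".join(words)

def score_recall(response: str, keywords: list[str]) -> float:
    """Fraction of needle keywords that appear in the model's answer."""
    hits = sum(1 for kw in keywords if kw in response.lower())
    return hits / len(keywords)

filler_text = "filler " * 50_000  # stand-in for the long Paul Graham essay text
haystack = build_haystack(filler_text, NEEDLES, depth=0.9)  # needles near the end
question = "What are the secret pizza toppings mentioned in the text above?"
response = call_llm(f"{haystack}\n\n{question}")
print(score_recall(response, ["figs", "prosciutto", "goat cheese"]))  # 1.0 here
```

Varying `depth` and the number of needles is what surfaces the placement and count effects described above.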

For instance, with a single needle, LLMs were typically able to provide satisfactory results, but accuracy dropped to nearly 50% by the time you got to 10 needles. Retrieval also appeared easier for the models than reasoning about what they retrieved, and reasoning and planning are areas where LLMs have heretofore been weaker.

You may be wondering, given all of this, why anyone is still using RAG. Very large context windows do reasonably well even with many needles, certainly well enough for most use cases. One other factor is cost: queries against large context window LLMs tend to cost more, and the number of tokens processed contributes directly to a business's overall bill. With a smaller, open-source LLM and a well-built, well-maintained RAG knowledge base, some companies can keep those costs down.
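As a back-of-the-envelope illustration of that cost argument: the token prices, document size, and query volume below are made-up assumptions, not real vendor pricing, but they show how quickly stuffing a full document into every prompt adds up compared with retrieving a few relevant passages.

```python
# Illustrative cost comparison (all numbers are assumptions, not real prices).

PRICE_LARGE_MODEL = 10.00   # dollars per million input tokens, large-context model
PRICE_SMALL_MODEL = 0.50    # dollars per million input tokens, smaller model + RAG

POLICY_TOKENS = 50_000      # entire travel policy placed in every prompt
RETRIEVED_TOKENS = 1_500    # only the passages RAG pulls back per query
QUERIES_PER_MONTH = 20_000

long_context_cost = POLICY_TOKENS * QUERIES_PER_MONTH * PRICE_LARGE_MODEL / 1_000_000
rag_cost = RETRIEVED_TOKENS * QUERIES_PER_MONTH * PRICE_SMALL_MODEL / 1_000_000

print(f"Long-context approach: ${long_context_cost:,.0f} per month")  # $10,000
print(f"RAG + small model:     ${rag_cost:,.0f} per month")           # $15
```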

Summary

So for now, RAG isn't dead per se. It would seem that the majority of use cases where RAG made sense could now be handled more easily by large context window LLMs, albeit at an elevated cost. At some point, as costs come down, context windows continue to grow, and LLMs get better at reasoning and planning, RAG will likely no longer be necessary. But for now, there are certainly use cases where it still makes sense.
