Introduction

The Retrieval-augmented Generation (RAG) framework combines the strengths of information retrieval systems with the generative capabilities of large language models. RAG is particularly useful for tasks where the model needs factual, domain-specific context to generate relevant responses.

RAG workflow

RAG involves two main components: a document retriever and a large language model (LLM) acting as the generator. The retriever is responsible for finding relevant documents based on the input query, and the generator uses the retrieved documents together with the original query to produce a response.

Basic RAG pipeline
Basic workflow of RAG (Source: Haystack)
  1. The Retriever
    • The retriever embeds documents and queries into a high-dimensional vector space; the resulting vectors are typically stored in a vector database for efficient similarity search.
    • Relevance is determined by measuring the distance between each document vector and the query vector, optionally applying a relevance threshold, before the matching documents are passed to the generator.
  2. The Generator
    • The retrieved documents from the retriever and the input query are concatenated and fed into the generator.
    • The generator uses this combined input, with the retrieved documents serving as additional context, to produce a more informed and accurate response while reducing hallucinations. A minimal sketch of this retrieve-then-generate flow follows below.
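
To make the two components concrete, here is a minimal toy sketch of the retrieve-then-generate flow in Python. The embed() and llm() functions are hypothetical placeholders (a real system would call an embedding model and an LLM); only the wiring between retriever and generator is the point.

import numpy as np

def embed(text):
    # placeholder: a real retriever would call an embedding model here
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

def llm(prompt):
    # placeholder: a real generator would call a large language model here
    return f"(answer generated from a prompt of {len(prompt)} characters)"

documents = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings for similarity search.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=1):
    # rank stored documents by cosine similarity to the query embedding
    q = embed(query)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def generate(query):
    # concatenate retrieved context with the query and hand it to the LLM
    context = "\n".join(retrieve(query))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

print(generate("What does RAG combine?"))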

Why do we need RAG?

  1. Reduces hallucinations: When LLMs are not supplied with factual information, they often produce convincing but incorrect responses known as hallucinations. RAG reduces the likelihood of hallucinations by providing the LLM with relevant, factual information.
  2. Real-time data access: RAG gives the LLM direct access to additional, up-to-date data sources at query time.
  3. Cost-effective: Fine-tuning a pre-trained LLM is a resource-intensive process. RAG offers a cost-effective alternative by augmenting an existing model with a retrieval mechanism.
  4. Preserves data privacy: With a self-hosted LLM, sensitive data can be kept on-premises, ensuring data privacy.
  5. Transparency and improved accuracy: A major concern with AI models is their “black box” nature. Because RAG retrieves relevant data before crafting a response, it can cite the sources the response draws on, instilling trust in users.
  6. Scalability: RAG accommodates growing information needs and can adapt to increasing data volumes and user interactions without compromising performance or accuracy.

Naive RAG architecture

RAG architecture
Overall workflow of RAG (Source: Langchain)

Step 1: Document Loading

Document loaders provide a “load” method to load data from a configured source into memory as documents. The SimpleDirectoryReader in Llamaindex creates documents out of every file in a given directory and can read a variety of formats including Markdown, PDFs, Word documents, images, etc. For unsupported formats, we can write our own parsers or use Unstructured.io, which has many built-in extractors for handling multiple formats.
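
For illustration, the snippet below loads every supported file in a local data directory into Document objects (the directory name is only an assumption for this example):

from llama_index.core import SimpleDirectoryReader

# load every supported file in the "data" directory into Document objects
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} document(s)")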

Step 2: Document Transformation - Splitting/Chunking Documents

Once data is loaded, documents are often transformed to better suit the application. Splitting long documents into smaller chunks helps them fit within the model’s context window and yields more accurate retrieval. However, chunks that are too small can lose surrounding context, leading to inaccuracies. Hence, selecting an appropriate chunking strategy is crucial for an effective RAG pipeline.
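
As a sketch, the SentenceSplitter used later in this post accepts a chunk size and overlap; the values below are illustrative starting points rather than recommended settings:

from llama_index.core.node_parser import SentenceSplitter

# split documents (loaded in the previous step) into ~512-token chunks
# with a 50-token overlap between neighbouring chunks
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} chunks")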

Step 3: Embedding and Storing Vectors

Each document chunk is converted into a numerical vector representation called an “embedding”. This allows for efficient similarity search during retrieval. The generated document embeddings are stored in a specialized database known as a “vector database”. Vector databases are optimized for performing fast similarity searches based on vector distances.
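
The snippet below embeds a single chunk with Llamaindex’s OpenAI embedding wrapper; the import path assumes a recent llama-index release and an OpenAI API key in the environment (set up in the implementation section below):

from llama_index.embeddings.openai import OpenAIEmbedding

# embed one chunk of text into a dense vector
# (e.g. 1536 dimensions for OpenAI's default embedding model)
embed_model = OpenAIEmbedding()
vector = embed_model.get_text_embedding("RAG combines retrieval with generation.")
print(len(vector))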

Step 4: Retrieval

Given a query, the retrieval component finds the most relevant document chunks in the database. First, the user query is converted into an embedding using the same model that embedded the documents. The query embedding is then compared to the stored document embeddings in the vector database using a similarity metric such as cosine similarity, and the chunks with the closest semantic similarity are retrieved.
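
In Llamaindex this step is exposed through a retriever. The self-contained sketch below builds a tiny index over two toy documents and fetches the closest match (it needs an OpenAI API key for the default embeddings):

from llama_index.core import Document, VectorStoreIndex

# build a tiny index over two toy documents, then fetch the most similar chunk
index = VectorStoreIndex.from_documents([
    Document(text="RAG combines a retriever with a generator."),
    Document(text="Vector databases enable fast similarity search."),
])
retriever = index.as_retriever(similarity_top_k=1)
for node_with_score in retriever.retrieve("What does RAG combine?"):
    print(node_with_score.score, node_with_score.node.get_content())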

Step 5: Generating output

The LLM takes both the original input and the retrieved information into account and then generates the response. This additional context helps the model produce more contextually appropriate and relevant outputs.
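
Conceptually, this boils down to packing the retrieved chunks and the question into one prompt; the template below only illustrates that idea and is not the exact prompt any particular library uses:

# illustrative prompt assembly: retrieved chunks become the "context" for the LLM
retrieved_chunks = ["...chunk about the author's early programming...",
                    "...chunk about switching to AI..."]
question = "What is the document about?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question
)
# `prompt` is then sent to the LLM, which generates the grounded response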

Technical implementation of RAG using Llamaindex

The code below implements naive RAG using Llamaindex. The script reads a text document, parses it into individual sentences, creates an index from the resulting nodes, builds a query engine from the index, asks a question about the document, and prints the answer to the console.

Install the llama-index Python module using the command below:

pip install llama-index

Llamaindex uses OpenAI embeddings by default to create embeddings for the documents. Create an OpenAI API key on the OpenAI website and add the following code:

import os

# Llamaindex reads the key from the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

The code below uses paul_graham_essay.txt as its knowledge base for RAG. Save it to the data directory and run the following code.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# load the essay from the "data" folder
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham_essay.txt"]
).load_data()

# parse documents into sentence-level nodes (chunks)
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create a vector index over the nodes
index = VectorStoreIndex(nodes)

# create a query engine from the index
query_engine = index.as_query_engine()

# ask questions about your documents
response = query_engine.query("What is the document about?")
print("Response:", response)

Response: The document is about the author’s personal journey and experiences with writing and programming, starting from his early days working on short stories and programming on an IBM 1401 in 9th grade, to transitioning to microcomputers, and eventually delving into the field of artificial intelligence. It also touches on his college experience studying philosophy before switching to AI, influenced by works like Heinlein’s “The Moon is a Harsh Mistress” and experiences with early computer technology.

Evaluating RAG models

The landscape for evaluating LLM-based applications is still evolving, and there is no single comprehensive metric that captures the quality of LLM outputs. However, RAG system evaluation typically covers two aspects:

  1. Retrieval Evaluation: To assess the accuracy and relevance of the retrieved documents
  2. Response Evaluation: To measure the appropriateness of the response generated by the system given the provided context

Retrieval Evaluation

Retrieval evals in RAG model
Retrieval Evals in RAG application (Source: Arize)

Response Evaluation

Response evals in RAG model
Response Evals in RAG application (Source: Arize)

Evaluating output using RAGAs

RAGAs is a popular framework that provides the ingredients needed to evaluate a RAG pipeline at the component level. Instead of relying on human-annotated ground-truth labels in the evaluation dataset, RAGAs uses LLMs under the hood to conduct the evaluations. It takes as input the user query, the RAG-generated answer, and the retrieved contexts; human-annotated ground-truth information is required only for the context_recall metric.

For the retrieval component, the RAGAs metrics are context_relevancy and context_recall; for the generative component, they are faithfulness and answer_relevancy. All metrics range from 0 to 1, with higher values indicating better performance.
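
As a sketch of how this looks in code (metric names and dataset column names differ between RAGAs versions, so treat the details below as assumptions to check against the installed release):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_relevancy,   # retrieval: is the retrieved context relevant to the query?
    context_recall,      # retrieval: does the context cover the ground-truth answer?
    faithfulness,        # generation: is the answer grounded in the context?
    answer_relevancy,    # generation: does the answer address the query?
)

# one evaluation sample: query, RAG answer, retrieved contexts, and a reference
# answer (the reference is needed only for context_recall)
eval_dataset = Dataset.from_dict({
    "question": ["What is the document about?"],
    "answer": ["The essay describes the author's journey through writing and programming."],
    "contexts": [["...retrieved chunk 1...", "...retrieved chunk 2..."]],
    "ground_truths": [["The essay recounts the author's experiences with writing, programming, and AI."]],
})

result = evaluate(
    eval_dataset,
    metrics=[context_relevancy, context_recall, faithfulness, answer_relevancy],
)
print(result)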

References:

  1. Lewis, Patrick, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (2020): 9459–9474.
  2. Llamaindex Docs, https://docs.llamaindex.ai/en/stable/
  3. Langchain Docs, https://python.langchain.com/docs/get_started/introduction
  4. RAG Evaluation, https://arize.com/blog-course/rag-evaluation/
  5. RAGAs LLM Evaluation Module, https://github.com/explodinggradients/ragas