Evaluating RAG applications

Evaluating RAG applications can be challenging. How do you design metrics that measure how well your retrieval pipeline is doing, or whether the LLM hallucinated? Not Diamond provides a simple framework for evaluating RAG applications, with easy-to-use metrics that leverage LLM-as-a-judge to help developers evaluate different parts of their RAG workflow. Not Diamond can also help developers identify more suitable LLMs for response generation by automatically running evaluations on other user-specified LLMs in the background. Below, we go through an example that demonstrates how to use Not Diamond to evaluate your RAG application.

Installation

Python: Requires Python 3.9+. We recommend creating and activating a virtualenv before installing the package. For this example, we also install the optional rag dependencies and the datasets package from Hugging Face.

pip install "notdiamond[rag]" datasets

Setting up

Create a .env file with the API keys of the models you want to use:

OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_API_KEY"
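
Before running the evaluation, it can help to confirm that both keys are actually visible to your Python process. The sketch below uses only the standard library and assumes the variables have already been exported or loaded from the .env file (for example with a tool such as python-dotenv). The helper name missing_keys is hypothetical and not part of Not Diamond.

```python
import os

# Keys this walkthrough relies on (taken from the .env file above).
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`.

    `env` defaults to the current process environment.
    Hypothetical helper for illustration; not part of Not Diamond.
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

If the check reports a missing key, fix the environment first; otherwise calls to the evaluator or generator LLMs will likely fail partway through a run.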

Evaluating RAG pipelines

In this example, we use the Amnesty QA RAG dataset as our RAG pipeline data. It contains the queries, retrieved contexts, and LLM responses. If you need help generating a test dataset for your RAG pipeline, you can check out our guide here.

from datasets import load_dataset
from notdiamond.toolkit.rag.evaluation_dataset import RAGEvaluationDataset, RAGSample
from notdiamond.toolkit.rag.evaluation import evaluate
from notdiamond.toolkit.rag.metrics import ContextRecall, Faithfulness, FactualCorrectness
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond import LLMConfig

def format_prompt(user_input, retrieved_contexts):
    """
    Helper method to format the prompt for RAG generation.
    """
    context = "\n".join(retrieved_contexts)
    prompt = f"""
    Use the following context to answer the question.
    
    Context: {context}
    
    Question: {user_input}
    """
    return prompt

# Download the Amnesty QA RAG dataset from Huggingface
dataset = load_dataset(
    "explodinggradients/amnesty_qa",
    "english_v3",
    trust_remote_code=True
)["eval"]

# Create the RAGSamples from the dataset
samples = []
for sample in dataset:
    rag_sample = RAGSample(
        user_input=sample["user_input"],
        retrieved_contexts=sample["retrieved_contexts"],
        response=sample["response"],
        reference=sample["reference"],
        generation_prompt=format_prompt(sample["user_input"], sample["retrieved_contexts"]),
        generator_llm="openai/gpt-4o"
    )
    samples.append(rag_sample)

# Create the RAGEvaluationDataset using RAGSamples
eval_dataset = RAGEvaluationDataset(samples)

# Define the evaluator LLM
evaluator_llm = get_llm("openai/gpt-4o")

# Define the metrics to evaluate the RAG pipeline
metrics = [
    ContextRecall(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
]

# Define additional generator LLMs that you care about
generator_llms = [
    LLMConfig.from_string("openai/gpt-4o-mini"),
    LLMConfig.from_string("anthropic/claude-3-5-sonnet-20241022")
]

# Evaluate the RAGEvaluation dataset against GPT-4o, GPT-4o-mini, and Claude 3.5 Sonnet
results = evaluate(dataset=eval_dataset, metrics=metrics, generator_llms=generator_llms)

# Print the first rows of the results DataFrame for GPT-4o-mini
print("Results for openai/gpt-4o-mini")
print(results["openai/gpt-4o-mini"].head())

Breaking down this example

First, we load the prepared Amnesty QA RAG dataset and convert each row into a RAGSample. The samples are later wrapped in a RAGEvaluationDataset.

# Download the Amnesty QA RAG dataset from Huggingface
dataset = load_dataset(
    "explodinggradients/amnesty_qa",
    "english_v3",
    trust_remote_code=True
)["eval"]

# Create the RAGSamples from the dataset
samples = []
for sample in dataset:
    rag_sample = RAGSample(
        user_input=sample["user_input"],
        retrieved_contexts=sample["retrieved_contexts"],
        response=sample["response"],
        reference=sample["reference"],
        generation_prompt=format_prompt(sample["user_input"], sample["retrieved_contexts"]),
        generator_llm="openai/gpt-4o"
    )
    samples.append(rag_sample)

Each sample in this dataset represents key elements in a typical RAG pipeline:

  • user_input: the query from the user.
  • retrieved_contexts: a list of strings retrieved from a document store based on the user_input.
  • response: the response generated by the LLM for the user_input, based on the retrieved_contexts.
  • reference: the ground truth answer to the user_input based on the retrieved_contexts.
  • generation_prompt: the prompt used to query the LLM containing both the user_input as well as the retrieved_contexts.
  • generator_llm: the LLM used to generate the response.
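
To make the relationship between user_input, retrieved_contexts, and generation_prompt concrete, here is a small sketch that runs the format_prompt helper from the example on a made-up sample. The sample values are invented for illustration and do not come from the Amnesty QA dataset. A plain dict stands in for a RAGSample so the snippet runs without any dependencies.

```python
def format_prompt(user_input, retrieved_contexts):
    """Same helper as in the example above: joins the contexts and appends the question."""
    context = "\n".join(retrieved_contexts)
    return f"""
    Use the following context to answer the question.

    Context: {context}

    Question: {user_input}
    """


# Made-up sample for illustration only; a dict stands in for a RAGSample here.
toy_sample = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": [
        "Paris is the capital of France.",
        "France is in Europe.",
    ],
    "response": "Paris.",
    "reference": "Paris is the capital of France.",
}

# The generation_prompt is derived from the other two fields.
prompt = format_prompt(toy_sample["user_input"], toy_sample["retrieved_contexts"])
print(prompt)
```

The printed prompt contains every retrieved context, one per line, followed by the question, which is the text a generator LLM would receive.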

Next, we define a set of metrics that we want to evaluate the responses with.

# Define the evaluator LLM
evaluator_llm = get_llm("openai/gpt-4o")

# Define the metrics to evaluate the RAG pipeline
metrics = [
    ContextRecall(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
]

Here, we use the following metrics, which help us evaluate different parts of the RAG pipeline:

  • ContextRecall: measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results.
  • Faithfulness: measures the factual consistency of the generated answer against the given context. It is calculated from response and retrieved_contexts. A low faithfulness score can indicate hallucinations in the LLM response.

Both of these metrics use LLMs to help perform the measurement, so we define an evaluator_llm using openai/gpt-4o. You can use any of the LLMs that we support. For a list of metrics that Not Diamond supports, as well as how to define custom metrics, see evaluation metrics.
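
To build intuition for what these two scores mean, the sketch below shows one common way LLM-as-a-judge metrics of this kind are scored: as the fraction of claims that the judge marks as supported. This formulation follows the definitions used by the Ragas library; whether Not Diamond's metrics compute exactly this is an assumption. The verdict lists are hand-labeled stand-ins for the judgments an evaluator LLM would produce, and supported_fraction is a hypothetical helper.

```python
def supported_fraction(verdicts):
    """Return the fraction of True verdicts, or 0.0 if there are none.

    Hypothetical helper for illustration; not Not Diamond's implementation.
    """
    verdicts = list(verdicts)
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hand-labeled verdicts standing in for an evaluator LLM's judgments.
# Faithfulness: is each claim in the *response* supported by the retrieved contexts?
response_claims_supported = [True, True, False, True]
# Context recall: can each claim in the *reference* be found in the retrieved contexts?
reference_claims_retrieved = [True, False]

faithfulness = supported_fraction(response_claims_supported)
context_recall = supported_fraction(reference_claims_retrieved)

print(f"faithfulness={faithfulness:.2f}, context_recall={context_recall:.2f}")
# prints: faithfulness=0.75, context_recall=0.50
```

Under this reading, a low faithfulness score points at the generator (unsupported claims in the response), while a low context recall score points at the retriever (reference information that was never retrieved).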

Finally, before evaluating the dataset, we define the set of generator_llms we are interested in for response generation. These can include LLMs that were not used to generate the responses in the dataset. Not Diamond will automatically produce evaluation results for those LLMs as well, in addition to the one defined in the dataset.

# Define additional generator LLMs that you care about
generator_llms = [
    LLMConfig.from_string("openai/gpt-4o-mini"),
    LLMConfig.from_string("anthropic/claude-3-5-sonnet-20241022")
]

# Evaluate the RAGEvaluation dataset against GPT-4o, GPT-4o-mini, and Claude 3.5 Sonnet
results = evaluate(dataset=eval_dataset, metrics=metrics, generator_llms=generator_llms)
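
Since results maps each LLM name to a DataFrame of per-sample scores, a natural next step is to compare average scores across LLMs. The sketch below does this with the standard library on toy numbers. It assumes each DataFrame has been converted to a list of row dicts (for example with pandas' df.to_dict("records")) and that the metric columns are named faithfulness and context_recall; both the column names and the values are hypothetical.

```python
from statistics import mean


def mean_scores(rows, metric_names):
    """Average each named metric over a list of per-sample row dicts."""
    return {name: mean(row[name] for row in rows) for name in metric_names}


# Toy per-sample scores; the column names and values are made up for illustration.
# Context recall is identical across LLMs because it depends only on retrieval.
toy_results = {
    "openai/gpt-4o": [
        {"faithfulness": 1.0, "context_recall": 0.5},
        {"faithfulness": 0.5, "context_recall": 1.0},
    ],
    "openai/gpt-4o-mini": [
        {"faithfulness": 0.5, "context_recall": 0.5},
        {"faithfulness": 0.5, "context_recall": 1.0},
    ],
}

metric_names = ["faithfulness", "context_recall"]
summary = {llm: mean_scores(rows, metric_names) for llm, rows in toy_results.items()}

# Pick the LLM with the highest average faithfulness.
best = max(summary, key=lambda llm: summary[llm]["faithfulness"])

for llm, scores in summary.items():
    print(llm, scores)
print("Highest mean faithfulness:", best)
```

Averages hide per-query differences, which is why a per-query router, as described in the next section, can do better than picking a single best LLM.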

Wrapping up

Now that we've completed our evaluations, our results dictionary can be used directly with Not Diamond's custom router training workflow to create a custom router that uses the most suitable LLM for each RAG query.