Evaluation metrics

Not Diamond supports all metrics included in ragas. To simplify the interface to ragas metrics, we provide the get_llm and get_embedding utility methods for constructing ragas-compatible LLM and embedding models. Below are some commonly used metrics and usage patterns. You can also define custom metrics.

Context precision

ContextPrecision measures the proportion of relevant chunks in the retrieved_contexts. It is calculated as the mean of the precision@k for each chunk in the context, where precision@k is the ratio of the number of relevant chunks within the top k results to the total number of chunks in the top k.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond.toolkit.rag.metrics import ContextPrecision

evaluator_llm = get_llm("openai/gpt-4o")
context_precision = ContextPrecision(llm=evaluator_llm)

sample = RAGSample(
    user_input="Where is the Eiffel Tower located?",
  	response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await context_precision.single_turn_ascore(sample)
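
For intuition, the aggregation can be reproduced by hand from per-chunk relevance verdicts. The sketch below is purely illustrative and not part of the Not Diamond API: the verdicts are assumed LLM judgments, and the score is the average of precision@k taken at the positions of relevant chunks.

# Illustrative only: mean of precision@k over the positions of relevant chunks,
# given hypothetical relevance verdicts (1 = relevant, 0 = irrelevant) in rank order.
verdicts = [1, 0, 1]

precisions_at_k = []
relevant_so_far = 0
for k, verdict in enumerate(verdicts, start=1):
    relevant_so_far += verdict
    if verdict:  # precision@k is only collected where the chunk is relevant
        precisions_at_k.append(relevant_so_far / k)

score = sum(precisions_at_k) / max(len(precisions_at_k), 1)
print(score)  # (1/1 + 2/3) / 2 ≈ 0.83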

ContextPrecisionWithoutReference

The ContextPrecisionWithoutReference metric can be used when you don't have a reference answer associated with a user_input. To estimate whether a retrieved chunk is relevant, this metric uses the LLM to compare each chunk in retrieved_contexts with the response.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond.toolkit.rag.metrics import ContextPrecisionWithoutReference

evaluator_llm = get_llm("openai/gpt-4o")
context_precision = ContextPrecisionWithoutReference(llm=evaluator_llm)

sample = RAGSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await context_precision.single_turn_ascore(sample)
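
The only difference from ContextPrecision is where the relevance verdicts come from: each retrieved chunk is judged against the response rather than against a reference. The snippet below is a rough, LLM-free stand-in for that judgment; the naive_verdict helper and its word-overlap heuristic are hypothetical and for illustration only.

# Illustrative only: derive a relevance verdict per chunk by comparing it with the
# response (the real metric asks the evaluator LLM to make this judgment).
response = "The Eiffel Tower is located in Paris."
retrieved_contexts = [
    "The Eiffel Tower is located in Paris.",
    "Paris hosts the Louvre museum.",
]

def naive_verdict(chunk: str, answer: str) -> int:
    """Hypothetical stand-in: call the chunk relevant if it shares most of its words with the answer."""
    chunk_words = set(chunk.lower().split())
    answer_words = set(answer.lower().split())
    return int(len(chunk_words & answer_words) / len(answer_words) > 0.5)

verdicts = [naive_verdict(chunk, response) for chunk in retrieved_contexts]
print(verdicts)  # [1, 0] -- these verdicts feed the same precision@k average as above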

ContextRecall

ContextRecall is computed using user_input, reference, and the retrieved_contexts, and values range between 0 and 1, with higher values indicating better performance. This metric uses reference as a proxy for reference_contexts, which makes it easier to use, since annotating reference contexts can be very time-consuming. To estimate context recall, the reference is broken down into claims, and each claim in the reference answer is analyzed to determine whether it can be attributed to the retrieved context.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond.toolkit.rag.metrics import ContextRecall

evaluator_llm = get_llm("openai/gpt-4o")
context_recall = ContextRecall(llm=evaluator_llm)

sample = RAGSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["Paris is the capital of France."],
    generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await context_recall.single_turn_ascore(sample)
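
Concretely, the score is the fraction of reference claims that can be attributed to the retrieved contexts. Below is a hand-worked illustration for a hypothetical two-claim reference; the claim split and attribution verdicts are assumptions, not API output.

# Illustrative only: context recall as the share of reference claims that are
# supported by the retrieved contexts.
reference_claims = [
    "The Eiffel Tower is located in Paris.",
    "Paris is the capital of France.",
]
# Hypothetical attributions against retrieved_contexts=["Paris is the capital of France."]:
attributions = [0, 1]  # first claim not found in the retrieved context, second claim found

score = sum(attributions) / len(attributions)
print(score)  # 0.5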

Faithfulness

The Faithfulness metric measures the factual consistency of the generated answer against the given context. It is calculated from the response and the retrieved context, and the score is scaled to the (0, 1) range; higher is better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. A low faithfulness score can be used to identify instances of hallucination.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond.toolkit.rag.metrics import Faithfulness

evaluator_llm = get_llm("openai/gpt-4o")
faithfulness = Faithfulness(llm=evaluator_llm)

sample = RAGSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await faithfulness.single_turn_ascore(sample)
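
In other words, the score is the number of supported response claims divided by the total number of response claims. A minimal illustration with assumed claims and verdicts:

# Illustrative only: faithfulness as supported claims / total claims.
response_claims = [
    "The Eiffel Tower is located in Paris.",    # inferable from the retrieved context
    "The Eiffel Tower was completed in 1889.",  # hypothetical extra claim, absent from the context
]
supported = [1, 0]  # assumed verdicts from cross-checking each claim against the context

score = sum(supported) / len(supported)
print(score)  # 0.5 -- unsupported claims lower the score and flag possible hallucination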

NoiseSensitivity

NoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the user_input, reference, response, and the retrieved_contexts.

To estimate noise sensitivity, each claim in the generated response is examined to determine whether it is correct based on the ground truth and whether it can be attributed to the relevant (or irrelevant) retrieved context.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm
from notdiamond.toolkit.rag.metrics import NoiseSensitivity

evaluator_llm = get_llm("openai/gpt-4o")
noise_sensitivity = NoiseSensitivity(llm=evaluator_llm)

sample = RAGSample(
    user_input="What is the Life Insurance Corporation of India (LIC) known for?",
    response="The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributes to the financial stability of the country.",
    reference="The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments.",
    retrieved_contexts=[
        "The Life Insurance Corporation of India (LIC) was established in 1956 following the nationalization of the insurance industry in India.",
        "LIC is the largest insurance company in India, with a vast network of policyholders and huge investments.",
        "As the largest institutional investor in India, LIC manages substantial funds, contributing to the financial stability of the country.",
        "The Indian economy is one of the fastest-growing major economies in the world, thanks to sectors like finance, technology, manufacturing etc."
    ],
  	generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await noise_sensitivity.single_turn_ascore(sample)
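
Roughly, the score is the share of response claims that are incorrect with respect to the reference while being attributable to the retrieved contexts. Below is a hand-worked sketch for the sample above; the claim decomposition and verdicts are assumptions rather than actual API output.

# Illustrative only: noise sensitivity as incorrect claims / total claims.
response_claims = [
    "LIC is the largest insurance company in India.",              # correct per the reference
    "LIC is known for its vast portfolio of investments.",         # correct per the reference
    "LIC contributes to the financial stability of the country.",  # absent from the reference, drawn from a retrieved chunk
]
incorrect = [0, 0, 1]  # assumed verdicts against the reference

score = sum(incorrect) / len(incorrect)
print(score)  # ≈ 0.33 -- lower is better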

ResponseRelevancy

The ResponseRelevancy metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the user_input, the retrieved_contexts, and the response.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, the assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.

from notdiamond.toolkit.rag.evaluation_dataset import RAGSample
from notdiamond.toolkit.rag.llms import get_llm, get_embedding
from notdiamond.toolkit.rag.metrics import ResponseRelevancy

evaluator_llm = get_llm("openai/gpt-4o")
evaluator_embedding = get_embedding("openai/text-embedding-3-large")
response_relevancy = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embedding)

sample = RAGSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
    generator_llm="openai/gpt-4o",
    generation_prompt=""
)

score = await response_relevancy.single_turn_ascore(sample)
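
Under the hood, the score is a mean cosine similarity between the embedding of the original question and the embeddings of questions the LLM generates back from the answer. The sketch below only illustrates that aggregation; the placeholder vectors stand in for real embeddings from the evaluator embedding model.

import math

# Illustrative only: mean cosine similarity between the original question and
# questions the LLM generates from the answer. The vectors are tiny placeholders;
# the metric uses the evaluator embedding model to produce real embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

original_question_vec = [0.9, 0.1, 0.0]
generated_question_vecs = [  # embeddings of questions written back from the answer
    [0.85, 0.15, 0.05],
    [0.7, 0.3, 0.1],
]

score = sum(cosine(original_question_vec, v) for v in generated_question_vecs) / len(generated_question_vecs)
print(score)  # close to 1 when the answer stays on-topic with the original question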