LLM evaluation tutorial

To leverage Not Diamond's custom router training workflow, we need evaluation data that measures how different LLMs perform on our use case. This tutorial is an introduction to LLM evaluation for anyone who is new to it: we'll walk through how to generate evaluation scores for LLM responses on a specific dataset, and then how to use those scores to train a custom router. We'll use DeepEval as the evaluation framework in this example.

Initialization

We'll start by installing Not Diamond and DeepEval:

pip install 'notdiamond[create]' deepeval

We'll also create an eval.py file with the following import statements:

from pprint import pprint

from deepeval.benchmarks import HellaSwag
from deepeval.benchmarks.tasks import HellaSwagTask
from deepeval.models.base_model import DeepEvalBaseLLM
from langchain_core.prompts import ChatPromptTemplate

from notdiamond import LLMConfig, NotDiamond

Evaluating our data and LLMs

👍

Evaluation agnostic

In this example we'll rely on DeepEval's exact match scorer. However, your evaluation scores can be based on any other evaluation metric, such as edit distance, human feedback, or LLM-as-a-judge scores.
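For instance, a lightweight string-similarity metric can stand in for exact match. The sketch below uses only Python's standard library; the function name and the example strings are illustrative and not part of DeepEval or Not Diamond.

from difflib import SequenceMatcher

# Illustrative alternative metric: normalized string similarity in [0, 1]
# between a model response and a reference answer.
def similarity_score(response: str, reference: str) -> float:
    return SequenceMatcher(None, response.strip().lower(), reference.strip().lower()).ratio()

# A near-miss response earns partial credit rather than the hard 0 it would get under exact match.
print(similarity_score("Drink plenty of water.", "Drink plenty of fluids."))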

Defining our LLM inputs

To get started, let's define our set of LLM inputs. In this example, we'll use the health task from the HellaSwag benchmark:

hellaswag = HellaSwag(tasks=[HellaSwagTask.HEALTH])
test_goldens = hellaswag.load_benchmark_dataset(task=HellaSwagTask.HEALTH)

pprint(vars(test_goldens[0])) # Print the first item in the benchmark

Generating and evaluating responses from candidate LLMs

Now that we've defined our inputs, let's generate responses for every LLM we'd like to include in our router and add the responses to our dataset. For this example we'll use GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B Instruct, but you're welcome to choose any combination of LLMs. You may be surprised by which models end up driving performance gains in your application.

First, we'll create a wrapper class that combines the Not Diamond SDK with DeepEval, then we'll generate responses directly from each model's API:

class BaseNDEvalLLM(DeepEvalBaseLLM):
  
    # Setting up
    def __init__(
        self,
        provider: str,
        system_prompt: str = "",
    ):
        self.ndllm = LLMConfig.from_string(provider) # Initialize the LLM config from the provider string
        self.model = NotDiamond._llm_from_config(self.ndllm, None) # Create the LLM instance from the provider config
        self.system_prompt = system_prompt # Store the system prompt

    def load_model(self):
        return self.model # Return the initialized model

    def get_model_name(self, *args, **kwargs) -> str:
        return f"{self.ndllm.provider}/{self.ndllm.model}" # Return provider/model as a string

    @property
    def model_name(self) -> str:
        return f"{self.ndllm.provider}/{self.ndllm.model}"

    # Generating responses 
    def generate(self, prompt: str) -> str:
        model = self.load_model() # Load the model
        # Prepare messages with system prompt and user prompt
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        # Chain together the prompt template and model
        chain_messages = [
            (msg["role"], msg["content"]) for msg in messages
        ]
        prompt_template = ChatPromptTemplate.from_messages(chain_messages)
        chain = prompt_template | model
        # Invoke the chain and get the result
        result = chain.invoke({})
        return result if isinstance(result, str) else result.content

    async def a_generate(self, prompt: str) -> str: # async
        model = self.load_model()
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        chain_messages = [
            (msg["role"], msg["content"]) for msg in messages
        ]
        prompt_template = ChatPromptTemplate.from_messages(chain_messages)
        chain = prompt_template | model
        result = await chain.ainvoke({})
        return result if isinstance(result, str) else result.content

    def __hash__(self):
        return hash((self.ndllm.provider, self.ndllm.model, self.system_prompt))

    def __str__(self):
        return self.model_name

    def __repr__(self):
        return self.model_name

We can now use this wrapper class to evaluate models through the Not Diamond and DeepEval SDKs.
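Before running the full benchmark, it can help to sanity-check the wrapper on a single prompt. This is a minimal sketch; the provider string and prompt are arbitrary examples.

sanity_model = BaseNDEvalLLM("openai/gpt-4o-2024-05-13")
print(sanity_model.model_name) # e.g. "openai/gpt-4o-2024-05-13"
print(sanity_model.generate("In one sentence, what does the HellaSwag benchmark measure?"))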

🚧

Running evaluations will incur inference costs

Remember that evaluations incur inference costs for every LLM call they make. For this tutorial, expect to spend $1-2 to run the whole dataset.

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "anthropic/claude-3-5-sonnet-20240620",
    "replicate/meta-llama-3-70b-instruct"
]

model_to_eval = {}
for model in llm_providers:
    print(f"Evaluating {model} on Hellaswag.HEALTH")
    nd_eval_model = BaseNDEvalLLM(model)
    model_to_eval[nd_eval_model.model_name] = hellaswag.evaluate(nd_eval_model)

for model_name, accuracy in model_to_eval.items():
    print(f"{model_name} on Hellaswag.HEALTH = {accuracy}")

📘

Joint prompt optimization

You can score all your models on the same prompt, or you can perform automatic prompt optimization to use the best prompt for each model. You can evaluate specific (prompt, model) pairs and pass them into Not Diamond just as described in this example, and Not Diamond will learn to route between the best (prompt, model) pairs for each query.
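For example, since BaseNDEvalLLM accepts a system_prompt, one way to sketch (prompt, model) pair evaluation is to score the same provider under different system prompts and keep each variant's result separately. The system prompts below are illustrative.

pairs = [
    ("openai/gpt-4o-2024-05-13", "Answer with only the letter of the best completion."),
    ("openai/gpt-4o-2024-05-13", "Briefly weigh each option, then answer with a single letter."),
]

pair_to_eval = {}
for provider, system_prompt in pairs:
    eval_model = BaseNDEvalLLM(provider, system_prompt=system_prompt)
    pair_to_eval[(eval_model.model_name, system_prompt)] = hellaswag.evaluate(eval_model)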

Training a custom router

Now that we've familiarized ourselves with LLM evaluation and created a dataset, we can train a custom router optimized for our use case. To continue, head over to the custom router training quickstart page.
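Custom router training works from per-prompt scores for each candidate model rather than a single aggregate accuracy. As a rough sketch of what that data can look like, the snippet below collects one row per prompt with one score column per model; the column names, the use of golden.input and golden.expected_output, and the exact-match scoring are illustrative assumptions, so check the quickstart for the exact format Not Diamond expects.

import pandas as pd

# Illustrative sketch only: one row per prompt, one exact-match score column per candidate model.
eval_models = {provider: BaseNDEvalLLM(provider) for provider in llm_providers}

rows = []
for golden in test_goldens:
    row = {"prompt": golden.input}
    for eval_model in eval_models.values():
        response = eval_model.generate(golden.input)
        # Swap in any metric you prefer (edit distance, human feedback, LLM-as-a-judge, ...)
        row[f"{eval_model.model_name}/score"] = float(response.strip() == golden.expected_output)
    rows.append(row)

eval_df = pd.DataFrame(rows)
print(eval_df.head())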