LLM evaluation tutorial
To use Not Diamond's custom router training workflow, we need evaluation data that measures how various LLMs perform on our use case. This tutorial is an introduction to LLM evaluation for anyone who is new to it. We'll walk through how to generate evaluation scores for LLM responses on a specific dataset, and then how to use these scores to train a custom router. We'll use DeepEval as the evaluation framework in this example.
Initialization
We'll start by installing Not Diamond and DeepEval:
pip install 'notdiamond[create]' deepeval
We'll also create an eval.py file with the following import statements:
from pprint import pprint
from deepeval.benchmarks import HellaSwag
from deepeval.benchmarks.tasks import HellaSwagTask
from deepeval.models.base_model import DeepEvalBaseLLM
from langchain_core.prompts import ChatPromptTemplate
from notdiamond import LLMConfig, NotDiamond
Evaluating our data and LLMs
Evaluation agnostic
In this example we'll rely on DeepEval's exact match scorer. However, your evaluation scores can be based on any other evaluation metric, such as edit distance, human feedback, or LLM-as-a-judge scores.
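To make the alternatives concrete, here is a minimal sketch of two such metrics written with the standard library only: an exact match scorer and an edit-distance score normalized to the range 0 to 1. The function names and the normalization (lowercasing, stripping whitespace, dividing by the longer string's length) are illustrative assumptions, not DeepEval's implementation.

```python
def exact_match(prediction: str, target: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == target.strip().lower())


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, target: str) -> float:
    """Map edit distance to a score in [0, 1], where 1.0 means identical."""
    longest = max(len(prediction), len(target))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, target) / longest
```

Exact match suits multiple-choice benchmarks such as HellaSwag, where the answer is a single label. A graded score such as edit similarity may be more informative for free-form outputs.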
Defining our LLM inputs
To get started, let's define our set of LLM inputs. In this example, we'll use the HellaSwag health dataset:
hellaswag = HellaSwag(tasks=[HellaSwagTask.HEALTH])
test_goldens = hellaswag.load_benchmark_dataset(task=HellaSwagTask.HEALTH)
pprint(vars(test_goldens[0])) # Print the first item in the benchmark
Generating and evaluating responses from candidate LLMs
Now that we've defined our inputs, let's generate responses for every LLM we'd like to include in our router and add the responses to our dataset. For this example we'll use GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B, but you're welcome to choose any combination of LLMs. You may be surprised by which models end up driving performance gains in your application.
First, we'll create a wrapper class that combines the Not Diamond SDK with DeepEval, then we'll generate responses directly from each model's API:
class BaseNDEvalLLM(DeepEvalBaseLLM):
    # Setting up
    def __init__(
        self,
        provider: str,
        system_prompt: str = "",
    ):
        self.ndllm = LLMConfig.from_string(provider)  # Initialize the NDLLMProvider
        self.model = NotDiamond._llm_from_config(self.ndllm, None)  # Create LLM instance from provider
        self.system_prompt = system_prompt  # Store the system prompt

    def load_model(self):
        return self.model  # Return the initialized model

    async def get_model_name(self, *args, **kwargs) -> str:
        return f"{self.ndllm.provider}/{self.ndllm.model}"  # Return provider/model as a string

    @property
    def model_name(self, *args, **kwargs) -> str:
        return f"{self.ndllm.provider}/{self.ndllm.model}"

    # Generating responses
    def generate(self, prompt: str) -> str:
        model = self.load_model()  # Load the model
        # Prepare messages with system prompt and user prompt
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        # Chain together the prompt template and model
        chain_messages = [
            (msg["role"], msg["content"]) for msg in messages
        ]
        prompt_template = ChatPromptTemplate.from_messages(chain_messages)
        chain = prompt_template | model
        # Invoke the chain and get the result
        result = chain.invoke({})
        return result if isinstance(result, str) else result.content

    async def a_generate(self, prompt: str) -> str:  # async
        model = self.load_model()
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        prompt_template = NotDiamond._prepare_prompt_template(None, messages)
        chain = prompt_template | model
        result = await chain.ainvoke({})
        return result if isinstance(result, str) else result.content

    def __hash__(self):
        return hash((self.ndllm.provider, self.ndllm.model, self.system_prompt))

    def __str__(self):
        return self.model_name

    def __repr__(self):
        return self.model_name
We can now use this wrapper class to evaluate each candidate model with the Not Diamond and DeepEval SDKs.
Running evaluations will incur inference costs
Remember that evaluations incur inference costs that scale with the number of LLM calls made. For this tutorial, expect to spend $1-2 to run the whole dataset.
llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "anthropic/claude-3-5-sonnet-20240620",
    "replicate/meta-llama-3-70b-instruct",
]
model_to_eval = {}
for model in llm_providers:
    print(f"Evaluating {model} on Hellaswag.HEALTH")
    nd_eval_model = BaseNDEvalLLM(model)
    model_to_eval[nd_eval_model.model_name] = hellaswag.evaluate(nd_eval_model)

for model_name, accuracy in model_to_eval.items():
    print(f"{model_name} on Hellaswag.HEALTH = {accuracy}")
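The loop above reports one aggregate accuracy per model, while a router learns from how each model scored on each individual prompt. As a rough sketch, assuming you have collected per-prompt records of the form (prompt, response, score) for every model, the snippet below pivots them into one row per prompt. The helper name, the record shape, the column naming, and the sample values are all hypothetical placeholders; check your DeepEval version for how it exposes per-example predictions, and the custom router quickstart for the exact dataset layout it expects.

```python
from collections import defaultdict


def build_score_table(per_model_records):
    """Pivot {model: [(prompt, response, score), ...]} into one row per prompt."""
    rows = defaultdict(dict)
    for model_name, records in per_model_records.items():
        for prompt, response, score in records:
            # Hypothetical column naming: "<model>/response" and "<model>/score"
            rows[prompt][f"{model_name}/response"] = response
            rows[prompt][f"{model_name}/score"] = score
    return [{"prompt": prompt, **columns} for prompt, columns in rows.items()]


# Placeholder records for illustration only, not real evaluation results
records = {
    "openai/gpt-4o-2024-05-13": [("Q1", "A", 1), ("Q2", "C", 0)],
    "anthropic/claude-3-5-sonnet-20240620": [("Q1", "B", 0), ("Q2", "D", 1)],
}
table = build_score_table(records)
for row in table:
    print(row)
```

Keeping every model's score on the same row makes it easy to see which prompts each model wins on, which is the signal a router needs.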
Joint prompt optimization
You can score all your models on the same prompt, or you can perform automatic prompt optimization to use the best prompt for each model. You can evaluate specific (prompt, model) pairs and pass them into Not Diamond just as described in this example, and Not Diamond will learn to route between the best (prompt, model) pairs for each query.
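One simple way to approach this is to score every (prompt, model) combination and keep the best prompt for each model. The sketch below assumes a caller-supplied `evaluate_pair` function that returns a score for a given pair; the helper name and the placeholder scores are hypothetical. In this tutorial's setup, `evaluate_pair` could plausibly wrap `hellaswag.evaluate(BaseNDEvalLLM(model, system_prompt=prompt))`, at the cost of one full benchmark run per pair.

```python
from itertools import product


def best_prompt_per_model(models, system_prompts, evaluate_pair):
    """Score every (model, prompt) pair and keep the top prompt per model."""
    best = {}
    for model, prompt in product(models, system_prompts):
        score = evaluate_pair(model, prompt)
        # Keep the first prompt seen, or replace it when a strictly better one appears
        if model not in best or score > best[model][1]:
            best[model] = (prompt, score)
    return best


# Placeholder scores standing in for real evaluation runs
fake_scores = {
    ("model-a", "Be concise."): 0.70,
    ("model-a", "Think step by step."): 0.82,
    ("model-b", "Be concise."): 0.91,
    ("model-b", "Think step by step."): 0.88,
}
best = best_prompt_per_model(
    models=["model-a", "model-b"],
    system_prompts=["Be concise.", "Think step by step."],
    evaluate_pair=lambda model, prompt: fake_scores[(model, prompt)],
)
print(best)
```

Note that an exhaustive grid like this multiplies inference cost by the number of prompts, so it is worth starting with a small prompt set.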
Training a custom router
Now that we've familiarized ourselves with LLM evaluation and created a dataset, we can train a custom router optimized for our use case. To continue, head over to the custom router training quickstart page.