Training a custom router
Not Diamond is a framework for training custom routing algorithms across a range of candidate LLMs on our evaluation data.
For any given distribution of data, a single model will rarely outperform every other model on every query. By combining multiple models into a "meta-model" that learns when to call each LLM, we can beat every individual model’s performance and even drive down costs and latency in the process.
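To make the idea concrete, here is a minimal sketch with made-up scores for three placeholder models (`model-a`, `model-b`, `model-c` are not real model names). It compares each model's average score with an "oracle" that picks the best model per prompt. The oracle is only an upper bound: a trained router approximates it and will not necessarily reach it.

```python
# Toy per-prompt scores (1.0 = pass, 0.0 = fail); model names are placeholders.
scores = {
    "model-a": [1.0, 1.0, 0.0, 1.0, 0.0],
    "model-b": [0.0, 1.0, 1.0, 1.0, 0.0],
    "model-c": [1.0, 0.0, 1.0, 0.0, 1.0],
}


def mean(values):
    return sum(values) / len(values)


# Average score of each model when it answers every prompt on its own.
single_model_avg = {name: mean(vals) for name, vals in scores.items()}

# Upper bound for a router: take the best score available for each prompt.
oracle_scores = [max(per_prompt) for per_prompt in zip(*scores.values())]
oracle_avg = mean(oracle_scores)

print(single_model_avg)
print(oracle_avg)
```

In this toy table no model passes every prompt, yet every prompt is passed by at least one model, which is exactly the situation where routing helps.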
Not Diamond integrates with any existing evaluation pipeline and is completely agnostic to our choice of evaluation methods, metrics, frameworks, and tools. All we need is the following three things:
- A set of LLM prompts: Prompts must be strings and should be representative of the prompts used in our application.
- LLM responses: The responses from candidate LLMs for each input. Candidate LLMs can include both our supported LLMs and your own custom models.
- Evaluation scores for responses to the inputs from candidate LLMs: Scores are numbers, and can be any metric that fits your needs.
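Before assembling a training set, it can help to sanity-check these three ingredients. The sketch below is one possible check, not part of the Not Diamond SDK: the function name `validate_records` and the record keys `prompt`, `response`, and `score` are assumptions chosen for illustration.

```python
import math
from numbers import Real


def validate_records(records):
    """Return a list of problems found in a list of evaluation records.

    Each record is assumed to be a dict with the hypothetical keys
    "prompt", "response", and "score".
    """
    problems = []
    for i, record in enumerate(records):
        prompt = record.get("prompt")
        score = record.get("score")
        # Prompts must be strings.
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append(f"row {i}: prompt must be a non-empty string")
        # Every prompt needs a response from the candidate LLM.
        if "response" not in record:
            problems.append(f"row {i}: missing response")
        # Scores must be numbers.
        if not isinstance(score, Real):
            problems.append(f"row {i}: score must be a number")
        elif math.isnan(score):
            problems.append(f"row {i}: score is NaN")
    return problems


records = [
    {"prompt": "Write merge sort.", "response": "def merge_sort...", "score": 1.0},
    {"prompt": 3, "response": "n/a", "score": "high"},
]
print(validate_records(records))
```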
Below, we will go through a Python example using evaluation results for openai/gpt-4o-2024-05-13, openai/gpt-4-turbo-2024-04-09, google/gemini-1.5-pro-latest, anthropic/claude-3-opus-20240229, and anthropic/claude-3-5-sonnet-20240620 on the HumanEval dataset.
You can follow along with the code below, or try it in Colab.
Initialization
To get started, let's download the dataset that we've prepared for this example:
curl -L "https://drive.google.com/uc?export=download&id=1q1zNZHioy9B7M-WRjsJPkfvFosfaHX38" -o humaneval.csv
Next, we'll create a train.py file and import pandas and notdiamond.toolkit, which we'll use to train our custom router:
import pandas as pd
from notdiamond.toolkit import CustomRouter
Training quickstart
We'll begin by loading the dataset into a pandas.DataFrame:
df = pd.read_csv("humaneval.csv")
print(df.columns)
Next we'll separate the data into individual LLM datasets. We will also define a test split so that we can later evaluate our custom router's performance, as well as an extra split to show how we can update our router with additional data.
llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    "anthropic/claude-3-5-sonnet-20240620"
]

pzn_train = {}
pzn_test = {}
pzn_extra = {}

for provider in llm_providers:
    provider_results = df.filter(
        ["Input", f"{provider}/response", f"{provider}/final_score"], axis=1
    )
    provider_results.rename(
        columns={
            f"{provider}/response": "response",
            f"{provider}/final_score": "score"
        },
        inplace=True
    )

    # Create train/test/extra split
    train = provider_results.sample(frac=0.8, random_state=420)
    remainder = provider_results.drop(train.index)
    test = remainder.sample(frac=0.9, random_state=420)
    extra = remainder.drop(test.index)

    pzn_train[provider] = train
    pzn_test[provider] = test
    pzn_extra[provider] = extra
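The split logic above can be checked in isolation. The sketch below repeats it on a synthetic table and verifies that the three splits are disjoint and together cover every row. The 164-row size matches the original HumanEval benchmark and is only an assumption about the prepared CSV.

```python
import pandas as pd

# Synthetic stand-in for one model's results; contents are made up.
provider_results = pd.DataFrame({
    "Input": [f"prompt {i}" for i in range(164)],
    "response": ["..."] * 164,
    "score": [i % 2 for i in range(164)],
})

# Same split as in the training script: 80% train, then 90% of the rest as test.
train = provider_results.sample(frac=0.8, random_state=420)
remainder = provider_results.drop(train.index)
test = remainder.sample(frac=0.9, random_state=420)
extra = remainder.drop(test.index)

# The three splits should be pairwise disjoint and cover every row.
assert train.index.intersection(test.index).empty
assert train.index.intersection(extra.index).empty
assert test.index.intersection(extra.index).empty
assert len(train) + len(test) + len(extra) == len(provider_results)

print(len(train), len(test), len(extra))
```

Because every per-model table shares the same row index and the same random_state, each model receives the same prompts in each split, so the test scores stay comparable across models.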
Training data limitations
We encourage you to provide as much data as possible, covering every LLM you want to route between. The minimum number of samples required is 15. There are also upper limits: each training job accepts up to 5 MB of data or 10,000 samples. Reach out if you need support for larger file uploads.
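These limits can be checked locally before submitting a job. The sketch below is a rough pre-flight check under stated assumptions: `check_training_limits` is a hypothetical helper, the sample limits are applied per model table, and size is estimated from the CSV encoding of each table. How the service actually counts samples and bytes may differ.

```python
import pandas as pd

MIN_SAMPLES = 15          # minimum stated above
MAX_SAMPLES = 10_000      # sample cap stated above
MAX_BYTES = 5 * 1024**2   # assumption: "5 MB" measured as CSV-encoded bytes


def check_training_limits(dataset):
    """Return a list of warnings for a {model_name: DataFrame} dataset."""
    warnings = []
    total_bytes = 0
    for name, frame in dataset.items():
        n = len(frame)
        if n < MIN_SAMPLES:
            warnings.append(f"{name}: {n} samples, fewer than {MIN_SAMPLES}")
        if n > MAX_SAMPLES:
            warnings.append(f"{name}: {n} samples, more than {MAX_SAMPLES}")
        # Estimate the upload size from the CSV text of this table.
        total_bytes += len(frame.to_csv(index=False).encode("utf-8"))
    if total_bytes > MAX_BYTES:
        warnings.append(f"dataset is about {total_bytes} bytes, over {MAX_BYTES}")
    return warnings


# Example: a table with only 3 rows triggers the minimum-samples warning.
tiny = {"custom/model": pd.DataFrame({"Input": ["a"] * 3,
                                      "response": ["b"] * 3,
                                      "score": [1] * 3})}
print(check_training_limits(tiny))
```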
Next, we'll use the CustomRouter class in notdiamond.toolkit to train our custom router:
# Initialize the CustomRouter object for training
trainer = CustomRouter(
    language="english",
    maximize=True  # Indicate if higher scores are better (setting to False indicates the opposite)
)

# Train the model using your dataset
preference_id = trainer.fit(
    dataset=pzn_train,           # The dataset containing inputs, responses, and scores
    prompt_column="Input",       # Column name for the input prompts
    response_column="response",  # Column name for the model responses
    score_column="score"         # Column name for the scores
)

print("Custom router preference ID: ", preference_id)
Once we've submitted our evaluation, the fit method will return a preference ID representing our custom router.
Training a custom router can take some time
When you call the fit method, we process your data and train a custom router to fit your needs. This can take up to 60 minutes depending on the size of your dataset. If the training is still in progress and you call Not Diamond using the preference_id returned, you will get an error asking you to wait until it has finished training.
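One way to handle the waiting period is to retry the call until it succeeds. The helper below is a generic sketch, not part of the SDK: `call_when_ready` is a hypothetical name, and it assumes the "still training" error surfaces as a Python exception. Since the exact exception type is not documented here, it catches any exception, which also hides unrelated failures such as a bad API key, so narrow the `except` clause once you know the real error type.

```python
import time


def call_when_ready(make_call, max_wait_s=3600, poll_every_s=60, sleep=time.sleep):
    """Retry make_call() until it succeeds or max_wait_s seconds have been waited.

    max_wait_s defaults to 60 minutes, the upper bound on training time
    mentioned above.
    """
    waited = 0
    while True:
        try:
            return make_call()
        except Exception as exc:  # assumption: "still training" raises an exception
            if waited >= max_wait_s:
                raise TimeoutError("router still not ready") from exc
            sleep(poll_every_s)
            waited += poll_every_s


# Demonstration with a fake call that fails twice before succeeding.
attempts = {"count": 0}


def fake_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("router is still training")
    return "ok"


print(call_when_ready(fake_call, sleep=lambda seconds: None))
```

In an application, `make_call` would wrap the Not Diamond request, for example `lambda: client.chat.completions.create(messages=messages, model=llm_providers, preference_id=preference_id)`.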
Once our custom router has finished training, we can call it in our application by specifying the returned preference ID in our Not Diamond calls:
from notdiamond import NotDiamond

client = NotDiamond()

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    "anthropic/claude-3-5-sonnet-20240620"
]

messages = [
    {"role": "user", "content": "Write merge sort in 3 lines."}
]

result, session_id, provider = client.chat.completions.create(
    messages=messages,
    model=llm_providers,
    preference_id=preference_id,  # Preference ID for our custom router
)

print("ND session ID: ", session_id)
print("LLM called: ", provider.model)
print("LLM output: ", result.content)
That wraps up our quickstart example. In the sections below we'll also walk through how to evaluate our custom router and how to update it over time.
Evaluating our custom router
We can evaluate the performance of the custom router on the test split of our evaluation dataset using the CustomRouter.eval method. It returns two DataFrames:
- eval_results contains the prompts, responses from each LLM, and their corresponding scores. It also contains the response and scores that the custom router achieved for each prompt under the headings notdiamond/response and notdiamond/score, respectively.
- eval_stats provides average statistics for the best performing LLM as well as the average score of the custom router.
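To make the two tables concrete, the sketch below recomputes an eval_stats-style summary by hand from a toy table shaped like eval_results. The model names and scores are made up, and only the `<model>/score` and `notdiamond/score` column naming is taken from the description above. The real eval method may compute its statistics differently.

```python
import pandas as pd

# Toy stand-in for eval_results; model names and scores are made up.
eval_results = pd.DataFrame({
    "model-a/score": [1, 0, 1, 1, 0],
    "model-b/score": [0, 1, 1, 0, 0],
    "notdiamond/score": [1, 1, 1, 1, 0],
})

# Per-model score columns, excluding the router's own column.
score_columns = [
    column for column in eval_results.columns
    if column.endswith("/score") and not column.startswith("notdiamond/")
]
provider_means = eval_results[score_columns].mean()

# Best single model by average score, and the router's average score.
best_provider = provider_means.idxmax()[: -len("/score")]
best_provider_avg = provider_means.max()
router_avg = eval_results["notdiamond/score"].mean()

print(best_provider, best_provider_avg, router_avg)
```

The actual call to CustomRouter.eval looks like this: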
results = trainer.eval(
    dataset=pzn_test,
    prompt_column="Input",
    response_column="response",
    score_column="score",
    preference_id=preference_id
)
eval_results, eval_stats = results
print(eval_results)
"""
prompt openai/gpt-4o/response openai/gpt-4o/score ... notdiamond/response notdiamond/score
0 ...
...
"""
print(eval_stats)
"""
Best Average Provider Best Provider Average Score Not Diamond Average Score
0 ...
"""
Updating a custom router
We can update our custom router at any time by simply submitting more evaluation data to it. When updating an existing custom router, we should include both the original data and the new data we want to use for training as Not Diamond will only take into account the data we submit each time. We should also include the preference ID of our router if we'd like the same preference ID to be used for our updated router.
# Concatenate data for each model from the `train` and `extra` sets
pzn_updated = {}
for model in pzn_train.keys():
    combined_df = pd.concat([pzn_train[model], pzn_extra[model]], ignore_index=True)
    pzn_updated[model] = combined_df

# Use the updated data for custom routing
preference_id = trainer.fit(
    dataset=pzn_updated,
    prompt_column="Input",
    response_column="response",
    score_column="score",
    preference_id=preference_id  # Specify the preference_id associated with the custom router that you want to update
)
Calling the trainer while another job is still running cancels the previous job
If you update a router that's currently training, it will cancel the previous run and start a new run with the updated data you've submitted.
Using a custom model
Training custom routers is not limited to using models that we support. You can also include custom models that you've fine-tuned, or even an entire LLM workflow. Any arbitrary inference endpoint may be specified. Just include its evaluation results in the training data and you'll be able to use it in your routing decisions.
To define a custom model in your training, simply use the notdiamond.llms.config.LLMConfig class. You need to define the context_length, input_price, output_price, and latency, and set is_custom=True so that we can recommend the model appropriately.
The custom provider/model string needs to be unique
When defining a custom model, make sure the <provider>/<model> string is unique. It cannot be the same as a model that is supported or another custom model.
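A quick local check can catch clashes before training. The helper below is a hypothetical sketch, not part of the SDK, and the list of supported model strings is something you would supply yourself.

```python
def find_duplicate_model_ids(supported_ids, custom_ids):
    """Return custom IDs that clash with a supported model or with each other."""
    seen = set(supported_ids)
    duplicates = []
    for model_id in custom_ids:
        if model_id in seen:
            duplicates.append(model_id)
        seen.add(model_id)
    return duplicates


supported = ["openai/gpt-4o-2024-05-13", "anthropic/claude-3-opus-20240229"]
custom = ["custom/model", "custom/model", "openai/gpt-4o-2024-05-13"]
print(find_duplicate_model_ids(supported, custom))
```

An empty list means every custom `<provider>/<model>` string is unique.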
from notdiamond.llms.config import LLMConfig

custom_model = LLMConfig(
    provider="custom",
    model="model",
    is_custom=True,
    context_length=200000,
    input_price=0.1,   # USD per million tokens
    output_price=0.2,  # USD per million tokens
    latency=0.01       # time to first token (seconds)
)

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    custom_model
]

pzn_train = {}
pzn_test = {}
pzn_extra = {}

for provider in llm_providers:
    # The rest of the script follows from the example above
    # Just make sure you have the evaluation results for custom_model
    ...  # placeholder so the loop body is valid Python
You can then call the router once it's done training, making sure to include your custom model config in the list of providers:
Custom models can only use model_select
Custom routers that have custom models can only use model_select since the NotDiamond client does not have built-in support for calling arbitrary models. Given the recommended model from model_select, you can implement your own invoking logic for the custom model.
from notdiamond import NotDiamond
from notdiamond.llms.config import LLMConfig

client = NotDiamond()

custom_model = LLMConfig(
    provider="custom",
    model="model",
    is_custom=True
)

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    custom_model  # Adding our custom model to the list
]

messages = [
    {"role": "user", "content": "Write merge sort in 3 lines."}
]

session_id, provider = client.chat.completions.model_select(
    messages=messages,
    model=llm_providers,
    preference_id=preference_id,  # Preference ID for our custom router
)

print("ND session ID: ", session_id)
print("LLM recommended: ", provider.model)
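The invoking logic itself is left to us. Below is a minimal sketch of one way to dispatch on the recommendation. Everything in it is an assumption for illustration: `RecommendedModel` is a stand-in for the object returned by model_select (assumed to expose `provider` and `model` attributes), and `call_custom_model` and `call_supported_model` are hypothetical placeholders for your own endpoint and provider SDK calls.

```python
from dataclasses import dataclass


@dataclass
class RecommendedModel:
    """Stand-in for the object returned by model_select (assumed attributes)."""
    provider: str
    model: str


def call_custom_model(messages):
    # Hypothetical: replace with a request to your own inference endpoint.
    return f"[custom model reply to: {messages[-1]['content']}]"


def call_supported_model(recommended, messages):
    # Hypothetical: replace with the provider SDK call for the chosen model.
    return f"[{recommended.provider}/{recommended.model} reply]"


def invoke(recommended, messages):
    """Dispatch to our own endpoint when the router picks the custom model."""
    if (recommended.provider, recommended.model) == ("custom", "model"):
        return call_custom_model(messages)
    return call_supported_model(recommended, messages)


messages = [{"role": "user", "content": "Write merge sort in 3 lines."}]
print(invoke(RecommendedModel("custom", "model"), messages))
print(invoke(RecommendedModel("openai", "gpt-4o-2024-05-13"), messages))
```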