Training a custom router
We can use Not Diamond to train our own custom routers based on our evaluation data.
For any given distribution of data, rarely will one single model outperform every other model on every single query. By combining multiple models into a "meta-model" that learns when to call each LLM, we can beat every individual model's performance and even drive down costs and latency in the process.
Not Diamond integrates with any existing evaluation pipeline and is completely agnostic to our choice of evaluation methods, metrics, frameworks, and tools. All we need is the following three things, illustrated in the sketch after this list:
- A set of LLM prompts: Prompts must be strings and should be representative of the prompts used in our application.
- LLM responses: The responses from candidate LLMs for each input. Candidate LLMs can include both our supported LLMs and your own custom models.
- Evaluation scores for each candidate LLM's response to each input: Scores must be numbers and can reflect any metric that fits your needs.
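To make the expected shape concrete, here is a minimal sketch with hypothetical toy data showing how these three pieces line up in a single table (the column naming mirrors the dataset used below):
import pandas as pd

# Hypothetical toy row illustrating the expected structure: one prompt,
# plus a response and a numeric score for each candidate LLM.
toy_df = pd.DataFrame({
    "Input": ["Reverse a string in Python."],
    "openai/gpt-4o-2024-05-13/response": ["def reverse(s): return s[::-1]"],
    "openai/gpt-4o-2024-05-13/final_score": [1.0],
    "anthropic/claude-3-opus-20240229/response": ["s[::-1]"],
    "anthropic/claude-3-opus-20240229/final_score": [0.5],
})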
Below, we will go through a Python example using evaluation results for openai/gpt-4o-2024-05-13, openai/gpt-4-turbo-2024-04-09, google/gemini-1.5-pro-latest, anthropic/claude-3-opus-20240229, and anthropic/claude-3-5-sonnet-20240620 on the HumanEval dataset.
You can follow along with the code below, or try it in Colab.
Initialization
To get started, let's download the dataset that we've prepared for this example:
curl -L "https://drive.google.com/uc?export=download&id=1q1zNZHioy9B7M-WRjsJPkfvFosfaHX38" -o humaneval.csv
Next, we'll create a train.py file and import pandas and notdiamond.toolkit, which we'll use to train our custom router:
import pandas as pd
from notdiamond.toolkit import CustomRouter
Training quickstart
We'll begin by loading the dataset into a pandas.DataFrame:
df = pd.read_csv("humaneval.csv")
print(df.columns)
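Based on the column names used in the splitting code below, the printout should look roughly like this (abbreviated illustration, not verbatim output):
# Index(['Input',
#        'openai/gpt-4o-2024-05-13/response',
#        'openai/gpt-4o-2024-05-13/final_score',
#        ...],
#       dtype='object')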
Next we'll separate the data into individual LLM datasets. We will also define a test split so that we can later evaluate our custom router's performance, as well as an extra split to show how we can update our router with additional data.
llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    "anthropic/claude-3-5-sonnet-20240620"
]

pzn_train = {}
pzn_test = {}
pzn_extra = {}
for provider in llm_providers:
    provider_results = df.filter(
        ["Input", f"{provider}/response", f"{provider}/final_score"], axis=1
    )
    provider_results.rename(
        columns={
            f"{provider}/response": "response",
            f"{provider}/final_score": "score"
        },
        inplace=True
    )

    # Create train/test/extra splits
    train = provider_results.sample(frac=0.8, random_state=420)
    remainder = provider_results.drop(train.index)
    test = remainder.sample(frac=0.9, random_state=420)
    extra = remainder.drop(test.index)

    pzn_train[provider] = train
    pzn_test[provider] = test
    pzn_extra[provider] = extra
Training data limitations
We encourage you to provide as much data as possible, covering as many of the LLMs you want to route between as you can. The minimum number of samples required is 15. However, we have some limits on how much data you can submit: you can upload up to 5MB of data or 10,000 samples per training job. Reach out if you need support for larger file uploads.
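Before submitting a training job, it can help to sanity-check the splits against these limits. A minimal sketch using the pzn_train dict from above (the docs don't state whether the 15-sample minimum applies per LLM or in total, so this checks per LLM to be safe):
# Optional sanity check against the stated limits:
# at least 15 samples, at most 10,000 samples per training job.
for provider, split in pzn_train.items():
    n = len(split)
    assert 15 <= n <= 10_000, f"{provider}: {n} samples is outside the 15-10,000 range"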
Next, we'll use the CustomRouter class in notdiamond.toolkit to train our custom router:
# Initialize the CustomRouter object for training
trainer = CustomRouter(
    language="english",
    maximize=True  # Indicate if higher scores are better (False indicates the opposite)
)

# Train the model using your dataset
preference_id = trainer.fit(
    dataset=pzn_train,           # The dataset containing inputs, responses, and scores
    prompt_column="Input",       # Column name for the input prompts
    response_column="response",  # Column name for the model responses
    score_column="score"         # Column name for the scores
)
print("Custom router preference ID: ", preference_id)
Once we've submitted our evaluation data, the fit method will return a preference ID representing our custom router.
Training a custom router can take some time
When you call the fit method, we process your data and train a custom router to fit your needs. This can take up to 60 minutes depending on the size of your dataset. If training is still in progress and you call Not Diamond using the returned preference_id, you will get an error asking you to wait until training has finished.
Once our custom router has finished training, we can call it in our application by specifying the returned preference ID in our Not Diamond calls:
from notdiamond import NotDiamond
client = NotDiamond()
llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    "anthropic/claude-3-5-sonnet-20240620"
]

messages = [
    {"role": "user", "content": "Write merge sort in 3 lines."}
]

result, session_id, provider = client.chat.completions.create(
    messages=messages,
    model=llm_providers,
    preference_id=preference_id,  # Preference ID for our custom router
)
print("ND session ID: ", session_id)
print("LLM called: ", provider.model)
print("LLM output: ", result.content)
That wraps up our quickstart example. In the sections below we'll also walk through how to evaluate our custom router and how to update it over time.
Evaluating our custom router
We can evaluate the custom router's performance on the test split of our evaluation dataset using the CustomRouter.eval method, which returns two DataFrames:
- eval_results contains the prompts, the responses from each LLM, and their corresponding scores. It also contains the response and score that the custom router achieved for each prompt, under the headings notdiamond/response and notdiamond/score, respectively.
- eval_stats provides the average score of the best-performing LLM as well as the average score of the custom router.
results = trainer.eval(
    dataset=pzn_test,
    prompt_column="Input",
    response_column="response",
    score_column="score",
    preference_id=preference_id
)

eval_results, eval_stats = results
print(eval_results)
"""
prompt openai/gpt-4o/response openai/gpt-4o/score ... notdiamond/response notdiamond/score
0 ...
...
"""
print(eval_stats)
"""
Best Average Provider Best Provider Average Score Not Diamond Average Score
0 ...
"""
Updating a custom router
We can update our custom router at any time by simply submitting more evaluation data. When updating an existing custom router, we should include both the original data and the new data we want to train on, since Not Diamond only takes into account the data submitted in each job. We should also pass our router's preference ID if we'd like the updated router to keep the same preference ID.
# Concatenate data for each model from the `train` and `extra` sets
pzn_updated = {}
for model in pzn_train.keys():
    combined_df = pd.concat([pzn_train[model], pzn_extra[model]], ignore_index=True)
    pzn_updated[model] = combined_df

# Use the updated data for custom routing
preference_id = trainer.fit(
    dataset=pzn_updated,
    prompt_column="Input",
    response_column="response",
    score_column="score",
    preference_id=preference_id  # The preference_id of the custom router you want to update
)
Calling the trainer while another job is still running cancels the previous job
If you update a router that's currently training, it will cancel the previous run and start a new run with the updated data you've submitted.
Using a custom model
Training custom routers is not limited to using models that we support. You can also include custom models that you've fine-tuned, or even an entire LLM workflow. Any arbitrary inference endpoint may be specified. Just include its evaluation results in the training data and you'll be able to use it in your routing decisions.
To define a custom model in your training, simply use the notdiamond.llms.config.LLMConfig class. You need to define context_length, input_price, output_price, and latency, and set is_custom=True so that we can recommend the model appropriately.
The custom provider/model string needs to be unique
When defining a custom model, make sure the <provider>/<model> string is unique. It cannot be the same as a supported model or another custom model.
from notdiamond.llms.config import LLMConfig
custom_model = LLMConfig(
    provider="custom",
    model="model",
    is_custom=True,
    context_length=200000,
    input_price=0.1,   # USD per million tokens
    output_price=0.2,  # USD per million tokens
    latency=0.01       # Time to first token (seconds)
)

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    custom_model
]
pzn_train = {}
pzn_test = {}
pzn_extra = {}
for provider in llm_providers:
    # The rest of the script follows from the example above.
    # Just make sure you have the evaluation results for custom_model.
    ...
You can then call the router once it's done training, making sure to include your custom model config in the list of providers.
Custom models can only use model_select
Custom routers that include custom models can only use model_select, since the NotDiamond client does not have built-in support for calling arbitrary models. Given the recommended model from model_select, you can implement your own invocation logic for the custom model.
from notdiamond import NotDiamond
from notdiamond.llms.config import LLMConfig

client = NotDiamond()

custom_model = LLMConfig(
    provider="custom",
    model="model",
    is_custom=True
)

llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    custom_model  # Adding our custom model to the list
]

messages = [
    {"role": "user", "content": "Write merge sort in 3 lines."}
]

session_id, provider = client.chat.completions.model_select(
    messages=messages,
    model=llm_providers,
    preference_id=preference_id,  # Preference ID for our custom router
)
print("ND session ID: ", session_id)
print("LLM called: ", provider.model)