Reliability, fallbacks, and load-balancing

In this section we will learn how to use Not Diamond to improve the reliability and uptime of our application through the following methods:

  1. Falling back to a default model if Not Diamond fails to return a response
  2. Defining custom fallback logic for our router
  3. Leveraging Not Diamond's reliability and load-balancing toolkit for OpenAI clients

Falling back to a default model if Not Diamond fails to return a response

Because Not Diamond is not a proxy, we can eliminate the risk of disruptions if Not Diamond ever fails to return a response. We can define a timeout specifying how many seconds to wait for a model recommendation from Not Diamond's API, and we can configure a fallback model to use by default in case of an error or timeout. The default parameter is a string naming the specific model from our list of candidate models that we want to use as the fallback.

result, session_id, provider = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Concisely explain merge sort."}  # Adjust as desired
    ],
    model=['openai/gpt-3.5-turbo', 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet-20240620'],
    timeout=5,
    default="openai/gpt-4o"
)
const result = await notDiamond.create({
  messages: [
    { role: 'system', content: 'You are a world class programmer.' },
    { role: 'user', content: 'Concisely explain merge sort.' }  // Adjust as desired
  ],
  llmProviders: [
    { provider: 'openai', model: 'gpt-3.5-turbo' },
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet-20240620' }
  ],
  timeout: 5,
  default: 'openai/gpt-4o'
});

The default value for timeout is 5 seconds. If no default LLM is defined, Not Diamond will automatically consider the first LLM specified in your list as the default model.
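
If we omit default entirely, the first model in our list serves as the fallback. A minimal sketch, assuming the same client and messages as above:

# No `default` set: the first listed model ('openai/gpt-3.5-turbo') serves as
# the fallback if Not Diamond errors out or the 5-second timeout elapses.
result, session_id, provider = client.chat.completions.create(
    messages=[{"role": "user", "content": "Concisely explain merge sort."}],
    model=['openai/gpt-3.5-turbo', 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet-20240620'],
    timeout=5,
)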

Custom routing fallback logic

If we want to use custom logic to define fallbacks for requests to specific LLMs, we can call Not Diamond's model_select method to determine the best LLM for a prompt, then implement our own API call and fallback behavior around it.

from notdiamond import NotDiamond
from openai import OpenAI

client = NotDiamond()  # reads NOTDIAMOND_API_KEY from the environment
openai_client = OpenAI(api_key="OPENAI_API_KEY")

# Ask Not Diamond which LLM to call; model_select does not call the LLM for us
session_id, provider = client.chat.completions.model_select(
    messages=[
        {"role": "system", "content": "You are a world class programmer."},
        {"role": "user", "content": "Write a merge sort in Python. Be as concise as possible."},
    ],
    model=['openai/gpt-3.5-turbo', 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet-20240620']
)

max_retries = 3

# Call the recommended LLM ourselves, retrying on any failure
if provider.model == "gpt-3.5-turbo":
    for _ in range(max_retries):
        try:
            chat_completion = openai_client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": "Write a merge sort in Python. Be as concise as possible.",
                    }
                ],
                model="gpt-3.5-turbo",
            )
            print(chat_completion.choices[0].message.content)
            break
        except Exception:
            continue
import { NotDiamond } from 'notdiamond';
import { OpenAI } from 'openai';
import dotenv from 'dotenv';
dotenv.config();

// Initialize the Not Diamond client
const notDiamond = new NotDiamond({ apiKey: process.env.NOTDIAMOND_API_KEY });

// The best LLM is determined by Not Diamond based on the messages and specified models
const result = await notDiamond.modelSelect({
  messages: [
    { role: 'system', content: 'You are a world class programmer.' },
    { role: 'user', content: 'Concisely explain merge sort.' }  // Adjust as desired
  ],
  llmProviders: [
    { provider: 'openai', model: 'gpt-3.5-turbo' },
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet-20240620' }
  ],
  tradeoff: 'cost'
});

if ('detail' in result) {
  console.error('Error:', result.detail);
} else {
  console.log('Not Diamond session ID:', result.session_id);  // A unique ID of Not Diamond's recommendation
  console.log('LLM called:', result.providers);  // The LLM routed to

  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const maxRetries = 3;
  const provider = result.providers[0];

  let finalResult = null;

  // Call the recommended LLM ourselves, retrying on any failure
  if (provider.model === 'gpt-3.5-turbo') {
    for (let i = 0; i < maxRetries; i++) {
      try {
        const completion = await openai.chat.completions.create({
          messages: [
            { role: 'user', content: 'Write a merge sort in Python. Be as concise as possible.' }
          ],
          model: 'gpt-3.5-turbo',
        });
        finalResult = completion.choices[0];
        console.log('Response:', finalResult);
        break;
      } catch {
        continue;
      }
    }
  }
}

Reliability toolkit with notdiamond.init

Model providers may experience outages, return errors, or struggle to serve requests at the throughput we require. To help avoid downtime in our applications and effectively load-balance, Not Diamond offers a reliability toolkit, notdiamond.init, which can be enabled with a single statement.
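
In its simplest form, this can look like the following sketch (assuming an existing OpenAI client; installation and full usage are covered below):

# A sketch: one line wraps an existing client with retries and fallbacks.
init(client=openai_client, models=["openai/gpt-4o-mini", "openai/gpt-4o"], max_retries=2)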

📘 More providers coming soon

At this time, the reliability toolkit is only available in our Python SDK and is compatible with workflows that use OpenAI or AzureOpenAI clients (or their async versions).

If you would like to request support for our TypeScript SDK or other providers, please reach out to us and we'll work to accommodate your request.

Installation

Start by installing notdiamond alongside the openai extra:

pip install 'notdiamond[openai]'

If you have already installed notdiamond, please ensure you're using 0.3.34 or greater:

pyenv activate notdiamond-python
poetry show notdiamond  # should show version 0.3.34 or greater
poetry show openai  # should show openai is installed

Usage

from notdiamond import init
from openai import OpenAI, AzureOpenAI

openai_client = OpenAI()
azure_client = AzureOpenAI()

init(
  client=[openai_client, azure_client],
  models=["azure/gpt-4o-mini", "openai/gpt-4o-mini", "azure/gpt-4o"],
  max_retries={
    "azure/gpt-4o-mini": 3,
    "openai/gpt-4o-mini": 1,
    "azure/gpt-4o": 1
  },
  timeout={
    "azure/gpt-4o-mini": 5.0,
    "openai/gpt-4o-mini": 5.0,
    "azure/gpt-4o": 10.0
  },
  model_messages={
    "azure/gpt-4o-mini": [{"role": "user", "content": "Respond to the question."}],
    "openai/gpt-4o-mini": [{"role": "user", "content": "Respond to the question."}],
    "azure/gpt-4o": [{"role": "user", "content": "Respond to the question as concisely as possible."}]
  },
  backoff={
    "azure/gpt-4o-mini": 1.0,
    "openai/gpt-4o-mini": 2.0,
    "azure/gpt-4o": 1.5
  },
)

Let's walk through the keyword arguments of init:

  • client is either an OpenAI client or an iterable of clients
  • models defines the order in which to fall back to other models when any invocation fails
  • max_retries can be configured per-model (as shown above) or globally, using a single int (see the sketch after this list)
  • timeout can be configured per-model (similar to max_retries) or globally, using a float
  • model_messages accepts a map from model name to OpenAI-style messages, which will be appended to the messages of any request init routes to that model
  • backoff sets the delay for exponential backoff between retried requests, globally or per-model
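
For instance, here is a minimal sketch of the global form, reusing the clients from above; scalar values apply to every listed model:

# A sketch of global configuration: scalar values apply to every model in `models`.
init(
  client=[openai_client, azure_client],
  models=["azure/gpt-4o-mini", "openai/gpt-4o-mini", "azure/gpt-4o"],
  max_retries=2,  # up to 2 retries for every model
  timeout=10.0,   # 10-second timeout for every invocation
  backoff=2.0,    # exponential backoff delay between retries
)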

Load balancing

We can also optionally configure init to load balance across various models and providers:

init(
  client=[openai_client, azure_client],
  models={
    "azure/gpt-4o-mini": 0.4,
    "openai/gpt-4o-mini": 0.4,
    "azure/gpt-4o": 0.2
  },
)

If the call to azure/gpt-4o fails, we will load balance fallback requests across azure/gpt-4o-mini and openai/gpt-4o-mini with equal probability. Of course, init will ignore the failed model (azure/gpt-4o) when load balancing.
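
To make the arithmetic concrete, here is an illustrative plain-Python computation (not part of the SDK) showing how the remaining weights rescale once azure/gpt-4o is excluded:

# Illustration only: renormalizing load-balancing weights after azure/gpt-4o fails.
weights = {"azure/gpt-4o-mini": 0.4, "openai/gpt-4o-mini": 0.4, "azure/gpt-4o": 0.2}
remaining = {m: w for m, w in weights.items() if m != "azure/gpt-4o"}
total = sum(remaining.values())  # 0.8
print({m: w / total for m, w in remaining.items()})
# {'azure/gpt-4o-mini': 0.5, 'openai/gpt-4o-mini': 0.5}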

notdiamond.init example

Imagine you have this simple workflow. It first prompts GPT-4o mini hosted on Microsoft Azure, then performs some other operations, and finishes by prompting GPT-4o mini hosted by OpenAI.

We'll introduce one wrinkle: our Azure client has an incorrect API key.

openai_client = OpenAI()
flaky_azure_client = AzureOpenAI(api_key="incorrect-api-key")

flaky_azure_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello there flaky client. Are you working?"}],
)

When attempting to execute this workflow, we will see a 401 authorization error:

openai.AuthenticationError: Error code: 401 -
{
    'statusCode': 401,
    'message': 'Unauthorized. Access token is missing, invalid, audience is incorrect (https://cognitiveservices.azure.com), or have expired.'
}

We could add error-handling to each LLM invocation in our application, but that introduces significant amounts of boilerplate to an otherwise-simple workflow. Instead, let's use notdiamond.init:

openai_client = OpenAI()
flaky_azure_client = AzureOpenAI(api_key="incorrect-api-key")

init(
  client=[openai_client, flaky_azure_client],
  models=["azure/gpt-4o-mini", "openai/gpt-4o-mini"],
  max_retries={
    'azure/gpt-4o-mini': 3,
    'openai/gpt-4o-mini': 1,
  }
)

print(
    "Azure fallback response: " +
    flaky_azure_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello there flaky client. Are you working?"}],
    ).choices[0].message.content
)

This workflow will now recover from the 401 by invoking openai/gpt-4o-mini:

notdiamond.toolkit._retry._RetryWrapperException: Failed to invoke ['azure/gpt-4o-mini']:
openai.AuthenticationError: Error code: 401 -
{
    'statusCode': 401,
    'message': 'Unauthorized. Access token is missing, invalid, audience is incorrect (https://cognitiveservices.azure.com), or have expired.'
}

Azure fallback response: Hello! Yes, I'm here and ready to assist you. How can I help you today?

We've now successfully mitigated the risk of downtime in our application. For more information about notdiamond.init, please see the API reference.