Cost and latency tradeoffs

By default, Not Diamond maximizes quality above all else. However, we can also define explicit cost and latency tradeoffs as a parameter to optimize for speed or cost-savings:

result, session_id, provider = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Concisely explain merge sort."}
    ],
    model=['openai/gpt-4o', 'openai/gpt-3.5-turbo'],
    tradeoff="cost"
)
const result = await notDiamond.create({
  messages: [
    { role: 'system', content: 'You are a world class programmer.' },
    { role: 'user', content: 'Consiely explain merge sort.' },
  ],
  llmProviders: llmProviders,
  tradeoff: 'cost', // Consider cheaper models when quality loss is negligible
});

When tradeoff="cost", Not Diamond will automatically determine when a query is simple enough to use a cheaper model without degrading the quality of the response. Equivalently, tradeoff="latency" will consider faster models. When no tradeoff is defined, Not Diamond will simply maximize output quality with no consideration given to cost or latency.

You can also define custom cost and latency attributes for specific models to help inform how tradeoffs are made

📘

More fine-grained tradeoffs available

For ease of use, our SDKs offer binary cost and latency tradeoff parameters that conservatively maintain high response quality. If you need to define more fine-grained Pareto optimal cost or latency tradeoffs, reach out to us and we can support your needs.