Evaluation metrics

Our prompt adaptation tool optimizes prompts against one of several supported evaluation metrics, specified via the evaluation_metric (or evaluation_config) parameter.

LLM-as-a-judge metrics

Semantic similarity

  • "LLMaaJ:Sem_Sim_1": Uses an LLM-as-a-judge to evaluate the semantic similarity of the model response relative to the reference answer. Outputs a binary score (0 or 1) based on whether the answers are semantically similar. This is the default metric.

The prompt for this metric can be found below. By default, we use openai/gpt-4o-2024-08-06 as the judge.

Given the predicted answer and reference answer, compare them and check whether they mean the same.

Following are the given answers:

Predicted Answer: {predicted_answer}
Reference Answer: {gt_answer}

On a NEW LINE, give a score of 1 if the answers mean the same, and 0 if they differ, and nothing more.
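
For illustration, here is a minimal sketch of how this judge prompt could be applied and its binary score parsed, assuming the OpenAI Python SDK; the hosted implementation may differ in details such as parsing and retries.

from openai import OpenAI

judge_client = OpenAI()  # illustrative judge client; assumes OPENAI_API_KEY is set

JUDGE_PROMPT = (
    "Given the predicted answer and reference answer, compare them and check whether they mean the same.\n\n"
    "Following are the given answers:\n\n"
    "Predicted Answer: {predicted_answer}\n"
    "Reference Answer: {gt_answer}\n\n"
    "On a NEW LINE, give a score of 1 if the answers mean the same, and 0 if they differ, and nothing more."
)

def judge_semantic_similarity(predicted_answer: str, gt_answer: str) -> int:
    completion = judge_client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(predicted_answer=predicted_answer, gt_answer=gt_answer),
        }],
    )
    # The judge is instructed to end its reply with a bare 0 or 1 on its own line.
    return int(completion.choices[0].message.content.strip().splitlines()[-1])

print(judge_semantic_similarity("Paris is the capital of France.", "France's capital is Paris."))  # expected: 1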

📓 Example notebook: nd_llmaaj.ipynb

RAG-specific metrics

For retrieval-augmented generation (RAG) applications, Not Diamond supports several Ragas metrics:

Faithfulness

  • "RAGAS_FAITHFULNESS": Measures how faithful a response is to the provided context. Ensures the LLM doesn't hallucinate or introduce information not present in the retrieved context. When using this metric, ensure that you use "question" as the field value for user input, and "context" as the field value for retrieved context. Currently, we only support 1 string for context. If you have retrieved multiple contexts, you can concatenate them into a single string. See following example
fields = ["question", "context"]

pa_ds = [
    {
        "fields": {
            "question": sample["question"],  # the user input (key names depend on your dataset)
            "context": sample["context"],    # the retrieved context, as a single string
        },
        "answer": sample["answer"],          # reference answer; not used by RAGAS_FAITHFULNESS
    }
    for sample in ds
]
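
If your retriever returned several passages, join them into the single string that the "context" field expects, for example with newline separators:

# Join multiple retrieved passages into one context string before building the dataset.
passages = ["First retrieved passage.", "Second retrieved passage."]
context_string = "\n\n".join(passages)  # pass this as the "context" field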

Learn more: Ragas Faithfulness Documentation

📓 Example notebook: nd_ragas_faithfulness.ipynb

Answer Relevance

  • "RAGAS_RELEVANCE": Evaluates how relevant a response is to the given question. Penalizes responses that are incomplete or contain unnecessary information.

Learn more: Ragas Answer Relevance Documentation

📓 Example notebook: nd_ragas_answerrelevance.ipynb

Custom LLM-as-a-judge metrics

If you have a custom LLM-as-a-judge metric you would like to use, you can specify an "evaluation_config" instead of an "evaluation_metric". The "evaluation_config" should consist of the following:

  • llm_judging_prompt: The custom prompt for the LLM judge. The prompt must contain a {question} field and an {answer} field. The {question} field is used to insert the formatted prompts from your dataset, and the {answer} field is used to insert the LLM's response to the question.
  • llm_judge: The LLM judge to use for evaluation, in "provider/model" format. The list of supported LLMs can be found in Prompt Adaptation Models.
  • correctness_cutoff: The cutoff score above which the response is deemed correct. For example, if the judging score is from 1 to 10, you might set the cutoff at 8.

An example is shown below:

evaluation_config = {
    "llm_judging_prompt": (
        "Does the assistant's answer properly answer the user's question? question: {question} answer: {answer} "
        "Score a 1 if the answer is correct, 0 otherwise. Do not output any other values or text - only the score."
    ),
    "llm_judge": "openai/gpt-5-2025-08-07",
    "correctness_cutoff": 0,
}
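
To use the custom judge, pass this config to the adaptation request in place of the evaluation_metric argument. A minimal sketch, assuming the same client and input variables as in the "Using evaluation metrics" example below:

response = client.prompt_adaptation.adapt(
    system_prompt=system_prompt,
    template=prompt_template,
    fields=fields,
    goldens=golden_data,
    origin_model=origin_model,
    target_models=target_models,
    evaluation_config=evaluation_config,  # replaces evaluation_metric
)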

📓 Example notebook: nd_custom_llmaaj.ipynb

Computed metrics

JSON Match

  • "JSON_Match": Determines if the LLM's JSON output matches the golden JSON output. Computes precision, recall, and f1 scores based on individual fields in the JSON, averaged across all samples. Useful for applications requiring structured outputs.

📓 Example notebook: nd_json_match.ipynb

Exact Match

  • "EXACT_MATCH": Checks if the response is exactly the same as the reference text, returning 1 for an exact match and 0 otherwise.

📓 Example notebook: nd_exactmatch.ipynb

BLEU

  • "BLEU": Measures similarity between response and reference based on n-gram precision and brevity penalty. Originally designed for machine translation evaluation, BLEU scores range from 0 to 1, where 1 indicates a perfect match.

Learn more: Ragas BLEU Score Documentation

📓 Example notebook: nd_ragas_bleu.ipynb

ROUGE

  • "ROUGE": Evaluates natural language generation quality by measuring n-gram overlap between response and reference. Can be configured for different modes (rouge1, rougeL) and measures (precision, recall, F1). Scores range from 0 to 1.

Learn more: Ragas ROUGE Score Documentation

📓 Example notebook: nd_ragas_rouge.ipynb

METEOR

  • "METEOR": Evaluates response quality using a combination of precision, recall, and synonym matching. Better handles morphological variations and paraphrasing compared to BLEU. Particularly useful for languages with rich morphology.

Learn more: HuggingFace METEOR Documentation

📓 Example notebook: nd_meteor.ipynb

Using evaluation metrics

Here's how to specify different evaluation metrics when requesting prompt adaptation:

Set your API key

export NOTDIAMOND_API_KEY=YOUR_NOTDIAMOND_API_KEY

Python

import os

from notdiamond import NotDiamond

client = NotDiamond(api_key=os.environ.get("NOTDIAMOND_API_KEY"))

response = client.prompt_adaptation.adapt(
    system_prompt=system_prompt,
    template=prompt_template,
    fields=fields,
    goldens=golden_data,
    origin_model=origin_model,
    target_models=target_models,
    evaluation_metric="EXACT_MATCH"  # Use exact match for classification tasks
)

TypeScript

import NotDiamond from 'notdiamond';

const client = new NotDiamond({api_key: process.env.NOTDIAMOND_API_KEY});

const response = await client.promptAdaptation.adapt({
  systemPrompt: systemPrompt,
  template: promptTemplate,
  fields: fields,
  goldens: goldenData,
  originModel: originModel,
  targetModels: targetModels,
  evaluationMetric: 'EXACT_MATCH'  // Use exact match for classification tasks
});