Evaluation metrics
Our prompt adaptation tool optimizes prompts against one of several supported evaluation metrics.
LLM-as-a-judge metrics
Semantic similarity
"LLMaaJ:Sem_Sim_1": Uses an LLM-as-a-judge to evaluate the semantic similarity of the model response relative to the reference answer. Outputs a binary score (0 or 1) based on whether the answers are semantically similar. This is the default metric.
The prompt for this metric can be found below. By default, we use openai/gpt-4o-2024-08-06 as the judge.
Given the predicted answer and reference answer, compare them and check whether they mean the same.
Following are the given answers:
Predicted Answer: {predicted_answer}
Reference Answer: {gt_answer}
On a NEW LINE, give a score of 1 if the answers mean the same, and 0 if they differ, and nothing more.

📓 Example notebook: nd_llmaaj.ipynb
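For illustration only, the sketch below shows how the {predicted_answer} and {gt_answer} placeholders in this prompt get filled for a single sample before it is sent to the judge; the example answers are made up and this is not SDK code.

# Illustration only (not SDK internals): filling the judge prompt's placeholders
# for one sample before sending it to the judge model.
JUDGE_PROMPT = (
    "Given the predicted answer and reference answer, compare them and check whether they mean the same.\n"
    "Following are the given answers:\n"
    "Predicted Answer: {predicted_answer}\n"
    "Reference Answer: {gt_answer}\n"
    "On a NEW LINE, give a score of 1 if the answers mean the same, and 0 if they differ, and nothing more."
)

filled_prompt = JUDGE_PROMPT.format(
    predicted_answer="Paris is the capital of France.",
    gt_answer="The capital of France is Paris.",
)
# The judge (openai/gpt-4o-2024-08-06 by default) should answer "1" for this pair.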
RAG-specific metrics
For retrieval-augmented generation (RAG) applications, Not Diamond supports several Ragas metrics:
Faithfulness
"RAGAS_FAITHFULNESS": Measures how faithful a response is to the provided context. Ensures the LLM doesn't hallucinate or introduce information not present in the retrieved context. When using this metric, ensure that you use"question"as the field value for user input, and"context"as the field value for retrieved context. Currently, we only support 1 string for context. If you have retrieved multiple contexts, you can concatenate them into a single string. See following example
fields = ["question", "context"]
pa_ds = [
{
"fields": {
"question": "User input",
"context": "Context retrieved based on user input"
},
"answer": "Reference answer. This is not used in RAGAS_FAITHFULNESS"
}
for sample in ds
]Learn more: Ragas Faithfulness Documentation
📓 Example notebook: nd_ragas_faithfulness.ipynb
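If your retriever returns several chunks, you can join them into the single context string this metric expects before building your dataset. The helper and the "\n\n" separator below are illustrative choices, not part of the SDK, and assume each sample in ds carries question, retrieved_chunks, and answer keys.

# Hypothetical helper: join several retrieved chunks into the single context
# string that RAGAS_FAITHFULNESS expects. The "\n\n" separator is an arbitrary choice.
def build_faithfulness_sample(question, retrieved_chunks, reference_answer):
    return {
        "fields": {
            "question": question,
            "context": "\n\n".join(retrieved_chunks),
        },
        "answer": reference_answer,  # not used by RAGAS_FAITHFULNESS
    }

pa_ds = [
    build_faithfulness_sample(
        sample["question"], sample["retrieved_chunks"], sample["answer"]
    )
    for sample in ds  # assumes each sample carries these keys
]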
Answer Relevance
"RAGAS_RELEVANCE": Evaluates how relevant a response is to the given question. Penalizes responses that are incomplete or contain unnecessary information.
Learn more: Ragas Answer Relevance Documentation
📓 Example notebook: nd_ragas_answerrelevance.ipynb
Custom LLM-as-a-judge metrics
If you have a custom LLM-as-a-judge metric you would like to use, you can specify an "evaluation_config" instead of an "evaluation_metric". The "evaluation_config" should consist of the following:
llm_judging_prompt: The custom prompt for the LLM judge. The prompt must contain a {question} and an {answer} field. The {question} field is used to insert the formatted prompts from your dataset, and the {answer} field is used to insert the LLM's response to the question.
llm_judge: The LLM judge to use for evaluation, in "provider/model" format. The list of supported LLMs can be found in Prompt Adaptation Models.
correctness_cutoff: The cutoff score above which the response is deemed correct. For example, if the judging score is from 1 to 10, you might set the cutoff at 8.
An example of this is shown below:
evaluation_config = {
    "llm_judging_prompt": (
        "Does the assistant's answer properly answer the user's question? question: {question} answer: {answer} "
        "Score a 1 if the answer is correct, 0 otherwise. Do not output any other values or text - only the score."
    ),
    "llm_judge": "openai/gpt-5-2025-08-07",
    "correctness_cutoff": 0,
}

📓 Example notebook: nd_custom_llmaaj.ipynb
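To use your custom judge, pass the config to the adaptation request in place of an evaluation metric. The sketch below assumes the Python SDK's prompt_adaptation.adapt accepts evaluation_config as a keyword argument, mirroring the evaluation_metric usage shown later on this page; note that with the binary 0/1 judging prompt above, a correctness_cutoff of 0 counts a score of 1 as correct.

import os
from notdiamond import NotDiamond

client = NotDiamond(api_key=os.environ.get("NOTDIAMOND_API_KEY"))

# Sketch only: evaluation_config replaces evaluation_metric when using a custom judge.
response = client.prompt_adaptation.adapt(
    system_prompt=system_prompt,
    template=prompt_template,
    fields=fields,
    goldens=golden_data,
    origin_model=origin_model,
    target_models=target_models,
    evaluation_config=evaluation_config,  # custom LLM-as-a-judge settings defined above
)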
Computed metrics
JSON Match
"JSON_Match": Determines if the LLM's JSON output matches the golden JSON output. Computesprecision,recall, andf1scores based on individual fields in the JSON, averaged across all samples. Useful for applications requiring structured outputs.
📓 Example notebook: nd_json_match.ipynb
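The exact scoring code isn't reproduced here; the sketch below shows one plausible field-level precision/recall/F1 computation for flat JSON objects, purely to build intuition for how JSON_Match behaves. Not Diamond's actual implementation may handle nesting and matching differently.

# Illustrative only: one way to compute field-level precision/recall/F1 between a
# predicted JSON object and a golden JSON object (flat dicts, exact value match).
def json_match_scores(predicted: dict, golden: dict):
    matched = sum(
        1 for key, value in predicted.items() if key in golden and golden[key] == value
    )
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(golden) if golden else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return precision, recall, f1

print(json_match_scores(
    {"city": "Paris", "country": "France"},
    {"city": "Paris", "country": "France", "population": 67000000},
))
# (1.0, 0.6666666666666666, 0.8)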
Exact Match
"EXACT_MATCH": Checks if the response is exactly the same as the reference text, returning 1 for an exact match and 0 otherwise.
📓 Example notebook: nd_exactmatch.ipynb
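Conceptually this is a per-sample string equality check; the snippet below is just an illustration (whether the SDK normalizes whitespace or case before comparing is not specified here).

# Illustration of the exact-match score for a single sample.
def exact_match(response: str, reference: str) -> int:
    return 1 if response == reference else 0

print(exact_match("positive", "positive"))  # 1
print(exact_match("positive", "Positive"))  # 0 (case-sensitive comparison assumed)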
BLEU
"BLEU": Measures similarity between response and reference based on n-gram precision and brevity penalty. Originally designed for machine translation evaluation, BLEU scores range from 0 to 1, where 1 indicates a perfect match.
Learn more: Ragas BLEU Score Documentation
📓 Example notebook: nd_ragas_bleu.ipynb
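The score itself comes from Ragas (linked above); for intuition, the standalone sketch below computes sentence-level BLEU with NLTK, which illustrates the same idea even though its exact values may differ from the Ragas implementation.

# Standalone illustration of BLEU (n-gram precision with a brevity penalty).
# Scores here come from NLTK and may differ slightly from the Ragas implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

score = sentence_bleu(
    [reference],  # BLEU accepts multiple references
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print(round(score, 3))  # closer to 1.0 means closer to the reference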
ROUGE
"ROUGE": Evaluates natural language generation quality by measuring n-gram overlap between response and reference. Can be configured for different modes (rouge1, rougeL) and measures (precision, recall, F1). Scores range from 0 to 1.
Learn more: Ragas ROUGE Score Documentation
📓 Example notebook: nd_ragas_rouge.ipynb
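For intuition about the modes and measures, the sketch below uses Google's rouge-score package (pip install rouge-score); the Ragas metric linked above exposes equivalent rouge1/rougeL modes and precision/recall/F1 measures.

# Standalone illustration of ROUGE using the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",     # reference
    prediction="the cat is on the mat",  # model response
)

for mode, result in scores.items():
    print(mode, round(result.precision, 3), round(result.recall, 3), round(result.fmeasure, 3))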
METEOR
"METEOR": Evaluates response quality using a combination of precision, recall, and synonym matching. Better handles morphological variations and paraphrasing compared to BLEU. Particularly useful for languages with rich morphology.
Learn more: HuggingFace METEOR Documentation
📓 Example notebook: nd_meteor.ipynb
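The sketch below computes METEOR with the Hugging Face evaluate library linked above for a single prediction/reference pair; it is a standalone illustration, not the tool's internal scoring code.

# Illustration of METEOR via the Hugging Face evaluate library
# (pip install evaluate nltk). Scores range from 0 to 1.
import evaluate

meteor = evaluate.load("meteor")
result = meteor.compute(
    predictions=["the cat is on the mat"],
    references=["the cat sat on the mat"],
)
print(round(result["meteor"], 3))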
Using evaluation metrics
Here's how to specify different evaluation metrics when requesting prompt adaptation:
Set your API key
export NOTDIAMOND_API_KEY=YOUR_NOTDIAMOND_API_KEY

Python

import os
from notdiamond import NotDiamond
client = NotDiamond(api_key=os.environ.get("NOTDIAMOND_API_KEY"))
response = client.prompt_adaptation.adapt(
    system_prompt=system_prompt,
    template=prompt_template,
    fields=fields,
    goldens=golden_data,
    origin_model=origin_model,
    target_models=target_models,
    evaluation_metric="EXACT_MATCH"  # Use exact match for classification tasks
)

TypeScript

import NotDiamond from 'notdiamond';
const client = new NotDiamond({api_key: process.env.NOTDIAMOND_API_KEY});
const response = await client.promptAdaptation.adapt({
    systemPrompt: systemPrompt,
    template: promptTemplate,
    fields: fields,
    goldens: goldenData,
    originModel: originModel,
    targetModels: targetModels,
    evaluationMetric: 'EXACT_MATCH' // Use exact match for classification tasks
});
