Key concepts
Before you start using prompt adaptation, familiarize yourself with these key concepts and requirements.
Golden examples
Golden examples (or "goldens") are reference data samples that include inputs and their expected correct outputs. These serve as ground truth for evaluating how well your prompts perform.
A golden example consists of:
- Fields - The input variables that populate your prompt template (e.g., `question`, `context`)
- Answer - The expected correct output for this input
Example:
{
    "fields": {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe..."
    },
    "answer": "Paris"
}
Goldens can be provided in either of the following two ways:
- All together via the `goldens` parameter
- Separated into `train_goldens` and `test_goldens` (recommended)
If you submit everything via the `goldens` parameter, our algorithm uses a small subset of the goldens as the train set and the entire dataset for final test-time evaluation.
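For example, here is a minimal sketch of the two submission styles written as Python dictionaries; the surrounding request structure is illustrative rather than the exact API schema:
# golden_data: a list of golden examples in the format shown above

# Option 1: submit everything via the goldens parameter.
# A small subset is used for training; the full set is used for final evaluation.
request_option_1 = {
    "goldens": golden_data,
}

# Option 2 (recommended): provide an explicit train/test split.
request_option_2 = {
    "train_goldens": golden_data[:20],  # illustrative split sizes
    "test_goldens": golden_data[20:],
}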
Dataset requirements
Minimum samples: You must provide at least 25 golden examples for prompt adaptation to work effectively. If you have `prototype_mode` enabled, you can provide as few as 3 samples.
Maximum samples: You can submit up to 200 samples per request. Larger datasets result in longer processing times but may improve optimization quality.
Format: Each sample must include:
- A `fields` dictionary with values for each variable in your prompt template
- An `answer` string with the expected output
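If you want to catch formatting problems before submitting, a small local check along these lines can help; this helper is a hypothetical convenience, not part of the API:
import re

def validate_goldens(golden_data, prompt_template, min_samples=25, max_samples=200):
    """Check sample count, field coverage, and answer types before submission."""
    template_fields = set(re.findall(r"\{(\w+)\}", prompt_template))
    if not (min_samples <= len(golden_data) <= max_samples):
        raise ValueError(f"expected {min_samples}-{max_samples} samples, got {len(golden_data)}")
    for i, sample in enumerate(golden_data):
        if set(sample["fields"]) != template_fields:
            raise ValueError(f"sample {i}: fields do not match the prompt template")
        if not isinstance(sample["answer"], str):
            raise ValueError(f"sample {i}: answer must be a string")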
Fields
Fields are the named variables in your prompt template that get replaced with actual values. The field names must match between your template and your golden data.
Example:
# Your template has two fields: {context} and {question}
prompt_template = """
Context: {context}
Question: {question}
"""
# Your fields list must match
fields = ["context", "question"]
# Each golden example must provide values for these fields
golden_data = [
    {
        "fields": {
            "context": "...",
            "question": "..."
        },
        "answer": "..."
    }
]
Origin model vs target models
Origin model: The LLM your current prompt was designed for. This optional parameter evaluates your original prompt against the origin model to establish a baseline for comparison.
Target models: The LLM(s) you want to migrate to. You can specify up to 4 target models in a single request. If you need to specify more target models, reach out to our team.
Example:
origin_model = {"provider": "google", "model": "gemini-2.5-pro"}
target_models = [
    {"provider": "anthropic", "model": "claude-sonnet-4-5-20250929"},
    {"provider": "openai", "model": "gpt-5-2025-08-07"}
]
See Supported Models for the full list of available models.
Evaluation metrics
Evaluation metrics determine how prompt quality is measured during optimization. You must specify either:
Option 1: Use a predefined metric
"evaluation_metric": "LLMaaJ:Sem_Sim_1" # Default semantic similarityOption 2: Define a custom metric
"evaluation_config": {
"llm_judging_prompt": "Your custom judging prompt with {question} and {answer}",
"llm_judge": "openai/gpt-5-2025-08-07",
"correctness_cutoff": 0
}See Evaluation Metrics for detailed information on available metrics.
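For illustration only, a filled-in custom metric might look like the Python dict below; the judging prompt wording, how the {question} and {answer} placeholders are used, and the cutoff value are all assumptions rather than a prescribed format:
# Sketch of a custom metric; wording and values are assumptions, not a prescribed format.
evaluation_config = {
    "llm_judging_prompt": (
        "Question: {question}\n"
        "Expected answer: {answer}\n"
        "Judge whether the model's response matches the expected answer."
    ),
    "llm_judge": "openai/gpt-5-2025-08-07",
    "correctness_cutoff": 0,
}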
Job limits and processing time
Concurrency: Currently limited to 1 job per user at a time. You can include multiple target models in a single job. If you need higher concurrency limits, reach out to our team.
Processing time: Jobs typically take several minutes to complete, depending on:
- Number of samples (25-200)
- Number of target models (1-4)
- Complexity of the prompts
You'll receive an `adaptation_run_id` immediately, which you can use to check job status and retrieve results once complete.
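For instance, a rough polling loop over the run ID might look like the sketch below; the endpoint URL, response fields, and status values are assumptions for illustration only, so check the API reference for the real routes and schema:
import time
import requests

# Placeholder endpoint and response fields -- not the documented API schema.
API_KEY = "<NOT_DIAMOND_API_KEY>"
adaptation_run_id = "<returned when you submit the job>"

while True:
    resp = requests.get(
        f"https://api.example.com/v1/prompt-adaptation/{adaptation_run_id}",  # hypothetical route
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    job = resp.json()
    if job.get("status") in ("completed", "failed"):
        break
    time.sleep(30)  # jobs typically take several minutes to complete

results = job.get("results")  # available once the job completes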
What you'll need
To use prompt adaptation, prepare:
- A Not Diamond API key
- Your current system prompt and prompt template
- At least 25 golden examples in the correct format
- Target model specifications
- An evaluation metric (or use the default)
Once you have these ready, proceed to the quickstart to make your first API call.