DSPy tutorial
DSPy is a framework for algorithmically optimizing LLM prompts and parameters. It treats a prompt as the result of a series of string operations whose parameters (such as the instructions and the few-shot examples) can be tuned. It then tunes those parameters with black-box optimization techniques, using LLMs to propose the changes.
This tutorial largely follows DSPy's example of optimizing prompts to improve the performance of GPT-3.5-Turbo on the ScoNe dataset, using the BootstrapFewShotWithRandomSearch compiler and the Chain-of-Thought (CoT) prompt module. The process can be repeated for every LLM we want to route between.
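To make the idea of a prompt with tunable parameters concrete, here is a minimal sketch that calls no LLM. The helper `build_prompt`, its layout, and the example sentences are hypothetical and are not DSPy's actual prompt format. The point is only that the instruction and the list of demonstrations are parameters an optimizer can vary while the template stays fixed.

```python
def build_prompt(instruction, demos, context, question):
    """Assemble a prompt from tunable parts (hypothetical layout, not DSPy's)."""
    parts = [instruction]
    # Each demonstration is a (context, question, answer) triple.
    for demo_context, demo_question, demo_answer in demos:
        parts.append(
            f"Context: {demo_context}\nQuestion: {demo_question}\nAnswer: {demo_answer}"
        )
    # The final block leaves the answer open for the model to complete.
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


# Same template, two settings of the "demos" parameter (illustrative data).
zero_shot = build_prompt(
    "Answer Yes or No.", [], "The man is not walking.", "Is the man moving?"
)
few_shot = build_prompt(
    "Answer Yes or No.",
    [("A dog is barking.", "Is an animal making noise?", "Yes")],
    "The man is not walking.",
    "Is the man moving?",
)
print(few_shot)
```

An optimizer in this picture would try different instructions or demonstration lists and keep whichever prompt scores best on held-out data.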
Initialization
pip install dspy-ai notdiamond pandas
Create a new file and import the following modules:
import dspy
from dspy import LM
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from notdiamond import NotDiamond, LLMConfig
from typing import Optional
import glob
import random
import os
import pandas as pd
To start, we define the LLM whose prompt we want to optimize. We will use openai/gpt-3.5-turbo as an example.
Model support
DSPy supports OpenAI, Anyscale, Cohere, Together, and PremAI, but you can also define a custom LLM client by subclassing the base LM class. In fact, you can straightforwardly create a custom LLM client using the notdiamond library.
class CustomLMClient(LM):
    """DSPy LM client that sends every request through the notdiamond client."""

    def __init__(
        self, llm_config: LLMConfig, api_key: Optional[str] = None, **kwargs
    ):
        super().__init__(model=llm_config.model, **kwargs)
        self.llm_config = llm_config
        self.client = NotDiamond(api_key=api_key)
        self.history = []

    def basic_request(self, prompt: str, **kwargs):
        # Send the prompt as a single user message, with self.llm_config
        # as the only candidate model.
        result, _, _ = self.client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    # Escape literal braces in the prompt.
                    "content": prompt.replace("{", "{{").replace("}", "}}")
                }
            ],
            model=[self.llm_config]
        )
        # Record the exchange so it can be inspected later.
        self.history.append({
            "prompt": prompt,
            "response": result.content,
            "kwargs": kwargs
        })
        return result.content

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        # DSPy expects a list of completions, so wrap the single response.
        response = self.request(prompt, **kwargs)
        return [response]

llm_config = LLMConfig(
    provider="openai",
    model="gpt-3.5-turbo",
    api_key=os.environ['OPENAI_API_KEY']
)
lm = CustomLMClient(llm_config=llm_config)
dspy.settings.configure(lm=lm)
Dataset
First, download the dataset:
git clone https://github.com/selenashe/ScoNe.git
Define the dataloader:
def load_scone(dirname):
    dfs = []
    for filename in glob.glob(dirname + "/*.csv"):
        df = pd.read_csv(filename, index_col=0)
        df['category'] = os.path.basename(filename).replace(".csv", "")
        dfs.append(df)
    data_df = pd.concat(dfs)

    def as_example(row):
        # The 'one_scoped' file is from an earlier dataset, MoNLI, and
        # so is formatted a bit differently:
        suffix = '' if row['category'] == 'one_scoped' else '_edited'
        # Reformat the hypothesis to be an embedded clause in a question:
        hkey = 'sentence2' + suffix
        question = row[hkey][0].lower() + row[hkey][1:].strip(".")
        question = f"Can we logically conclude for sure that {question}?"
        # Binary task formulation:
        label = "Yes" if row['gold_label' + suffix] == 'entailment' else "No"
        return dspy.Example({
            "context": row['sentence1' + suffix],
            "question": question,
            "answer": label,
            "category": row['category']
        }).with_inputs("context", "question")

    return list(data_df.apply(as_example, axis=1).values)
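The string handling inside `as_example` is easier to follow on a single sentence. The sketch below pulls it out into two standalone helpers, `to_question` and `to_label`; both names are hypothetical and introduced here for illustration, and the input sentence is made up rather than taken from ScoNe.

```python
def to_question(hypothesis):
    """Mirror the string handling in as_example for a single hypothesis."""
    # Lower-case the first character and strip periods from both ends
    # of the remainder, so the sentence can sit inside a question.
    clause = hypothesis[0].lower() + hypothesis[1:].strip(".")
    return f"Can we logically conclude for sure that {clause}?"


def to_label(gold_label):
    # Binary formulation: only 'entailment' maps to "Yes".
    return "Yes" if gold_label == "entailment" else "No"


print(to_question("The man is not walking a dog."))
# Can we logically conclude for sure that the man is not walking a dog?
print(to_label("entailment"), to_label("neutral"))
# Yes No
```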
Create train, dev, and test splits:
all_train = load_scone("ScoNe/scone_nli/train")
test = load_scone("ScoNe/scone_nli/test")
# Fix the seed so the shuffle, and therefore the split, is reproducible.
random.seed(1)
random.shuffle(all_train)
# 200 random train, 50 random dev, 200 test:
train, dev = all_train[:200], all_train[200:250]
test = [ex for ex in test if ex.category == "one_scoped"]
len(train), len(dev), len(test)
Evaluation tools
Running prompt optimization will incur inference costs
Remember that the cost scales with the number of LLM calls made during optimization.
The compiler needs an objective to optimize for. Here we will use the built-in answer_exact_match metric as an example:
scone_accuracy = dspy.evaluate.metrics.answer_exact_match
evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)
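A DSPy metric is a callable that receives the gold example and the prediction and returns a score. The sketch below is a simplified stand-in for an exact-match metric, not the built-in implementation: `simple_exact_match` and `normalize` are hypothetical names, and the normalization shown (lower-casing and trimming surrounding punctuation) is an assumption that may differ from what answer_exact_match does.

```python
import string
from types import SimpleNamespace


def normalize(text):
    # Simplified normalization (an assumption): trim whitespace, lower-case,
    # and drop punctuation or spaces at either end.
    return text.strip().lower().strip(string.punctuation + " ")


def simple_exact_match(example, pred, trace=None):
    """Return True when the predicted answer equals the gold answer after normalization."""
    return normalize(example.answer) == normalize(pred.answer)


gold = SimpleNamespace(answer="Yes")
print(simple_exact_match(gold, SimpleNamespace(answer="yes.")))  # True
print(simple_exact_match(gold, SimpleNamespace(answer="No")))    # False
```

Any function with this shape could be passed as the metric in place of scone_accuracy, which is useful when a task needs a looser or stricter notion of correctness.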
Baseline: CoT zero-shot
First, we will construct a DSPy programme using zero-shot CoT (no examples). This gives us the baseline performance of gpt-3.5-turbo on the dataset. It also lets us get familiar with two key concepts in DSPy: signatures and modules.
We begin by defining the signature for this DSPy programme.
class ScoNeSignature(dspy.Signature):
    ("""You are given some context (a premise) and a question (a hypothesis). """
    """You must indicate with Yes/No answer whether we can logically """
    """conclude the hypothesis from the premise.""")

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")
A DSPy Signature is the most basic form of task description: it declares the task's inputs and outputs and, optionally, a short description of them and of the task itself. There are two ways to define a Signature: inline and class-based. Here we use the class-based method.
Next, we define the DSPy programme module.
class ScoNeCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)
If you have used PyTorch before, this should look pretty familiar. A DSPy module is a building block for defining your LLM programme. It can be run by itself, chained with other modules, or subclassed, allowing you to compose your LLM programme as you wish.
You can run this zero-shot CoT programme to see how gpt-3.5-turbo performs without any optimization:
cot_zeroshot = ScoNeCoT()
evaluator(cot_zeroshot, metric=scone_accuracy)
"""
Average Metric: 100/200 (50%)
"""
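Clearly, gpt-3.5-turbo is pretty bad at zero-shot reasoning about negation: on a binary Yes/No task, 50% is no better than what random guessing achieves when the labels are balanced.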
Optimized few-shot with bootstrapped demonstrations
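To optimize the CoT prompt, we will use the BootstrapFewShotWithRandomSearch compiler. It uses a teacher LLM (gpt-4-turbo in this case) to bootstrap few-shot demonstrations from the training set, builds a number of candidate programmes from different sets of demonstrations, and then uses random search to pick the candidate that scores best on the validation set.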
gpt4T = dspy.OpenAI(model='gpt-4-turbo', max_tokens=350, model_type='chat')
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=scone_accuracy,
    teacher_settings=dict(lm=gpt4T))
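The random-search step can be pictured with a toy sketch that calls neither an LLM nor DSPy. Everything in it is a stand-in: `random_search_demos` and `toy_score` are hypothetical names, the categories are placeholders, and the scoring rule replaces a real evaluation run. The actual compiler also bootstraps demonstrations with the teacher model and evaluates full programmes, which this sketch leaves out.

```python
import random


def toy_score(demos, valset):
    # Stand-in for running a programme on the validation set: an example
    # counts as correct only if some demo shares its category.
    seen = {demo["category"] for demo in demos}
    return sum(ex["category"] in seen for ex in valset) / len(valset)


def random_search_demos(trainset, valset, num_candidates=10, max_demos=2, seed=0):
    """Sample random demo sets and keep the one with the best validation score."""
    rng = random.Random(seed)
    # Start from the zero-shot candidate (no demonstrations).
    best_demos, best_score = [], toy_score([], valset)
    for _ in range(num_candidates):
        k = rng.randint(1, min(max_demos, len(trainset)))
        demos = rng.sample(trainset, k)
        score = toy_score(demos, valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


# Placeholder data: three categories, two training examples each.
categories = ["a", "b", "c"]
trainset = [{"category": c} for c in categories for _ in range(2)]
valset = [{"category": c} for c in categories]
best_demos, best_score = random_search_demos(trainset, valset)
print(len(best_demos), round(best_score, 2))
```

In this picture, num_candidates plays the role of num_candidate_programs above, and the cap on demos per candidate plays the role of the max_bootstrapped_demos and max_labeled_demos limits.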
Run the compiler:
cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
evaluator(cot_fewshot, metric=scone_accuracy)
"""
Average Metric: 171/200 (85.5%)
"""
You can then view the optimized prompt with:
lm.inspect_history(n=1)
Wrapping up
In this example, we showed how to optimize prompts for GPT-3.5 on ScoNe. We can run the same process for any other LLM simply by swapping it in. With our optimized prompts and evaluation scores for each LLM, we can then train a custom router that always calls the best (prompt, model) combination.
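As a rough illustration of the final selection step, the sketch below picks the best (model, prompt) pair from a table of evaluation scores. The helper `best_combination` and the prompt names are hypothetical, and the table holds only the two accuracies reported in this tutorial. A trained router would choose per input rather than globally, so treat this as the simplest possible stand-in.

```python
# Accuracies reported earlier in this tutorial for gpt-3.5-turbo on the test split.
scores = {
    ("gpt-3.5-turbo", "cot_zeroshot"): 100 / 200,
    ("gpt-3.5-turbo", "cot_fewshot"): 171 / 200,
}


def best_combination(scores):
    """Return the (model, prompt) pair with the highest evaluation score."""
    return max(scores, key=scores.get)


model, prompt_name = best_combination(scores)
print(model, prompt_name)  # gpt-3.5-turbo cot_fewshot
```

Repeating the tutorial for other LLMs would add more rows to the table, one per (model, prompt) pair.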