DSPy tutorial
DSPy is a framework for algorithmically optimizing LLM prompts and parameters. It treats prompt construction as a series of parameterized string operations, and then tunes those parameters using black-box optimization techniques and LLM-generated candidates.
This tutorial largely follows DSPy's example for optimizing prompts to improve the performance of GPT-3.5-Turbo on the ScoNe dataset, using the BootstrapFewShotWithRandomSearch
compiler and the Chain-of-Thought (CoT) prompt module. The same process can be repeated for as many LLMs as we want to route between.
Initialization
pip install dspy-ai notdiamond pandas
Create a new file and import the following modules:
import dspy
from dspy import LM
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from notdiamond import NotDiamond, LLMConfig
from typing import Optional
import glob
import random
import os
import pandas as pd
To start, we define the LLM we want to optimize the prompt for. We will use openai/gpt-3.5-turbo as an example.
Model support
DSPy supports OpenAI, Anyscale, Cohere, Together, and PremAI, but you can also easily define a custom LLM client using the base
LM
class. In fact, you can straightforwardly create a custom LLM client using the notdiamond
library.
class CustomLMClient(LM):
    def __init__(
        self, llm_config: LLMConfig, api_key: Optional[str] = None, **kwargs
    ):
        super().__init__(model=llm_config.model, **kwargs)
        self.llm_config = llm_config
        self.client = NotDiamond(api_key=api_key)
        self.history = []

    def basic_request(self, prompt: str, **kwargs):
        # Send the prompt through Not Diamond; curly braces are escaped so they
        # are not treated as template placeholders downstream.
        result, _, _ = self.client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt.replace("{", "{{").replace("}", "}}")
                }
            ],
            model=[self.llm_config]
        )
        # Record the call so lm.inspect_history() can show it later.
        self.history.append({
            "prompt": prompt,
            "response": result.content,
            "kwargs": kwargs
        })
        return result.content

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        response = self.request(prompt, **kwargs)
        return [response]
llm_config = LLMConfig(
    provider="openai",
    model="gpt-3.5-turbo",
    api_key=os.environ['OPENAI_API_KEY']
)
lm = CustomLMClient(llm_config=llm_config)
dspy.settings.configure(lm=lm)
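With the client configured, you can sanity-check it with a direct call before building any DSPy programme. This is a minimal smoke test, assuming your Not Diamond and OpenAI API keys are set in the environment; the prompt itself is illustrative.

# Quick smoke test of the custom client: __call__ returns a list of completions.
print(lm("Reply with the single word: ready")[0])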
Dataset
First, download the dataset:
git clone https://github.com/selenashe/ScoNe.git
Define the dataloader
def load_scone(dirname):
    dfs = []
    for filename in glob.glob(dirname + "/*.csv"):
        df = pd.read_csv(filename, index_col=0)
        df['category'] = os.path.basename(filename).replace(".csv", "")
        dfs.append(df)
    data_df = pd.concat(dfs)

    def as_example(row):
        # The 'one_scoped' file is from an earlier dataset, MoNLI, and
        # so is formatted a bit differently:
        suffix = '' if row['category'] == 'one_scoped' else '_edited'
        # Reformat the hypothesis to be an embedded clause in a question:
        hkey = 'sentence2' + suffix
        question = row[hkey][0].lower() + row[hkey][1:].strip(".")
        question = f"Can we logically conclude for sure that {question}?"
        # Binary task formulation:
        label = "Yes" if row['gold_label' + suffix] == 'entailment' else "No"
        return dspy.Example({
            "context": row['sentence1' + suffix],
            "question": question,
            "answer": label,
            "category": row['category']
        }).with_inputs("context", "question")

    return list(data_df.apply(as_example, axis=1).values)
Create train, dev, and test splits
all_train = load_scone("ScoNe/scone_nli/train")
test = load_scone(dirname="ScoNe/scone_nli/test")
random.seed(1)
random.shuffle(all_train)
# 200 random train, 50 random dev, 200 test:
train, dev = all_train[:200], all_train[200:250]
test = [ex for ex in test if ex.category == "one_scoped"]
len(train), len(dev), len(test)
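To confirm the loader produced what we expect, you can inspect a single example; the fields are exactly those defined in as_example above, exposed as attributes on the dspy.Example.

# Peek at one training example.
ex = train[0]
print(ex.context)
print(ex.question)
print(ex.answer)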
Evaluation tools
Running prompt optimization will incur inference costs
Remember that running prompt optimization will incur inference costs that scale with the number of LLM calls made during compilation.
The compiler needs an objective to optimize for. Here we will use the built-in answer_exact_match
metric as an example:
scone_accuracy = dspy.evaluate.metrics.answer_exact_match
evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)
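You are not limited to the built-in metrics: any callable that takes an example and a prediction (plus an optional trace) can serve as the objective. Below is a minimal sketch of a custom metric with the same interface; the function name is illustrative, not part of DSPy.

# Hypothetical custom metric with the same (example, pred, trace) interface.
def yes_no_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()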
Baseline: CoT zero-shot
First, we will construct a DSPy programme using zero-shot CoT (no examples). This will serve as a baseline for gpt-3.5-turbo's
performance on the dataset. It will also let us familiarize ourselves with two key concepts in DSPy: signatures and modules.
We begin by defining the signature for this DSPy programme.
class ScoNeSignature(dspy.Signature):
    ("""You are given some context (a premise) and a question (a hypothesis). """
     """You must indicate with Yes/No answer whether we can logically """
     """conclude the hypothesis from the premise.""")

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")
A DSPy Signature is the most basic form of task description: it simply declares the inputs and outputs, optionally with a short description of each field and of the task. There are two ways to define a Signature: inline and class-based. Here we are using the class-based method.
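For reference, an inline signature expresses the same field structure as a short string. This is a minimal sketch: inputs sit to the left of the arrow, the output to the right, and the task instructions from the class above are not included.

# Inline equivalent of the field structure defined by ScoNeSignature.
generate_answer = dspy.ChainOfThought("context, question -> answer")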
Next, we define the DSPy programme module.
class ScoNeCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)
If you have used PyTorch before, this should look pretty familiar. A DSPy module is a building block for defining your LLM programme. It can be run on its own, chained with other modules, or subclassed, allowing you to compose your LLM programme however you wish.
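For instance, you can call the module directly on a single premise/hypothesis pair; the inputs below are illustrative, and the output fields mirror the signature.

# Run the CoT module on one hand-written example.
cot = ScoNeCoT()
pred = cot(
    context="There is no dog in the park that is barking.",
    question="Can we logically conclude for sure that a dog in the park is barking?",
)
print(pred.answer)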
You can run this zero-shot CoT programme to see how gpt-3.5-turbo performs without any optimization:
cot_zeroshot = ScoNeCoT()
evaluator(cot_zeroshot, metric=scone_accuracy)
"""
Average Metric: 100/200 (50%)
"""
Clearly, gpt-3.5-turbo is pretty bad at zero-shot reasoning about negation.
Optimized few-shot with bootstrapped demonstrations
To optimize the CoT prompt, we will use the BootstrapFewShotWithRandomSearch
compiler. It uses a teacher LLM (gpt-4-turbo
in this case) to bootstrap candidate few-shot demonstrations, then uses random search over those candidates to find the best-performing programme.
gpt4T = dspy.OpenAI(model='gpt-4-turbo', max_tokens=350, model_type='chat')

bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=scone_accuracy,
    teacher_settings=dict(lm=gpt4T))
Run the compiler:
cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
evaluator(cot_fewshot, metric=scone_accuracy)
"""
Average Metric: 171/200 (85.5%)
"""
You can then view the optimized prompts via
lm.inspect_history(n=1)
Wrapping up
In this example, we showed how to optimize prompts for GPT-3.5 on ScoNe. We can run the same process for any other LLM simply by swapping it in. With our optimized prompts and evaluation scores for each LLM, we can then train a custom router that always calls the best (prompt, model) combination.
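For example, repeating the optimization over several candidate models might look like the sketch below. The candidate list and score collection are illustrative, not a prescribed Not Diamond workflow.

# Hypothetical loop: optimize and evaluate the same programme for several models.
candidate_configs = [
    LLMConfig(provider="openai", model="gpt-3.5-turbo", api_key=os.environ['OPENAI_API_KEY']),
    # ... add any other provider/model pairs you want to route between
]

scores = {}
for config in candidate_configs:
    dspy.settings.configure(lm=CustomLMClient(llm_config=config))
    compiled = bootstrap_optimizer.compile(ScoNeCoT(), trainset=train, valset=dev)
    scores[config.model] = evaluator(compiled, metric=scone_accuracy)

print(scores)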