Human-in-the-loop routing

LLMs are proficient on about 80% of tasks—it’s the last 20% that often prevents a project from reaching production. The “human-in-the-loop” approach is designed to close this gap by leveraging a human expert to review LLM outputs. But how do you determine when to use human verification? Requiring humans to review millions or billions of outputs a day is not very efficient, and LLMs themselves are notoriously bad at identifying when they have made a mistake. It's also not a good experience for end users if they're forced to decipher when an LLM is misleading them and then request human assistance.

To solve this problem, we need to be able to accurately predict exactly when a query can be answered by an LLM and when it should go to a human expert.

In this tutorial, we will use Not Diamond to build a custom router that determines when a query can be answered by an LLM and when it should go to a human. We’ll then use LangGraph to build an app that routes queries between LLMs and humans. You can follow along with the example below or in this notebook.

Train a custom router

To build a custom router, we first need evaluation data for the LLMs we are interested in routing between, as well as a custom model definition that represents a human.

For the purpose of this example, we’ll use a prepared dataset containing evaluation results for the following models on the GPQA benchmark:

  • claude-3-5-sonnet-20240620
  • claude-3-opus-20240229
  • gpt-4o-2024-05-13
  • gpt-4-turbo-2024-04-09
  • gemini-1.5-pro-latest
!curl -L "https://drive.google.com/uc?export=download&id=1MWyjjaZsgnp8xDQ0q7chKaT7HxFJlCT1" -o gpqa.csv
!apt install graphviz libgraphviz-dev pkg-config
!pip install -q notdiamond[create] pandas langgraph langchain-anthropic langchain-openai langchain-google-genai pygraphviz --upgrade
import os

os.environ["NOTDIAMOND_API_KEY"] = 'YOUR_NOTDIAMOND_API_KEY'
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'
os.environ["ANTHROPIC_API_KEY"] = 'YOUR_ANTHROPIC_API_KEY'
os.environ["GOOGLE_API_KEY"] = 'YOUR_GOOGLE_API_KEY'
from IPython.display import display, Image  # Image is needed later to render the graph
import pandas as pd
from pprint import pprint
from notdiamond.toolkit import CustomRouter
from notdiamond import NotDiamond

# Load the dataset
df = pd.read_csv('gpqa.csv')

# Display the column names to ensure it's loaded correctly
pprint(df.columns)

Dataset preparation

Inside the dataset, each LLM has a response and a final_score column, recording the model’s response to the Input and the score of that response. Scores are binary: 0 for wrong answers and 1 for correct answers.
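As a quick sanity check, we can compute each model’s accuracy directly from these binary scores (a minimal sketch, assuming the columns follow the <provider>/<model>/final_score naming described above):

# Mean of a binary final_score column = that model's accuracy on GPQA.
# Assumes columns are named "<provider>/<model>/final_score".
score_cols = [col for col in df.columns if col.endswith("final_score")]
for col in score_cols:
    print(f"{col.rsplit('/', 1)[0]}: {df[col].mean():.1%}")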

We will first define a human as a custom model for the custom router:

from notdiamond.llms.config import LLMConfig

human = LLMConfig(
    provider="hr",
    model="human",
    is_custom=True,
    context_length=100000000,
    input_price=1e6, # USD per million tokens
    output_price=1e6, # USD per million tokens
    latency=1e6 # time to first token (seconds)
)

Here, we’ve used the LLMConfig class to define a custom model by setting is_custom=True. When defining a custom model, we need to tell Not Diamond the context_length, input_price, output_price, and latency of the model. Since we’re defining a human, we can set these values to be large.
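Note that the string form of this config is used as a column prefix in the next step (a quick check; we assume LLMConfig stringifies as "provider/model", which is how the dataset columns are keyed later):

# Sanity check: the "<provider>/<model>" string is used as a column prefix below.
# (Assumption: LLMConfig renders as "provider/model" when cast to str.)
print(str(human))  # expected: hr/human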

Next, we’ll give the human model a score for each Input in the dataset. Since we don’t actually have human-written responses, we can simply give the human model a score of 1 whenever the most performant models in the dataset did not all answer the query correctly. In essence, this tells the router to send hard queries to a human.

top_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "anthropic/claude-3-5-sonnet-20240620",
]

def create_human_score(row):
    # Give the human a score of 1 when at least one top model answered
    # incorrectly, i.e. the query is hard enough to escalate.
    llm_scores = []
    for col, val in row.items():
        if "final_score" in col and any(x in col for x in top_providers):
            llm_scores.append(val)
    if sum(llm_scores) < len(top_providers):
        return 1.
    else:
        return 0.

def create_human_response(row):
    return "Your question will be directed to the next available agent."

df[f"{human.provider}/{human.model}/final_score"] = df.apply(create_human_score, axis=1)
df[f"{human.provider}/{human.model}/response"] = df.apply(create_human_response, axis=1)

pprint(df.columns)
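As a sanity check, the mean of the new human score column tells us what fraction of queries the training data labels for human escalation (using the column created above):

# Fraction of queries where not all top models were correct, i.e. the
# share of traffic the router is being taught to escalate to a human.
human_rate = df[f"{human.provider}/{human.model}/final_score"].mean()
print(f"Queries labeled for human escalation: {human_rate:.1%}")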

Finally, we will process the dataset into a dictionary of pandas.DataFrame objects, since that is the format notdiamond.toolkit.CustomRouter expects:

# Models to route between: "provider/model" strings for the LLMs,
# plus the human LLMConfig
llm_providers = [
    "openai/gpt-4o-2024-05-13",
    "openai/gpt-4-turbo-2024-04-09",
    "google/gemini-1.5-pro-latest",
    "anthropic/claude-3-opus-20240229",
    "anthropic/claude-3-5-sonnet-20240620",
    human
]

# Dictionaries to hold train and test splits for each provider
pzn_train = {}
pzn_test = {}

# Separating the data and creating the splits
for provider in llm_providers:
    provider_results = df.filter(
        ["Input", f"{str(provider)}/response", f"{str(provider)}/final_score"], axis=1
    )
    provider_results.rename(
        columns={
            f"{str(provider)}/response": "response",
            f"{str(provider)}/final_score": "score"
        },
        inplace=True
    )

    # Create train/test split
    train = provider_results.sample(frac=0.9, random_state=42)
    test = provider_results.drop(train.index)

    pzn_train[provider] = train
    pzn_test[provider] = test

# Display the number of samples in each split for the first provider as a sanity check
provider = llm_providers[0]
pprint(f"Train samples: {len(pzn_train[provider])}")
pprint(f"Test samples: {len(pzn_test[provider])}")

Train the custom router

To train the custom router, just call the CustomRouter.fit method using the dataset we have prepared:

# Initialize the CustomRouter object for training
trainer = CustomRouter(
    language="english",
    maximize=True,  # Indicate if higher scores are better (setting to False indicates the opposite)
)

# Train the model using your dataset
preference_id = trainer.fit(
    dataset=pzn_train,  # The dataset containing inputs, responses, and scores
    prompt_column="Input",  # Column name for the input prompts
    response_column="response",  # Column name for the model responses
    score_column="score"  # Column name for the scores
)
print(preference_id)

Note that fit returns a preference ID, preference_id, which we will use later to make API requests to Not Diamond. Training takes a couple of minutes, and you can track its status in the Not Diamond Dashboard.
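Since every subsequent routing call needs this preference ID, it can be convenient to persist it between sessions instead of retraining (a minimal convenience sketch; the file name is arbitrary and this is not part of the Not Diamond API):

# Save the preference ID so a later session can reuse the trained router
with open("preference_id.txt", "w") as f:
    f.write(preference_id)
# In a later session: preference_id = open("preference_id.txt").read().strip()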

Evaluate the custom router

Once the custom router is trained, you can evaluate it on the test set we prepared earlier:

# Evaluate the custom router using the test dataset
results = trainer.eval(
    dataset=pzn_test,
    prompt_column="Input",
    response_column="response",
    score_column="score",
    preference_id=preference_id
)

# Split the results into eval_results and eval_stats
eval_results, eval_stats = results

# Print the evaluation results and statistics
display(eval_results)
display(eval_stats)

As we can see, using Not Diamond we achieve human-grade performance while still sending a significant volume of queries to LLMs.

Human-in-the-loop

With the custom router trained, we can now build our human-in-the-loop app. We will build a simple graph, where the input query goes to our custom router to determine the best LLM or human to call.

First, we'll define all the nodes and our router method:

from langgraph.graph import MessagesState, StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import HumanMessage, AIMessage

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

from notdiamond import NotDiamond


gemini_15_pro = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
claude_3_opus = ChatAnthropic(model="claude-3-opus-20240229")
claude_35_sonnet = ChatAnthropic(model="claude-3-5-sonnet-20240620")
gpt_4_turbo = ChatOpenAI(model="gpt-4-turbo-2024-04-09")
gpt_4o = ChatOpenAI(model="gpt-4o-2024-05-13")
notdiamond = NotDiamond()


def router(state):
    # Ask Not Diamond to select the best target (an LLM or the human) for
    # the latest message, using our trained custom router via preference_id.
    session_id, provider = notdiamond.model_select(
        messages=[{"role": "user", "content": state["messages"][-1].content}],
        model=llm_providers,
        preference_id=preference_id
    )
    # The returned model name is the key used in the conditional edge mapping below.
    return provider.model


def ask_gpt_4o(state):
    response = gpt_4o.invoke(
        state["messages"]
    )
    response.content = f"This question was answered by gpt-4o-2024-05-13:\n\n{response.content}"
    return {"messages": [response]}


def ask_gpt_4_turbo(state):
    response = gpt_4_turbo.invoke(
        state["messages"]
    )
    response.content = f"This question was answered by gpt-4-turbo-2024-04-09:\n\n{response.content}"
    return {"messages": [response]}


def ask_claude_3_opus(state):
    response = claude_3_opus.invoke(
        state["messages"]
    )
    response.content = f"This question was answered by claude-3-opus-20240229:\n\n{response.content}"
    return {"messages": [response]}


def ask_claude_3_5_sonnet(state):
    response = claude_35_sonnet.invoke(
        state["messages"]
    )
    response.content = f"This question was answered by claude-3-5-sonnet-20240620:\n\n{response.content}"
    return {"messages": [response]}


def ask_gemini_15_pro(state):
    response = gemini_15_pro.invoke(
        state["messages"]
    )
    response.content = f"This question was answered by gemini-1.5-pro-latest:\n\n{response.content}"
    return {"messages": [response]}


def ask_human(state):
    # Placeholder node: execution is interrupted before this node runs
    # (via interrupt_before below), so the human's reply is injected
    # later with app.update_state.
    pass

Next, connect the nodes and define the conditional edge:

workflow = StateGraph(MessagesState)

workflow.add_node("ask_human", ask_human)
workflow.add_node("ask_gemini_15_pro", ask_gemini_15_pro)
workflow.add_node("ask_gpt_4o", ask_gpt_4o)
workflow.add_node("ask_gpt_4_turbo", ask_gpt_4_turbo)
workflow.add_node("ask_claude_3_opus", ask_claude_3_opus)
workflow.add_node("ask_claude_3_5_sonnet", ask_claude_3_5_sonnet)

workflow.add_conditional_edges(
    START,
    router,
    {
        "human": "ask_human",
        "gemini-1.5-pro-latest": "ask_gemini_15_pro",
        "gpt-4o-2024-05-13": "ask_gpt_4o",
        "gpt-4-turbo-2024-04-09": "ask_gpt_4_turbo",
        "claude-3-opus-20240229": "ask_claude_3_opus",
        "claude-3-5-sonnet-20240620": "ask_claude_3_5_sonnet"
    }
)
workflow.add_edge("ask_human", END)
workflow.add_edge("ask_gemini_15_pro", END)
workflow.add_edge("ask_gpt_4o", END)
workflow.add_edge("ask_gpt_4_turbo", END)
workflow.add_edge("ask_claude_3_opus", END)
workflow.add_edge("ask_claude_3_5_sonnet", END)

# Set up memory
memory = MemorySaver()

# Compile app
app = workflow.compile(checkpointer=memory, interrupt_before=["ask_human"])
display(Image(app.get_graph().draw_png()))
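If installing graphviz and pygraphviz is inconvenient, the compiled graph can also be rendered without them (a sketch; draw_mermaid_png is available on the graph object in recent langchain-core/langgraph versions and uses a remote renderer by default):

# Alternative rendering that avoids the pygraphviz system dependency
display(Image(app.get_graph().draw_mermaid_png()))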

Finally, run the app!

config = {"configurable": {"thread_id": "1"}}

input = """Astronomers are searching for exoplanets around two stars with exactly the same masses. Using the RV method, they detected one planet around each star, both with masses similar to that of Neptune. The stars themselves have masses similar to that of our Sun. Both planets were found to be in circular orbits.

Planet #1 was detected from the up to 5 miliangstrom periodic shift of a spectral line at a given wavelength. The periodic wavelength shift of the same spectral line in the spectrum of the host of planet #2 was 7 miliangstrom.

The question is: How many times is the orbital period of planet #2 longer than that of planet #1?
Choices:
(A) ~ 0.36
(B) ~ 0.85
(C) ~ 1.96
(D) ~ 1.40"""
query = {
    "messages": [HumanMessage(content=input)]
}
for event in app.stream(query, config, stream_mode="values"):
    event["messages"][-1].pretty_print()
================================ Human Message =================================

Astronomers are searching for exoplanets around two stars with exactly the same masses. Using the RV method, they detected one planet around each star, both with masses similar to that of Neptune. The stars themselves have masses similar to that of our Sun. Both planets were found to be in circular orbits.

Planet #1 was detected from the up to 5 miliangstrom periodic shift of a spectral line at a given wavelength. The periodic wavelength shift of the same spectral line in the spectrum of the host of planet #2 was 7 miliangstrom.

The question is: How many times is the orbital period of planet #2 longer than that of planet #1?
Choices:
(A) ~ 0.36
(B) ~ 0.85
(C) ~ 1.96
(D) ~ 1.40
================================== Ai Message ==================================

This question was answered by claude-3-5-sonnet-20240620:

Let's approach this step-by-step:

1) The radial velocity (RV) method detects planets by measuring the periodic Doppler shift in the star's spectrum caused by the planet's gravitational pull.

2) The amplitude of this shift (K) is related to the planet's orbital period (P) by the equation:

   K ∝ M_p * P^(-1/3)

   Where M_p is the planet's mass.

3) We're told that both planets have similar masses (like Neptune), and both stars have similar masses (like the Sun). This means we can consider M_p to be the same for both planets.

4) The amplitude of the shift (K) is directly proportional to the observed wavelength shift. For planet #1, this is 5 miliangstrom, and for planet #2, it's 7 miliangstrom.

5) Let's call the period of planet #1 P1 and the period of planet #2 P2. We can set up the proportion:

   5 ∝ P1^(-1/3)
   7 ∝ P2^(-1/3)

6) Dividing these:

   5/7 = (P2/P1)^(1/3)

7) Cubing both sides:

   (5/7)^3 = P2/P1

8) (5/7)^3 ≈ 0.51

9) This means P2 ≈ 0.51 * P1, or P1 ≈ 1.96 * P2

Therefore, the orbital period of planet #1 is about 1.96 times longer than that of planet #2.

The answer that best matches this is (C) ~ 1.96.

This query got routed to claude-3-5-sonnet-20240620! Now let’s try another example:

config = {"configurable": {"thread_id": "2"}}

input = """You have prepared a di-substituted 6-membered aromatic ring compound. The FTIR spectrum of this compound shows absorption peaks indicating the presence of an ester group. The 1H NMR spectrum shows six signals: two signals corresponding to aromatic-H, two signals corresponding to vinyl-H (one doublet and one doublet of quartets), and two signals corresponding to –CH3 groups. There are no signals corresponding to –CH2 groups. Identify the chemical formula of this unknown compound as either C11H12O2, C11H14O2, C12H12O2, or C12H14O2.
Choices:
(A) C12H12O2
(B) C11H12O2
(C) C11H14O2
(D) C12H14O2"""
query = {
    "messages": [HumanMessage(content=input)]
}
for event in app.stream(query, config, stream_mode="values"):
    event["messages"][-1].pretty_print()
================================ Human Message =================================

You have prepared a di-substituted 6-membered aromatic ring compound. The FTIR spectrum of this compound shows absorption peaks indicating the presence of an ester group. The 1H NMR spectrum shows six signals: two signals corresponding to aromatic-H, two signals corresponding to vinyl-H (one doublet and one doublet of quartets), and two signals corresponding to –CH3 groups. There are no signals corresponding to –CH2 groups. Identify the chemical formula of this unknown compound as either C11H12O2, C11H14O2, C12H12O2, or C12H14O2.
Choices:
(A) C12H12O2
(B) C11H12O2
(C) C11H14O2
(D) C12H14O2

Looks like this example triggered the "ask_human" interrupt. If we were building a chat app, this is where we would send the query to a human to assist with the response. To continue the graph execution, we simply update the state with a message of our choice and continue to the next node:

human_answer = """The correct answer is A"""

response = HumanMessage(content=human_answer)
app.update_state(config, {"messages": [response]}, as_node="ask_human")
for event in app.stream(None, config, stream_mode="values"):
    event["messages"][-1].pretty_print()
================================ Human Message =================================

The correct answer is A

Conclusion

In this example, we showed how you can train a custom router using your own data, combining LLMs with human-in-the-loop review to make your LLM-powered apps more reliable.

To try out this example, sign up to get a Not Diamond API key and run the example in this notebook.