Chapter 7: Evaluate - Grading Your Program

In the previous chapter, Chapter 6: RM (Retrieval Model Client), we learned how to connect our DSPy program to external knowledge sources using Retrieval Models (RMs). We saw how combining RMs with Language Models (LMs) allows us to build sophisticated programs like Retrieval-Augmented Generation (RAG) systems.

Now that we can build these powerful programs, a crucial question arises: How good are they? If we build a RAG system to answer questions, how often does it get the answer right? How do we measure its performance objectively?

This is where dspy.Evaluate comes in! It’s DSPy’s built-in tool for testing and grading your programs.

Think of dspy.Evaluate as:

  • An Automated Grader: Like a teacher grading a batch of homework assignments based on an answer key.
  • A Test Suite Runner: Similar to how software developers use test suites to check if their code works correctly.
  • Your Program’s Report Card: It gives you a score that tells you how well your DSPy program is performing on a specific set of tasks.

In this chapter, you’ll learn:

  • What you need to evaluate a DSPy program.
  • How to define a metric (a grading rule).
  • How to use dspy.Evaluate to run the evaluation and get a score.
  • How it works behind the scenes.

Let’s learn how to grade our DSPy creations!

The Ingredients for Evaluation

To grade your program using dspy.Evaluate, you need three main ingredients:

  1. Your DSPy Program: The program you want to test. This could be a simple dspy.Predict module or a complex multi-step program like the SimpleRAG we sketched out in the last chapter.
  2. A Dataset (devset): A list of dspy.Example objects (Chapter 3: Example). Crucially, these examples must contain not only the inputs your program expects but also the gold standard outputs (the correct answers or desired results) that you want to compare against. This dataset is often called a “development set” or “dev set”.
  3. A Metric Function (metric): A Python function you define. This function takes one gold standard Example and the Prediction generated by your program for that example's inputs (plus an optional trace argument, explained below). It then compares them and returns a score indicating how well the prediction matched the gold standard. The score is often 1.0 for a perfect match and 0.0 for a mismatch, but it can also be a fractional score (e.g., for F1 score).

dspy.Evaluate takes these three ingredients, runs your program on all examples in the dataset, uses your metric function to score each prediction against the gold standard, and finally reports the average score across the entire dataset.
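
To make that concrete, here is a minimal hand-rolled sketch of the same loop, assuming a program, devset, and metric as described above. dspy.Evaluate adds parallelism, error handling, and reporting on top of this basic idea.

# Simplified sketch of what an evaluation loop does conceptually
def evaluate_by_hand(program, devset, metric):
    scores = []
    for example in devset:
        prediction = program(**example.inputs())    # run the program on the example's inputs
        scores.append(metric(example, prediction))  # grade the prediction against the gold labels
    return 100 * sum(scores) / len(scores)          # average score, as a percentage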

Evaluating a Simple Question Answering Program

Let’s illustrate this with a simple example. Suppose we have a basic DSPy program that’s supposed to answer simple questions.

import dspy

# Assume we have configured an LM client (Chapter 5)
# gpt3_turbo = dspy.LM(model='openai/gpt-3.5-turbo')
# dspy.settings.configure(lm=gpt3_turbo)

# A simple program using dspy.Predict (Chapter 4)
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # Use a simple signature: question -> answer
        self.predictor = dspy.Predict('question -> answer')

    def forward(self, question):
        return self.predictor(question=question)

# Create an instance of our program
qa_program = BasicQA()
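
Before grading a whole dataset, it's worth sanity-checking the program on a single question. The exact output depends on the LM you configured, so the answer shown in the comment is only illustrative.

# Quick sanity check on one question (the answer depends on your configured LM)
prediction = qa_program(question="What color is the sky?")
print(prediction.answer)  # e.g. "blue"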

Now, let’s prepare the other ingredients for evaluation.

1. Prepare the Dataset (devset)

We need a list of dspy.Example objects, each containing a question (input) and the correct answer (gold standard output).

# Create example data points with questions and gold answers
dev_example1 = dspy.Example(question="What color is the sky?", answer="blue")
dev_example2 = dspy.Example(question="What is 2 + 2?", answer="4")
dev_example3 = dspy.Example(question="What is the capital of France?", answer="Paris")
dev_example_wrong = dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare") # Let's assume our QA program might get this wrong

# Create the development set (list of examples)
devset = [dev_example1, dev_example2, dev_example3, dev_example_wrong]

# We need to tell DSPy which fields are inputs vs outputs for evaluation
# The .with_inputs() method marks the input keys.
# The remaining keys ('answer' in this case) are treated as labels.
devset = [d.with_inputs('question') for d in devset]

Here, we’ve created a small dataset devset with four question-answer pairs. We used .with_inputs('question') to mark the question field as the input; dspy.Evaluate will automatically treat the remaining field (answer) as the gold label to compare against.
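
If you want to see this split for yourself, dspy.Example exposes inputs() and labels() helpers; a quick check on the first example:

# Inspect how one example is split into inputs and gold labels
example = devset[0]
print(example.inputs())  # only the 'question' field
print(example.labels())  # only the 'answer' field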

2. Define a Metric Function (metric)

We need a function that compares the program’s predicted answer to the gold answer in an example. Let’s create a simple “exact match” metric.

def simple_exact_match_metric(gold_example, prediction, trace=None):
    # Does the predicted 'answer' EXACTLY match the gold 'answer'?
    # '.answer' field comes from our Predict signature 'question -> answer'
    # 'gold_example.answer' is the gold label from the devset example
    return prediction.answer == gold_example.answer

# Note: DSPy also ships common metrics, like dspy.evaluate.answer_exact_match
# import dspy.evaluate
# metric = dspy.evaluate.answer_exact_match

Our simple_exact_match_metric function takes the gold dspy.Example (gold_example) and the program’s output dspy.Prediction (prediction). It returns True (which Python treats as 1.0) if the predicted answer matches the gold answer, and False (0.0) otherwise. The trace argument is optional and can be ignored for basic metrics; it sometimes contains information about the program’s execution steps.
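
If you later reuse a metric inside a DSPy optimizer, the trace argument will be non-None during compilation, and metrics are commonly written to return a strict pass/fail decision in that case. Here is a small sketch of that pattern (the case-insensitive comparison is just an illustrative choice):

def flexible_metric(gold_example, prediction, trace=None):
    # Fractional-friendly score for evaluation, strict pass/fail during optimization
    score = float(prediction.answer.lower() == gold_example.answer.lower())
    if trace is not None:   # called from an optimizer / compile step
        return score >= 1.0 # demand a full match
    return score            # regular evaluation keeps the numeric score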

3. Create and Run dspy.Evaluate

Now we have all the ingredients: qa_program, devset, and simple_exact_match_metric. Let’s use dspy.Evaluate.

from dspy.evaluate import Evaluate

# 1. Create the Evaluator instance
evaluator = Evaluate(
    devset=devset,            # The dataset to evaluate on
    metric=simple_exact_match_metric, # The function to score predictions
    num_threads=4,            # Run 4 evaluations in parallel (optional)
    display_progress=True,    # Show a progress bar (optional)
    display_table=True        # Display results in a table (optional)
)

# 2. Run the evaluation by calling the evaluator with the program
# This will run qa_program on each example in devset,
# score it using simple_exact_match_metric, and return the average score.
average_score = evaluator(qa_program)

print(f"Average Score: {average_score}%")

What happens here?

  1. We create an Evaluate object, providing our dataset and metric. We also request parallel execution (num_threads=4) for speed and ask for progress/table display.
  2. We call the evaluator instance with our qa_program.
  3. Evaluate iterates through devset:
    • For dev_example1, it calls qa_program(question="What color is the sky?"). Let’s assume the program predicts answer="blue".
    • It calls simple_exact_match_metric(dev_example1, predicted_output). Since "blue" == "blue", the score is 1.0.
    • It does the same for dev_example2 (input: “What is 2 + 2?”). Assume prediction is answer="4". Score: 1.0.
    • It does the same for dev_example3 (input: “What is the capital of France?”). Assume prediction is answer="Paris". Score: 1.0.
    • It does the same for dev_example_wrong (input: “Who wrote Hamlet?”). Maybe the simple LM messes up and predicts answer="William Shakespeare". Since "William Shakespeare" != "Shakespeare", the score is 0.0.
  4. Evaluate calculates the average score: (1.0 + 1.0 + 1.0 + 0.0) / 4 = 0.75.
  5. It prints the average score as a percentage.

Expected Output:

A progress bar will be shown (if tqdm is installed), followed by a table like this (requires pandas):

Average Metric: 3 / 4  (75.0%)

    question                          answer       simple_exact_match_metric
  0 What color is the sky?            blue         ✔️ [True]
  1 What is 2 + 2?                    4            ✔️ [True]
  2 What is the capital of France?    Paris        ✔️ [True]
  3 Who wrote Hamlet?                 Shakespeare  False

(Note: the exact columns and formatting vary slightly between DSPy versions; the table typically also includes the program's predicted answer for each row.)

And finally:

Average Score: 75.0%

This tells us our simple QA program achieved 75% accuracy on our small development set using the exact match criterion.
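
Exact match is deliberately strict: "William Shakespeare" gets no credit even though it contains the gold answer. If that is too harsh for your task, you can write a more lenient metric yourself, for example a containment check like the sketch below (not a DSPy built-in):

def lenient_contains_metric(gold_example, prediction, trace=None):
    # Give credit if the gold answer appears anywhere in the prediction (case-insensitive)
    return gold_example.answer.lower() in prediction.answer.lower()

# With this metric, the prediction "William Shakespeare" would count as correct
# for the gold answer "Shakespeare", raising the score to 100% on our small devset.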

Getting More Details (Optional Flags)

Sometimes, just the average score isn’t enough. You might want to see the score for each individual example or the actual predictions made by the program. Evaluate provides flags for this:

  • return_all_scores=True: Returns the average score and a list containing the individual score for each example.
  • return_outputs=True: Returns the average score and a list of tuples, where each tuple contains (example, prediction, score).

# Re-run evaluation asking for more details
evaluator_detailed = Evaluate(devset=devset, metric=simple_exact_match_metric)

# Get individual scores
avg_score, individual_scores = evaluator_detailed(qa_program, return_all_scores=True)
print(f"Individual Scores: {individual_scores}") # Output: [True, True, True, False]

# Get full outputs
avg_score, outputs_list = evaluator_detailed(qa_program, return_outputs=True)
# outputs_list[0] would be roughly: (dev_example1, Prediction(answer='blue'), True)
# outputs_list[3] would be roughly: (dev_example_wrong, Prediction(answer='William Shakespeare'), False)
print(f"Number of outputs returned: {len(outputs_list)}") # Output: 4

These flags are useful for more detailed error analysis to understand where your program is failing.
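
For instance, with return_outputs=True you can pull out just the failing examples for inspection. A small sketch building on outputs_list from above:

# Collect the examples the program got wrong for closer inspection
failures = [(ex, pred) for ex, pred, score in outputs_list if not score]
for ex, pred in failures:
    print(f"Question:  {ex.question}")
    print(f"Gold:      {ex.answer}")
    print(f"Predicted: {pred.answer}")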

How It Works Under the Hood

What happens internally when you call evaluator(program)?

  1. Initialization: The Evaluate instance stores the devset, metric, num_threads, and other settings.
  2. Parallel Executor: It creates a ParallelExecutor (if num_threads > 1) to manage running the evaluations concurrently.
  3. Iteration: It iterates through each example in the devset.
  4. Program Execution: For each example, it calls program(**example.inputs()) (e.g., qa_program(question=example.question)). This runs your DSPy program’s forward method to get a prediction.
  5. Metric Calculation: It calls the provided metric function, passing it the original example (which contains the gold labels) and the prediction object returned by the program (e.g., metric(example, prediction)). This yields a score.
  6. Error Handling: If running the program or the metric raises an error for a specific example, Evaluate catches it (up to max_errors), records a default failure_score (usually 0.0), and continues with the rest of the dataset (see the short sketch after this list for how to set these).
  7. Aggregation: It collects all the individual scores (including failure scores).
  8. Calculate Average: It computes the average score by summing all scores and dividing by the total number of examples in the devset.
  9. Return Results: It returns the average score (and optionally the individual scores or full output tuples based on the flags).
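
The error-handling knobs from step 6 are ordinary constructor arguments. A hedged sketch of setting them (argument names follow the simplified code further below; check your DSPy version for the exact signature):

# Tolerate a few failing examples instead of aborting the run,
# and count each failure as 0.0 toward the average score.
robust_evaluator = Evaluate(
    devset=devset,
    metric=simple_exact_match_metric,
    max_errors=5,       # give up only after 5 errors
    failure_score=0.0,  # score recorded for examples that raised an error
)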

Here’s a simplified sequence diagram:

sequenceDiagram
    participant User
    participant Evaluator as dspy.Evaluate
    participant Executor as ParallelExecutor
    participant Program as Your DSPy Program
    participant Metric as Your Metric Function

    User->>Evaluator: __call__(program)
    Evaluator->>Executor: Create (manages threads)
    loop For each example in devset
        Executor->>Executor: Assign task to a thread
        Note over Executor, Program: In parallel thread:
        Executor->>Program: Call program(**example.inputs())
        Program-->>Executor: Return prediction
        Executor->>Metric: Call metric(example, prediction)
        Metric-->>Executor: Return score
    end
    Executor->>Evaluator: Collect all results (predictions, scores)
    Evaluator->>Evaluator: Calculate average score
    Evaluator-->>User: Return average score (and other requested data)

Relevant Code Files:

  • dspy/evaluate/evaluate.py: Defines the Evaluate class.
    • The __init__ method stores the configuration.
    • The __call__ method orchestrates the evaluation: sets up the ParallelExecutor, defines the process_item function (which runs the program and metric for one example), executes it over the devset, aggregates results, and handles display/return logic.
  • dspy/utils/parallelizer.py: Contains the ParallelExecutor class used for running tasks concurrently across multiple threads or processes.
  • dspy/evaluate/metrics.py: Contains implementations of common metrics like answer_exact_match.

# Simplified view from dspy/evaluate/evaluate.py

# ... imports ...
from dspy.utils.parallelizer import ParallelExecutor

class Evaluate:
    def __init__(self, devset, metric, num_threads=1, ..., failure_score=0.0):
        self.devset = devset
        self.metric = metric
        self.num_threads = num_threads
        self.display_progress = ...
        self.display_table = ...
        # ... store other flags ...
        self.failure_score = failure_score

    # @with_callbacks # Decorator handles optional logging/callbacks
    def __call__(self, program, metric=None, devset=None, ...):
        # Use provided args or fall back to instance attributes
        metric = metric if metric is not None else self.metric
        devset = devset if devset is not None else self.devset
        num_threads = ... # Similar logic for other args

        # Create executor for parallelism
        executor = ParallelExecutor(num_threads=num_threads, ...)

        # Define the work to be done for each example
        def process_item(example):
            try:
                # Run the program with the example's inputs
                prediction = program(**example.inputs())
                # Call the metric function with the gold example and prediction
                score = metric(example, prediction)
                return prediction, score
            except Exception as e:
                # Handle errors during program/metric execution
                # Log error, return None or failure score
                print(f"Error processing example: {e}")
                return None # Executor will handle None later

        # Execute process_item for all examples in devset using the executor
        raw_results = executor.execute(process_item, devset)

        # Process results, handle failures (replace None with failure score)
        results = []
        for i, r in enumerate(raw_results):
            example = devset[i]
            if r is None: # Execution failed for this example
                prediction, score = dspy.Prediction(), self.failure_score
            else:
                prediction, score = r
            results.append((example, prediction, score))

        # Calculate the average score
        total_score = sum(score for *_, score in results)
        num_examples = len(devset)
        average_score = round(100 * total_score / num_examples, 2) if num_examples > 0 else 0

        # Display table if requested
        if self.display_table:
             self._display_result_table(...) # Internal helper function

        # Return results based on flags (return_all_scores, return_outputs)
        # ... logic to construct return tuple ...
        return average_score # Base return value

The core logic involves running the program and the metric function for each data point, handling potential errors, and averaging the results, with parallel processing to speed things up.

Conclusion

You’ve now learned about dspy.Evaluate, the standard way to measure the performance of your DSPy programs!

  • Evaluate acts as an automated grader for your DSPy modules.
  • It requires three ingredients: your program, a dataset (devset) with gold labels, and a metric function to compare predictions against labels.
  • It runs the program on the dataset, applies the metric, and reports the average score.
  • It supports parallel execution for speed and offers options to display progress, show results tables, and return detailed outputs.

Knowing how well your program performs is essential. But what if the score isn’t good enough? How can we improve the program, perhaps by automatically finding better prompts or few-shot examples?

That’s precisely what Teleprompters (Optimizers) are designed for! Let’s dive into how DSPy can help automatically optimize your programs next.

Next: Chapter 8: Teleprompter / Optimizer


Generated by AI Codebase Knowledge Builder