Chapter 8: Teleprompter / Optimizer - Your Program’s Coach
Welcome to Chapter 8! In Chapter 7: Evaluate, we learned how to grade our DSPy programs using metrics and datasets to see how well they perform. That’s great for knowing our score, but what if the score isn’t high enough?
Think about building our `BasicQA` program from the last chapter. Maybe we tried running it and found it only got 75% accuracy. How do we improve it?
Traditionally, we might start manually tweaking prompts:
- “Maybe I should rephrase the instructions?”
- “Should I add some examples (few-shot demonstrations)?”
- “Which examples work best?”
This manual process, often called “prompt engineering,” can be slow, tedious, and requires a lot of guesswork. Wouldn’t it be amazing if DSPy could automatically figure out the best prompts and examples for us?
That’s exactly what Teleprompters (also called Optimizers) do! They are DSPy’s built-in automated prompt engineers and program tuners.
Think of a Teleprompter as a coach for your DSPy program (the ‘student’):
- The coach observes how the student performs on practice drills (a dataset).
- It uses feedback (a metric) to figure out weaknesses.
- It suggests new strategies (better instructions, better examples) to improve performance.
- It repeats this until the student performs much better!
In this chapter, we’ll learn:
- What a Teleprompter is and the problem it solves.
- The key ingredients needed to use a Teleprompter.
- How to use a simple Teleprompter (`BootstrapFewShot`) to automatically find good few-shot examples.
- The basic idea behind how Teleprompters optimize programs.
Let’s automate the improvement process!
What is a Teleprompter / Optimizer?
A `Teleprompter` in DSPy is an algorithm that takes your DSPy Program (the 'student') and automatically tunes its internal parameters to maximize performance on a given task. These parameters are most often:
- Instructions: The natural language guidance given to the Language Models (LMs) within your program's modules (like `dspy.Predict`).
- Few-Shot Examples (Demos): The `dspy.Example` objects provided in prompts to show the LM how to perform the task.
Some advanced Teleprompters can even fine-tune the weights of the LM itself!
To work its magic, a Teleprompter needs three things (sound familiar? They’re similar to evaluation!):
- The Student Program: The DSPy program you want to improve.
- A Training Dataset (`trainset`): A list of `dspy.Example` objects (Chapter 3: Example) representing the task. The Teleprompter will use this data to practice and learn.
- A Metric Function (`metric`): The same kind of function we used in Chapter 7: Evaluate. It tells the Teleprompter how well the student program is doing on each example in the `trainset`.

The Teleprompter uses the `metric` to guide its search for better instructions or demos, trying different combinations and keeping the ones that yield the highest score on the `trainset`. The output is an optimized version of your student program.
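In code, the whole pattern usually boils down to a single `compile` call. Here is a minimal sketch, assuming the `BasicQA` program and `simple_exact_match_metric` from Chapter 7 and a small `trainset` of `dspy.Example` objects (we walk through a concrete run below):

```python
# Minimal sketch of the general pattern: configure an optimizer with a metric,
# then compile a student program against a trainset.
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=simple_exact_match_metric)
optimized_program = teleprompter.compile(student=BasicQA(), trainset=trainset)
```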
Use Case: Automatically Finding Good Few-Shot Examples with `BootstrapFewShot`
Let's revisit our `BasicQA` program and the evaluation setup from Chapter 7.
```python
import dspy
from dspy.evaluate import Evaluate

# Assume LM is configured (e.g., dspy.settings.configure(lm=...))

# Our simple program
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predictor = dspy.Predict('question -> answer')

    def forward(self, question):
        return self.predictor(question=question)

# Our metric from Chapter 7
def simple_exact_match_metric(gold, prediction, trace=None):
    return prediction.answer.lower() == gold.answer.lower()

# Our dataset from Chapter 7 (let's use it as a trainset now)
dev_example1 = dspy.Example(question="What color is the sky?", answer="blue")
dev_example2 = dspy.Example(question="What is 2 + 2?", answer="4")
dev_example3 = dspy.Example(question="What is the capital of France?", answer="Paris")

# Example our program might struggle with initially
dev_example_hard = dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci")

trainset = [dev_example1, dev_example2, dev_example3, dev_example_hard]
trainset = [d.with_inputs('question') for d in trainset]

# Let's evaluate the initial program (likely imperfect)
initial_program = BasicQA()
evaluator = Evaluate(devset=trainset, metric=simple_exact_match_metric, display_progress=False)

initial_score = evaluator(initial_program)
print(f"Initial Score (on trainset): {initial_score}%")
# Might output: Initial Score (on trainset): 75.0% (assuming it fails the last one)
```
Our initial program gets 75%. We could try adding few-shot examples manually, but which ones? And how many?
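For comparison, the manual route would be to hand-pick `dspy.Example` objects and attach them to the predictor's `demos` attribute ourselves. A quick sketch (the particular examples chosen here are arbitrary guesses, which is exactly the problem):

```python
# Manual alternative (sketch): hand-pick a few demos and attach them ourselves.
# Deciding which examples help, and how many, is pure guesswork.
manual_program = BasicQA()
manual_program.predictor.demos = [
    dspy.Example(question="What is 2 + 2?", answer="4"),
    dspy.Example(question="What is the capital of France?", answer="Paris"),
]
```

A Teleprompter automates exactly this choice.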
Let's use `dspy.teleprompt.BootstrapFewShot`. This Teleprompter automatically creates and selects few-shot demonstrations for the predictors in your program.
1. Import the Teleprompter:
```python
from dspy.teleprompt import BootstrapFewShot
```
2. Instantiate the Teleprompter: We need to give it the `metric` function it should use to judge success. We can also specify how many candidate demos (`max_bootstrapped_demos`) it should try to find for each predictor.
```python
# Configure the BootstrapFewShot optimizer
# It will use the metric to find successful demonstrations
# max_bootstrapped_demos=4 means it will try to find up to 4 good examples for EACH predictor
config = dict(max_bootstrapped_demos=4, metric=simple_exact_match_metric)

teleprompter = BootstrapFewShot(**config)
```
3. Compile the Program: This is the main step. We call the Teleprompter's `compile` method, giving it our initial `student` program and the `trainset`. It returns a new, optimized program.
```python
# Compile the program!
# This runs the optimization process using the trainset.
# It uses a 'teacher' model (often the student itself or a copy)
# to generate traces, finds successful ones via the metric,
# and adds them as demos to the student's predictors.
compiled_program = teleprompter.compile(student=initial_program, trainset=trainset)

# The 'compiled_program' is a new instance of BasicQA,
# but its internal predictor now has few-shot examples added!
```
What just happened?
Behind the scenes, `BootstrapFewShot` (conceptually):
- Used a "teacher" program (often a copy of the student or another specified LM configuration) to run each example in the `trainset`.
- For each example, it checked if the teacher's output was correct using our `simple_exact_match_metric`.
- If an example was processed correctly, the Teleprompter saved the input/output pair as a potential "demonstration" (a good example).
- It collected these successful demonstrations.
- It assigned a selection of these good demonstrations (up to `max_bootstrapped_demos`) to the `demos` attribute of the corresponding predictor inside our `compiled_program`.
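You can verify this by peeking at the predictor's `demos` attribute (a quick sanity check; the exact demos depend on your LM):

```python
# Inspect the demonstrations that BootstrapFewShot attached to our predictor.
for demo in compiled_program.predictor.demos:
    print(demo.question, "->", demo.answer)
```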
4. Evaluate the Compiled Program: Now, let's see if the optimized program performs better on the same `trainset`.
```python
# Evaluate the compiled program
compiled_score = evaluator(compiled_program)

print(f"Compiled Score (on trainset): {compiled_score}%")
# If the optimization worked, the score should be higher!
# Might output: Compiled Score (on trainset): 100.0%
```
If `BootstrapFewShot` found good examples (like the "Mona Lisa" one after the teacher model successfully answered it), the `compiled_program` now has these examples embedded in its prompts, helping the LM perform better on similar questions. We automated the process of finding effective few-shot examples!
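To see those demonstrations inside the actual prompt, run a question through the compiled program and inspect the last LM call. The exact helper depends on your DSPy version (`dspy.inspect_history` in recent releases, `lm.inspect_history` in older ones):

```python
# Run one question, then look at the prompt that was sent to the LM.
# The bootstrapped demos should appear as few-shot examples in it.
compiled_program(question="Who wrote Hamlet?")
dspy.inspect_history(n=1)
```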
How Optimization Works (Conceptual)
Different Teleprompters use different strategies, but the core idea is usually:
- Goal: Find program parameters (instructions, demos) that maximize the `metric` score on the `trainset`.
- Search Space: The "space" of all possible instructions or combinations of demos.
- Search Strategy: How the Teleprompter explores this space.
  - `BootstrapFewShot`: Generates candidate demos based on successful teacher executions.
  - Other optimizers (like `COPRO` or `MIPROv2`) might use an LM to propose new instructions, evaluate them, and iterate. Some use sophisticated search algorithms like Bayesian Optimization or random search.
- Evaluation: Use the `metric` and `trainset` to score each candidate configuration (e.g., a program with specific demos or instructions).
- Selection: Keep the configuration that resulted in the best score.
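In pseudocode, this generic search loop looks roughly like the sketch below. It is not any specific DSPy optimizer; `apply_config` is a hypothetical helper standing in for "attach these demos / use these instructions":

```python
# Generic optimize-by-search sketch (not a real DSPy optimizer):
# score every candidate configuration on the trainset and keep the best one.
def naive_optimize(student, trainset, metric, candidate_configs):
    best_program, best_score = student, float("-inf")
    for config in candidate_configs:       # e.g., different demo subsets or instructions
        candidate = student.deepcopy()     # work on a copy of the student program
        apply_config(candidate, config)    # hypothetical: attach the demos/instructions
        score = sum(metric(ex, candidate(**ex.inputs())) for ex in trainset) / len(trainset)
        if score > best_score:
            best_program, best_score = candidate, score
    return best_program
```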
Analogy Revisited:
- Coach: The Teleprompter algorithm (`BootstrapFewShot`).
- Student: Your DSPy `Program` (`initial_program`).
- Practice Drills: The `trainset`.
- Scoring: The `metric` function (`simple_exact_match_metric`).
- Trying Techniques: Generating/selecting different demos or instructions.
- Adopting Best Techniques: Creating the `compiled_program` with the highest-scoring demos/instructions found.
How It Works Under the Hood (`BootstrapFewShot` Peek)
Let's briefly look at the internal flow for `BootstrapFewShot.compile()`:
- Prepare Teacher: It sets up a 'teacher' program. This is often a copy of the student program, sometimes configured with specific settings (like a higher temperature for more exploration) or potentially using labeled examples if provided (`LabeledFewShot` within `BootstrapFewShot`).
- Iterate Trainset: It goes through each `example` in the `trainset`.
- Teacher Execution: For each `example`, it runs the `teacher` program (`teacher(**example.inputs())`). This happens within a `dspy.settings.context` block to capture the execution `trace`.
- Metric Check: It uses the provided `metric` to compare the `teacher`'s prediction against the `example`'s gold label (`metric(example, prediction, trace)`).
- Collect Demos: If the `metric` returns success (e.g., `True` or a score above a threshold), the Teleprompter extracts the input/output steps from the execution `trace`. Each successful trace step can become a candidate `dspy.Example` demonstration.
- Assign Demos: After iterating through the `trainset`, it takes the collected successful demonstrations (up to `max_bootstrapped_demos`) and assigns them to the `demos` attribute of the corresponding predictors in the `student` program instance.
- Return Compiled Student: It returns the modified `student` program, which now contains the bootstrapped few-shot examples.
```mermaid
sequenceDiagram
    participant User
    participant Teleprompter as BootstrapFewShot
    participant StudentProgram as Student Program
    participant TeacherProgram as Teacher Program
    participant LM as Language Model
    participant Metric as Metric Function
    participant CompiledProgram as Compiled Program (Student with Demos)

    User->>Teleprompter: compile(student=StudentProgram, trainset=...)
    Teleprompter->>TeacherProgram: Set up (copy of student, potentially modified)
    loop For each example in trainset
        Teleprompter->>TeacherProgram: Run example.inputs()
        TeacherProgram->>LM: Make calls (via Predictors)
        LM-->>TeacherProgram: Return predictions
        TeacherProgram-->>Teleprompter: Return final prediction & trace
        Teleprompter->>Metric: Evaluate(example, prediction, trace)
        Metric-->>Teleprompter: Return score (success/fail)
        alt Metric returns success
            Teleprompter->>Teleprompter: Extract demo from trace
        end
    end
    Teleprompter->>StudentProgram: Assign selected demos to predictors
    StudentProgram-->>CompiledProgram: Create compiled version
    Teleprompter-->>User: Return CompiledProgram
```
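Notice in the `compile` signature below that a separate `teacher` program can be passed in explicitly. That lets you bootstrap demonstrations with a stronger (or differently configured) program while keeping the cheap student for deployment. A hedged sketch, assuming a `stronger_lm` client you have already set up and a DSPy version whose modules expose `set_lm()`:

```python
# Sketch: bootstrap demos with a teacher that uses a stronger LM,
# then deploy the cheaper compiled student.
teacher_program = BasicQA()
teacher_program.set_lm(stronger_lm)  # assumption: set_lm() is available in your DSPy version

compiled_with_teacher = teleprompter.compile(
    student=BasicQA(),
    teacher=teacher_program,
    trainset=trainset,
)
```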
Relevant Code Files:
- `dspy/teleprompt/teleprompt.py`: Defines the base `Teleprompter` class.
- `dspy/teleprompt/bootstrap.py`: Contains the implementation for `BootstrapFewShot`. Key methods include `compile` (orchestrates the process) and `_bootstrap_one_example` (handles running the teacher and checking the metric for a single training example).
```python
# Simplified view from dspy/teleprompt/bootstrap.py
# ... imports ...
from .teleprompt import Teleprompter
from .vanilla import LabeledFewShot  # Used for teacher setup if labeled demos are needed

import dspy
import tqdm  # needed for the progress bar below

class BootstrapFewShot(Teleprompter):
    def __init__(self, metric=None, max_bootstrapped_demos=4, ...):
        self.metric = metric
        self.max_bootstrapped_demos = max_bootstrapped_demos
        # ... other initializations ...

    def compile(self, student, *, teacher=None, trainset):
        self.trainset = trainset
        self._prepare_student_and_teacher(student, teacher)  # Sets up self.student and self.teacher
        self._prepare_predictor_mappings()  # Links student predictors to teacher predictors
        self._bootstrap()  # Runs the core bootstrapping logic
        self.student = self._train()  # Assigns collected demos to the student
        self.student._compiled = True
        return self.student

    def _bootstrap(self):
        # ... setup ...
        self.name2traces = {name: [] for name in self.name2predictor}  # Store successful traces per predictor

        for example_idx, example in enumerate(tqdm.tqdm(self.trainset)):
            # ... logic to stop early if enough demos found ...
            success = self._bootstrap_one_example(example, round_idx=0)  # Try to get a demo from this example
            # ... potentially multiple rounds ...
        # ... logging ...

    def _bootstrap_one_example(self, example, round_idx=0):
        # ... setup teacher context (e.g., temperature) ...
        try:
            with dspy.settings.context(trace=[], **self.teacher_settings):
                # Optionally modify teacher LM settings for exploration
                # ...

                # Run the teacher program
                prediction = self.teacher(**example.inputs())
                trace = dspy.settings.trace  # Get the execution trace

                # Evaluate the prediction using the metric
                if self.metric:
                    metric_val = self.metric(example, prediction, trace)
                    # Determine success based on metric value/threshold
                    success = bool(metric_val)  # Simplified
                else:
                    success = True  # Assume success if no metric provided
        except Exception:
            success = False
            # ... error handling ...

        if success:
            # If successful, extract demos from the trace
            for step in trace:
                predictor, inputs, outputs = step
                demo = dspy.Example(augmented=True, **inputs, **outputs)

                try:
                    predictor_name = self.predictor2name[id(predictor)]
                    # Store the successful demo example
                    self.name2traces[predictor_name].append(demo)
                except KeyError:
                    continue  # Handle potential issues finding the predictor

        return success

    def _train(self):
        # Assign the collected demos to the student's predictors
        for name, predictor in self.student.named_predictors():
            demos_for_predictor = self.name2traces[name][:self.max_bootstrapped_demos]
            # Potentially mix with labeled demos if configured
            # ...
            predictor.demos = demos_for_predictor  # Assign the demos!

        return self.student
```
This simplified view shows the core loop: run the teacher, check the metric, collect successful traces as demos, and finally assign those demos to the student program.
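The real class also exposes a few more knobs than shown above. Parameter availability can vary between DSPy versions, so treat this as an indicative configuration rather than the full signature:

```python
# Indicative BootstrapFewShot configuration (availability may vary by version):
teleprompter = BootstrapFewShot(
    metric=simple_exact_match_metric,
    max_bootstrapped_demos=4,   # demos generated by running the teacher
    max_labeled_demos=16,       # raw labeled examples mixed in from the trainset
    max_rounds=1,               # how many bootstrapping passes to attempt
    teacher_settings=dict(),    # settings applied to the teacher (e.g., a different LM)
)
```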
Conclusion
You’ve now learned about DSPy’s Teleprompters / Optimizers, the powerful tools for automating prompt engineering!
- Teleprompters act like coaches, automatically tuning your DSPy programs (students).
- They optimize parameters like instructions and few-shot examples (demos).
- They require a student program, a training dataset, and a metric function.
- We saw how `BootstrapFewShot` automatically finds effective few-shot examples by running a teacher model and collecting successful execution traces.
- The result of `teleprompter.compile()` is an optimized program instance, ready to be used or evaluated further.
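Because the optimized program is an ordinary DSPy module, you can usually persist its state and reload it later instead of re-compiling (a sketch; method names may differ slightly across DSPy versions):

```python
# Save the optimized program's state (including its bootstrapped demos).
compiled_program.save("compiled_basic_qa.json")

# Later: load that state into a fresh instance instead of re-running compile().
reloaded_program = BasicQA()
reloaded_program.load("compiled_basic_qa.json")
```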
Teleprompters save you from the tedious process of manual tuning, allowing you to build high-performing LM-based programs more efficiently.
Now that we understand how to build, evaluate, and automatically optimize DSPy programs, how can we make them interact smoothly with different data formats or models, especially when integrating with other systems? That’s where Adapters come in.
Next: Chapter 9: Adapter
Generated by AI Codebase Knowledge Builder