Versalist guides
Evaluation · Intermediate · Code lab
DSPy: Programming Language Models

Short, practical guidance for DSPy programming and GEPA-style prompt optimization.

Best for

Teams with recurring prompt tasks and enough eval signal to justify compile-time optimization.


Best when quality debates need to turn into measurable checks.

Outcome
Compose DSPy modules that optimize prompts automatically against measurable evals.
Guide map
4 min read · 5 of 7 in track
Focus
DSPy · Optimization · Programmatic prompts
Prerequisites
Python familiarity · A measurable prompt or agent task
You leave with
Baseline module template · Compile loop · Optimizer selection guide

DSPy is valuable because it moves LLM work away from hand-edited prompt folklore and toward programmable systems with signatures, modules, metrics, and optimizers. The current DSPy framing is not "prompt engineering, but cleaner." It is "declare the behavior, then compile the program against a real eval."

Current framing
Think in signatures, modules, metrics, and optimizers
The current DSPy docs emphasize a small set of durable abstractions: signatures define the task contract, modules compose behavior, metrics define success, and optimizers such as MIPROv2 and GEPA search for stronger prompt-and-example configurations. The prompt is the compiled artifact of the program, not the program itself. The older "teleprompter" language still appears in historical examples, but current DSPy docs and tutorials center optimizer terminology.
Baseline
Readable first

If the unoptimized program is confusing, compile time will only hide the confusion.

Scoring
Metric first

The optimizer can only improve what the metric can recognize.

Data split
Train, dev, holdout

Compile on one slice, validate on another, and keep a holdout for release confidence.

Upgrade path
Optimizer choice

Use simpler optimizers for narrow tasks and heavier search only when the metric and data justify it.

1. When DSPy is the right abstraction

DSPy is not the default answer to every prompt problem. It is most useful when the task runs repeatedly, the output quality is measurable, and the program has enough structure that compilation can beat manual wording tweaks.

Situation | Better default | Reason
You are exploring a one-off idea or debugging a tiny task | Stay with a direct prompt or a thin wrapper | DSPy adds ceremony when the task does not repeat enough to justify optimization.
You have a recurring task with a clear output contract | Use DSPy signatures and a readable baseline module | You get a stable task contract and a foundation for optimization.
You can build train, dev, and holdout examples | Use DSPy with a real metric and optimizer | This is where compile-time search starts to outperform vibe-based prompt editing.
You do not have an eval set or the task is still moving weekly | Do not optimize yet | Compiling against a weak metric or unstable spec only formalizes noise.

2. Core abstractions to learn first

Task contract
Signatures describe the shape of work
A signature tells DSPy what inputs the model should see and what outputs you expect back. This is the first thing to get right.
Keep input fields narrow and explicit
Use output fields that can be graded
Treat descriptions as task documentation
Program structure
Modules compose model behavior
A module lets you compose predictors, retrieval steps, tool use, and helper logic without losing the program shape.
Use small modules with clear boundaries
Compose rather than stuffing everything into one signature
Inspect intermediate outputs during development
Measurement
Metrics decide what "better" means
DSPy is only as good as the metric you hand it. Weak metrics optimize the wrong thing very efficiently.
Prefer deterministic checks where possible
Use held-out data to check generalization
Track quality, cost, and latency separately
Optimization
Optimizers search the prompt-and-demo space
Current DSPy docs and tutorials put MIPROv2 and GEPA at the center of serious optimization work. Use them after you have a real metric and data split.
Start with a baseline before optimizing
Inspect compiled programs, do not treat them as black boxes
Recompile when the task or model changes materially

3. Minimal modern example

The baseline workflow is still simple: configure an LM, define a signature, wrap it in a module, and run a task. What changes with DSPy is what happens next: you can now optimize the behavior systematically.

Minimal baseline
Readable program first, optimizer second
This is the kind of baseline you should be able to explain out loud before you compile anything.
```python
import dspy

# Configure the language model backend once, globally.
lm = dspy.LM("provider/model")
dspy.configure(lm=lm)

class SupportTriage(dspy.Signature):
    """Classify the ticket and propose the next operator action."""
    ticket: str = dspy.InputField()
    severity: str = dspy.OutputField(desc="high | medium | low")
    next_action: str = dspy.OutputField()

class TriageProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        # One predictor for now; composition comes later if the task demands it.
        self.triage = dspy.Predict(SupportTriage)

    def forward(self, ticket: str):
        return self.triage(ticket=ticket)

program = TriageProgram()
result = program(ticket="Customer payment failed twice and dashboard access is blocked.")
```
Important
Do not optimize before the baseline is legible
If you cannot explain what the baseline program is supposed to do, which cases it fails, and how the metric scores it, optimization is premature. DSPy helps disciplined teams; it does not rescue vague tasks.

4. The compile loop that actually works

1
Define the task contract
Write signatures and modules that keep the program understandable. This is where most quality is won or lost.
2
Assemble train and dev sets
Use representative examples with edge cases. Keep a holdout split so you can detect overfitting during compile time.
3
Write a metric that reflects reality
Prefer deterministic scoring where possible; use rubric or model grading only when the task truly requires judgment.
4
Compile with an optimizer
Use a current DSPy optimizer such as MIPROv2 or GEPA once the baseline and metric are trustworthy.
5
Inspect the compiled artifact
Read the resulting instructions and examples. The compiled program is not sacred; it is an artifact you should understand.
6
Validate on holdout and ship with versioning
Track the compiled program version, the model backend, and the eval result together. Otherwise rollback gets messy fast.
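The six steps above can be sketched end to end. Treat this as an illustrative outline, not a definitive recipe: it assumes DSPy's `MIPROv2`, `Evaluate`, and `Module.save` APIs, and the split fractions, metric, and filename are placeholder choices:

```python
def make_splits(examples, dev_frac=0.2, holdout_frac=0.2, seed=13):
    """Shuffle once with a fixed seed, then carve train / dev / holdout slices."""
    import random
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[n_holdout + n_dev:]
    dev = shuffled[n_holdout:n_holdout + n_dev]
    holdout = shuffled[:n_holdout]
    return train, dev, holdout

def severity_exact_match(example, pred, trace=None):
    # Deterministic scoring: exact severity match only.
    return float(pred.severity == example.severity)

def compile_and_validate(program, examples):
    import dspy  # deferred import so the pure helpers above stay dependency-free
    train, dev, holdout = make_splits(examples)
    optimizer = dspy.MIPROv2(metric=severity_exact_match, auto="light")
    compiled = optimizer.compile(program, trainset=train, valset=dev)
    # Score on the untouched holdout before any release decision.
    score = dspy.Evaluate(devset=holdout, metric=severity_exact_match)(compiled)
    compiled.save("triage_v1.json")  # version with model, metric, and data snapshot
    return compiled, score
```

The point of the structure is that the holdout slice is never seen by the optimizer, so the final score is evidence about generalization rather than a replay of compile-time search.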

5. What to inspect after compile

Instructions
Read the compiled prompt like production code
Do the instructions still reflect the task you meant to solve, or did the optimizer learn shortcuts that only work on the dev slice?
Examples
Check whether demos are aligned or suspiciously narrow
If compiled examples over-index on one phrasing pattern or one happy path, you may be overfitting to the train split.
Failure slices
Compare aggregate gains against specific regressions
A higher mean score can still hide retrieval collapse, schema drift, or worse performance on edge cases. Inspect slice-level failures explicitly.
Backend sensitivity
Assume compiled programs are model-sensitive artifacts
Model swaps, context-window changes, and provider behavior shifts all warrant a recompile-and-validate loop.

6. Choose the optimizer by failure mode

Optimizer style | Best use | Caution
Bootstrap-style few-shot optimization | Good first pass when the task is narrow and the metric is simple | It can plateau quickly if the task needs better instruction search.
MIPROv2-style instruction and demo search | Good default when you want serious prompt-and-example optimization without writing a custom search loop | Needs a meaningful dev set and can optimize the wrong behavior if the metric is weak.
GEPA | Strong when you care about multi-objective tradeoffs or richer search over program variants | More power does not excuse weak metrics. Garbage metrics still poison the frontier.

7. When DSPy is overkill

No eval loop
You cannot score quality yet
If there is no trustworthy metric or labeled set, you do not have enough signal to justify optimization.
Task churn
The workflow is still changing too fast
If the task contract, schema, or business rules are still moving every week, keep the baseline thin until the system stabilizes.
Tiny surface area
A plain prompt already clears the bar
When the problem is small, deterministic, and already readable, DSPy can add more abstraction than value.
No artifact discipline
Your team will not version compiled outputs
If the compiled program, backend, metric, and dataset snapshot will not be tracked together, rollback and debugging get messy fast.

8. Sharp checklist

  • Start with a clear signature and a readable baseline module.
  • Use DSPy only when the task repeats enough and the output quality is measurable.
  • Use real train/dev data, not a handful of cherry-picked happy-path examples.
  • Invest in the metric before you invest in optimizer settings.
  • Inspect compiled programs instead of treating optimization as magic.
  • Track quality, cost, and latency separately.
  • Version compiled artifacts and revalidate them on model changes.
Where to go next
DSPy gets stronger when paired with evaluation and workflow design
Pair this with Evaluation to design better metrics, Mastering RAG if your program depends on retrieval, and Agentic RFT if you want to think beyond prompt optimization into policy improvement loops.
