DSPy is valuable because it moves LLM work away from hand-edited prompt folklore and toward programmable systems with signatures, modules, metrics, and optimizers. The current DSPy framing is not "prompt engineering, but cleaner." It is "declare the behavior, then compile the program against a real eval."
Current framing
Think in signatures, modules, metrics, and optimizers
The current DSPy docs emphasize a small set of durable abstractions: signatures define the task contract, modules compose behavior, metrics define success, and optimizers such as MIPROv2 and GEPA search for stronger prompt-and-example configurations. The prompt is the compiled artifact of the program, not the program itself. The older "teleprompter" language still appears in historical examples, but current DSPy docs and tutorials center optimizer terminology.
Baseline
Readable first
If the unoptimized program is confusing, compile time will only hide the confusion.
Scoring
Metric first
The optimizer can only improve what the metric can recognize.
Data split
Train, dev, holdout
Compile on one slice, validate on another, and keep a holdout for release confidence.
Upgrade path
Optimizer choice
Use simpler optimizers for narrow tasks and heavier search only when the metric and data justify it.
1. When DSPy is the right abstraction
DSPy is not the default answer to every prompt problem. It is most useful when the task runs repeatedly, the output quality is measurable, and the program has enough structure that compilation can beat manual wording tweaks.
| Situation | Better default | Reason |
| --- | --- | --- |
| You are exploring a one-off idea or debugging a tiny task | Stay with a direct prompt or a thin wrapper | DSPy adds ceremony when the task does not repeat enough to justify optimization. |
| You have a recurring task with a clear output contract | Use DSPy signatures and a readable baseline module | You get a stable task contract and a foundation for optimization. |
| You can build train, dev, and holdout examples | Use DSPy with a real metric and optimizer | This is where compile-time search starts to outperform vibe-based prompt editing. |
| You do not have an eval set or the task is still moving weekly | Do not optimize yet | Compiling against a weak metric or unstable spec only formalizes noise. |
2. Core abstractions to learn first
Task contract
Signatures describe the shape of work
A signature tells DSPy what inputs the model should see and what outputs you expect back. This is the first thing to get right.
- Keep input fields narrow and explicit
- Use output fields that can be graded
- Treat descriptions as task documentation
Program structure
Modules compose model behavior
A module lets you compose predictors, retrieval steps, tool use, and helper logic without losing the program shape.
- Use small modules with clear boundaries
- Compose rather than stuffing everything into one signature
- Inspect intermediate outputs during development
Measurement
Metrics decide what "better" means
DSPy is only as good as the metric you hand it. Weak metrics optimize the wrong thing very efficiently.
- Prefer deterministic checks where possible
- Use held-out data to check generalization
- Track quality, cost, and latency separately
Optimization
Optimizers search the prompt-and-demo space
Current DSPy docs and tutorials put MIPROv2 and GEPA at the center of serious optimization work. Use them after you have a real metric and data split.
- Start with a baseline before optimizing
- Inspect compiled programs, do not treat them as black boxes
- Recompile when the task or model changes materially
3. Minimal modern example
The baseline workflow is still simple: configure an LM, define a signature, wrap it in a module, and run a task. What changes with DSPy is what happens next: you can now optimize the behavior systematically.
Minimal baseline
Readable program first, optimizer second
This is the kind of baseline you should be able to explain out loud before you compile anything.
```python
import dspy

# Configure the language model backend ("provider/model" is a placeholder).
lm = dspy.LM("provider/model")
dspy.configure(lm=lm)

class SupportTriage(dspy.Signature):
    """Classify the ticket and propose the next operator action."""

    ticket: str = dspy.InputField()
    severity: str = dspy.OutputField(desc="high | medium | low")
    next_action: str = dspy.OutputField()

class TriageProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.triage = dspy.Predict(SupportTriage)

    def forward(self, ticket: str):
        return self.triage(ticket=ticket)

program = TriageProgram()
result = program(ticket="Customer payment failed twice and dashboard access is blocked.")
# result.severity and result.next_action hold the structured outputs.
```
Important
Do not optimize before the baseline is legible
If you cannot explain what the baseline program is supposed to do, which cases it fails, and how the metric scores it, optimization is premature. DSPy helps disciplined teams; it does not rescue vague tasks.
4. The compile loop that actually works
1. Define the task contract. Write signatures and modules that keep the program understandable. This is where most quality is won or lost.
2. Assemble train and dev sets. Use representative examples with edge cases, and keep a holdout split so you can detect overfitting at compile time.
3. Write a metric that reflects reality. Prefer deterministic scoring where possible; use rubric or model grading only when the task truly requires judgment.
4. Compile with an optimizer. Use a current DSPy optimizer such as MIPROv2 or GEPA once the baseline and metric are trustworthy.
5. Inspect the compiled artifact. Read the resulting instructions and examples. The compiled program is not sacred; it is an artifact you should understand.
6. Validate on holdout and ship with versioning. Track the compiled program version, the model backend, and the eval result together. Otherwise rollback gets messy fast.
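The loop above can be sketched end to end. The optimizer call is wrapped in a function with a deferred import so the split helper stays dependency-free; `dspy.MIPROv2`, its `auto` parameter, and the `valset` keyword reflect current DSPy but should be checked against the version you actually run:

```python
def build_splits(examples, train_frac=0.6, dev_frac=0.2):
    """Ordered train/dev/holdout split; shuffle upstream if order carries signal."""
    n = len(examples)
    a = round(n * train_frac)
    b = a + round(n * dev_frac)
    return examples[:a], examples[a:b], examples[b:]

def compile_program(program, metric, trainset, devset):
    """Compile with MIPROv2 once the baseline and metric are trustworthy."""
    import dspy  # deferred: assumes dspy is installed and an LM is configured
    optimizer = dspy.MIPROv2(metric=metric, auto="light")
    return optimizer.compile(program, trainset=trainset, valset=devset)
```

The holdout slice never reaches `compile_program`; it is scored once, after compilation, to produce the release number you version alongside the artifact.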
5. What to inspect after compile
Instructions
Read the compiled prompt like production code
Do the instructions still reflect the task you meant to solve, or did the optimizer learn shortcuts that only work on the dev slice?
Examples
Check whether demos are aligned or suspiciously narrow
If compiled examples over-index on one phrasing pattern or one happy path, you may be overfitting to the train split.
Failure slices
Compare aggregate gains against specific regressions
A higher mean score can still hide retrieval collapse, schema drift, or worse performance on edge cases. Inspect slice-level failures explicitly.
Backend sensitivity
Assume compiled programs are model-sensitive artifacts
Model swaps, context-window changes, and provider behavior shifts all warrant a recompile-and-validate loop.
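A higher mean can mask exactly the slice-level regressions described above, so it helps to score per slice rather than in aggregate. A small stdlib-only helper, where `slice_of` is any hypothetical labeling function you supply (ticket type, input length band, language):

```python
from collections import defaultdict

def score_by_slice(records, metric, slice_of):
    """Average a metric per slice; records are (example, prediction) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for example, prediction in records:
        bucket = totals[slice_of(example)]
        bucket[0] += metric(example, prediction)
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Comparing these per-slice numbers before and after compilation surfaces a collapsed edge-case slice even when the overall mean improved.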
6. Choose the optimizer by failure mode
| Optimizer style | Best use | Caution |
| --- | --- | --- |
| Bootstrap-style few-shot optimization | Good first pass when the task is narrow and the metric is simple | It can plateau quickly if the task needs better instruction search. |
| MIPROv2-style instruction and demo search | Good default when you want serious prompt-and-example optimization without writing a custom search loop | Needs a meaningful dev set and can optimize the wrong behavior if the metric is weak. |
| GEPA | Strong when you care about multi-objective tradeoffs or richer search over program variants | More power does not excuse weak metrics. Garbage metrics still poison the frontier. |
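That escalation path can be encoded as a small factory. The import is deferred, and the constructor arguments (`max_bootstrapped_demos`, `auto`, and GEPA's keyword set) are assumptions to verify against the DSPy version in use:

```python
_STYLES = {"bootstrap_few_shot", "instruction_search", "reflective_search"}

def make_optimizer(style, metric):
    """Pick a starting optimizer by failure mode; escalate only when justified."""
    if style not in _STYLES:
        raise ValueError(f"unknown optimizer style: {style}")
    import dspy  # deferred: assumes dspy is installed
    if style == "bootstrap_few_shot":
        return dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
    if style == "instruction_search":
        return dspy.MIPROv2(metric=metric, auto="light")
    return dspy.GEPA(metric=metric, auto="light")
```

Keeping the choice behind one function makes it cheap to start with bootstrapping and move to MIPROv2 or GEPA later without touching the rest of the compile loop.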
7. When DSPy is overkill
No eval loop
You cannot score quality yet
If there is no trustworthy metric or labeled set, you do not have enough signal to justify optimization.
Task churn
The workflow is still changing too fast
If the task contract, schema, or business rules are still moving every week, keep the baseline thin until the system stabilizes.
Tiny surface area
A plain prompt already clears the bar
When the problem is small, deterministic, and already readable, DSPy can add more abstraction than value.
No artifact discipline
Your team will not version compiled outputs
If the compiled program, backend, metric, and dataset snapshot will not be tracked together, rollback and debugging get messy fast.
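The artifact-discipline point is cheap to satisfy: bundle the compiled program file, model backend, metric name, dataset fingerprint, and holdout score into one record. A minimal stdlib-only sketch (the field names are illustrative, not a DSPy convention):

```python
import hashlib
import json

def release_record(program_path, model, metric_name, dataset_rows, eval_score):
    """One versioned record tying a compiled artifact to how it was validated."""
    fingerprint = hashlib.sha256(
        json.dumps(dataset_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    return {
        "program": program_path,
        "model": model,
        "metric": metric_name,
        "dataset_sha": fingerprint,
        "holdout_score": eval_score,
    }
```

Because the fingerprint is derived from the dataset itself, a silently edited eval set shows up as a changed `dataset_sha` at the next release, which is exactly the drift that makes rollback messy.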
8. Sharp checklist
- Start with a clear signature and a readable baseline module.
- Use DSPy only when the task repeats enough and the output quality is measurable.
- Use real train/dev data, not a handful of cherry-picked happy-path examples.
- Invest in the metric before you invest in optimizer settings.
- Inspect compiled programs instead of treating optimization as magic.
- Track quality, cost, and latency separately.
- Version compiled artifacts and revalidate them on model changes.
Where to go next
DSPy gets stronger when paired with evaluation and workflow design
Pair this with Evaluation to design better metrics, Mastering RAG if your program depends on retrieval, and Agentic RFT if you want to think beyond prompt optimization into policy improvement loops.