Versalist guides · Prompting · Starter
Public learning references for AI builders.

Prompt Guide

A structured walkthrough for crafting reliable prompts across common LLM tasks.

Best for

Operators debugging live prompts and trying to stop prompt edits from feeling random.

Track position
2/2

Best when the system already works, but the output contract keeps drifting.

Outcome
Use a repeatable checklist to debug, version, and improve prompts without guesswork.
Guide map
4 min · 2 of 2 in track
Focus
Task framing · Parameter tuning · Prompt debugging
Prerequisites
A prompt you can test repeatedly · Basic familiarity with prompt outputs
You leave with
Prompt spec template · Release checklist · Failure-triage framework

Prompting in 2026 is not a copywriting trick. It is interface design for model behavior: define the job, structure the context, constrain the output, test against a dataset, and version changes like code. The sharp teams are not winning with clever phrasing. They are winning with better contracts and better eval loops.

What changed
The modern prompt stack is spec plus eval plus rollout
Current OpenAI guidance centers prompts around versioned prompt objects, variables, linked evals, and prompt optimization from dataset feedback. Its reasoning-model guidance also explicitly recommends short, direct developer instructions and warns against forcing chain-of-thought. Anthropic's latest Claude guidance pushes in the same direction from another angle: be explicit, use XML tags when the prompt has multiple moving parts, and treat examples and caching strategy as first-class engineering choices.
Default posture
Write prompts like interfaces

Success criteria, inputs, output contract, and failure handling should all be visible.

Reasoning models
Prefer direct instructions

OpenAI recommends avoiding generic "think step by step" prompting for reasoning models.

Prompt assets
Version them

Use prompt IDs, variables, and rollback paths instead of editing one giant string in place.

Optimization loop
Dataset -> graders -> iterate

Treat prompt improvements as measured changes, not vibes.

1. What to stop doing

Most weak prompts fail for one of four reasons: the task is underspecified, the context is noisy, the output format is vague, or the team ships a change without a dataset to catch regressions. Fix those before you reach for exotic prompting techniques.

Old habit → sharper move now → why it wins:

  • Write one giant monolithic prompt → split reusable instructions, examples, and runtime variables into a versioned prompt asset. It becomes testable, reviewable, and easier to roll back.
  • Tell every model to "think step by step" → for reasoning models, keep the instruction direct and specify the end condition. OpenAI reasoning guidance says explicit chain-of-thought prompting is usually unnecessary and can hurt.
  • Throw in random few-shot examples → start zero-shot; add 3-5 tightly aligned examples only when the output shape really needs them. Both OpenAI and Anthropic guidance now push for tighter example discipline.
  • Ask for prose and parse it later → request a strict schema, tool call, or explicit sections from the start. You reduce format drift and downstream parser work.
  • Tune prompts by gut feel → link the prompt to evals, graders, annotations, and release criteria. This is the difference between a demo and a production loop.
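The first row above recommends splitting a monolithic prompt string into a versioned asset. A minimal sketch of what that separation can look like in code; the `PromptAsset` class and its field names are illustrative, not any platform's actual API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptAsset:
    """A versioned prompt asset: reusable instructions, curated examples,
    and runtime variables live in separate, reviewable fields instead of
    one giant string edited in place."""
    prompt_id: str
    version: int
    instructions: str                            # stable, cacheable prefix
    examples: list = field(default_factory=list)  # curated few-shot pairs
    variable_names: tuple = ()                    # runtime slots

    def render(self, **variables) -> str:
        missing = set(self.variable_names) - variables.keys()
        if missing:
            raise ValueError(f"missing variables: {sorted(missing)}")
        parts = [self.instructions]
        for ex in self.examples:
            parts.append(
                f"<example>\n<input>{ex['input']}</input>\n"
                f"<output>{ex['output']}</output>\n</example>"
            )
        for name in self.variable_names:
            parts.append(f"<{name}>{variables[name]}</{name}>")
        return "\n\n".join(parts)

asset = PromptAsset(
    prompt_id="incident-summary", version=3,
    instructions="Summarize the incident for an engineering audience.",
    variable_names=("incident_report",),
)
print(asset.render(incident_report="DB failover at 02:14 UTC"))
```

Because instructions, examples, and variables are distinct fields, a diff between version 2 and version 3 shows exactly what changed, and rollback is just pointing back at the prior object.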

2. The current production workflow

The fastest way to improve prompt quality is to stop editing prompts in isolation. Work in a loop that starts with task definition and ends with measured rollout.

1
Write the task spec first
State the job, allowed sources, quality bar, refusal behavior, and what a bad answer looks like before you draft the prompt.
2
Define the output contract
Decide whether you want JSON, XML, markdown sections, a tool call, or a grader-friendly rubric before you ask the model anything.
3
Start with the simplest viable prompt
Use one clean instruction set and zero-shot examples first. Only add examples or decomposition after you know the baseline failure pattern.
4
Create an eval set
Use representative real inputs, edge cases, and failure traps. A prompt without a dataset cannot be improved with confidence.
5
Grade and annotate
Use deterministic checks where possible, model graders where necessary, and human comments when the failure is subtle.
6
Publish and version
Ship prompt versions deliberately, compare them against the prior version, and keep rollback cheap.
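Steps 4-6 above can be sketched as a tiny eval loop. This is a hedged illustration: `fake_model` stands in for a real API call, and the deterministic grader only checks JSON validity and required fields:

```python
import json

def grade_json_fields(output: str, required: set) -> bool:
    """Deterministic grader: output must be valid JSON containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return required <= set(data)

def run_eval(model_fn, dataset, grader) -> float:
    """Score a prompt version: fraction of dataset cases the grader accepts."""
    passed = sum(grader(model_fn(case["input"]), case["required"]) for case in dataset)
    return passed / len(dataset)

# Fake model stands in for a real LLM call so the loop is runnable as-is.
fake_model = lambda _: '{"summary": "db failover", "needs_followup": false}'
dataset = [{"input": "incident text", "required": {"summary", "needs_followup"}}]
print(run_eval(fake_model, dataset, grade_json_fields))  # 1.0
```

Comparing this score across prompt versions, on the same dataset, is what turns "the new prompt feels better" into a measured release decision.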

3. Build prompts like interfaces, not essays

A strong prompt exposes its structure. If a teammate cannot glance at it and tell what the model is supposed to do, what context is available, and what shape the output must take, the prompt is still too mushy.

Role and goal
Lead with the outcome, not the backstory
State the task in one line, then define what counts as success. Keep the model pointed at the end condition.
Name the task explicitly
State the success criteria
List non-negotiable constraints
Context
Separate facts from instructions
When the prompt includes policy text, documents, examples, and user input, put each into its own labeled section.
Use XML or markdown delimiters
Label source material clearly
Avoid mixing examples with real runtime input
Output
Make the response easy to grade
If another model, service, or human reviewer needs to inspect the output, design for that explicitly.
Use a schema or field list
Define refusal behavior
Specify when to cite, abstain, or ask clarifying questions
Ops
Treat examples and tools as runtime dependencies
Examples, tool definitions, and long reusable prefixes affect latency, cost, and caching strategy. They are not free.
Keep reusable prefixes stable
Measure cost and latency per version
Audit example quality like training data

Reference template
Prompt spec with visible sections
A visible contract is easier to review, cache, version, and debug than a single giant paragraph: it forces the model's job, evidence, and output shape into labeled sections.
xml
<task>
Summarize the incident and produce a remediation plan for an internal engineering audience.
</task>

<success_criteria>
- Preserve material facts.
- Distinguish confirmed facts from inferred causes.
- Output must fit the JSON schema below.
- If evidence is missing, set "needs_followup" to true.
</success_criteria>

<context>
<incident_report>{{incident_report}}</incident_report>
<system_constraints>
- Do not fabricate timestamps.
- Do not cite information not present in the report.
</system_constraints>
</context>

<examples>
<example>
<input>...</input>
<output>...</output>
</example>
</examples>

<output_contract>
{
  "summary": "string",
  "root_cause": "string | null",
  "remediations": ["string"],
  "needs_followup": "boolean"
}
</output_contract>
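An output contract like the one above only pays off if something enforces it. A minimal validator sketch for that specific schema; the function name and error format are illustrative:

```python
import json

# Required fields and their expected JSON types, mirroring the contract above.
REQUIRED_TYPES = {
    "summary": str,
    "remediations": list,
    "needs_followup": bool,
}

def validate_contract(raw: str):
    """Return (ok, errors) for a model response checked against the contract.
    root_cause is handled separately because it may be a string or null."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e}"]
    errors = []
    for field_name, expected in REQUIRED_TYPES.items():
        if field_name not in data:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(data[field_name], expected):
            errors.append(f"wrong type for {field_name}")
    if "root_cause" not in data:
        errors.append("missing field: root_cause")
    elif not (data["root_cause"] is None or isinstance(data["root_cause"], str)):
        errors.append("root_cause must be string or null")
    return not errors, errors
```

Running the validator on every response, in tests and in production, turns format drift into a countable failure category instead of a downstream parser surprise.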

4. Model-specific tactics that actually matter

OpenAI reasoning models
Keep instructions short and outcome-driven
Use the developer message as the top-level control surface. OpenAI recommends simple prompts, clear constraints, delimiters, and zero-shot first. Do not automatically ask for chain-of-thought.
GPT-style generation models
Lean harder on examples and response structure
Use examples when output consistency matters, keep tone guidance centralized, and enforce structured outputs early when you need parser-safe behavior.
Claude 4 and long-context flows
Be explicit, tagged, and deliberate about context
Anthropic guidance emphasizes direct instructions, XML tags for multi-part prompts, and carefully chosen examples. For long-context tasks, structure documents and metadata clearly.
Operator note
Use examples surgically
Anthropic's multishot guidance is strong, but that does not mean examples belong in every prompt. Add examples when the output shape, style, or edge-case behavior needs a canonical pattern. Otherwise, keep the prompt lean and spend the token budget on better context or better eval coverage.

5. Context windows, caching, and version control

Teams lose a lot of money and latency because they treat every request as a fresh prompt. The latest platform guidance is clear: organize reusable prefixes intentionally, keep static content stable, and version prompts so you can compare behavior across releases.

Prompt objects
Version and roll back

OpenAI now treats prompts as long-lived assets with versions and variables.

Caching
Stabilize the reusable prefix

Both OpenAI and Anthropic prompt caching reward static instructions and examples that do not churn.

Long context
Structure documents clearly

Anthropic recommends tagging long documents and grounding outputs in relevant quotes for document-heavy tasks.

Prompt optimizer
Use graders as fuel

OpenAI prompt optimization works best when annotations and grader critiques are specific.
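The "version and roll back" posture above does not require a platform feature; even a minimal in-process registry gives you cheap rollback. A sketch under the assumption that prompts are plain strings keyed by an ID (the class and method names are hypothetical):

```python
import hashlib

class PromptRegistry:
    """Minimal sketch of versioned prompt storage with cheap rollback."""
    def __init__(self):
        self._versions = {}  # prompt_id -> list of (content_hash, text)
        self._active = {}    # prompt_id -> index of the active version

    def publish(self, prompt_id: str, text: str) -> int:
        """Append a new version, make it active, and return its index."""
        digest = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._versions.setdefault(prompt_id, []).append((digest, text))
        self._active[prompt_id] = len(self._versions[prompt_id]) - 1
        return self._active[prompt_id]

    def active(self, prompt_id: str) -> str:
        return self._versions[prompt_id][self._active[prompt_id]][1]

    def rollback(self, prompt_id: str) -> str:
        """Point the active version back one step; the bad version stays in history."""
        if self._active[prompt_id] == 0:
            raise ValueError("no earlier version to roll back to")
        self._active[prompt_id] -= 1
        return self.active(prompt_id)
```

The content hash also helps the caching point above: if the hash of the reusable prefix has not changed between releases, cached prefixes stay warm.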

6. Triage failures by category, not by intuition

Specification failure
The prompt never defined success
Symptoms: drifting tone, wrong level of detail, inconsistent refusals. Fix by tightening success criteria, not by adding more adjectives.
Context failure
The model got the wrong evidence
Symptoms: hallucinated facts, shallow retrieval, inconsistent citations. Fix document placement, source boundaries, and retrieval quality.
Format failure
The output contract is too soft
Symptoms: malformed JSON, missing fields, prose where a schema was expected. Fix with stricter contracts and grader checks.
Evaluation failure
The prompt changed, but nothing measured it
Symptoms: silent regressions, subjective debates, version sprawl. Fix by tying prompt releases to evals and annotated failures.
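The four categories above can be wired into a triage step so each failing eval case is routed to the right fix. A sketch with deliberately simple heuristics; the signals (cited vs. allowed sources) are assumptions about what your pipeline can observe:

```python
import json

def triage(output: str, expected_fields: set,
           cited_sources: set, allowed_sources: set) -> str:
    """Route a failing case to one failure category so the fix targets
    the right layer. The heuristics are illustrative, not exhaustive."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return "format_failure"        # prose or malformed JSON where a schema was expected
    if expected_fields - set(data):
        return "format_failure"        # well-formed JSON, but fields are missing
    if cited_sources - allowed_sources:
        return "context_failure"       # cited material outside the allowed evidence
    return "specification_failure"     # output well-formed and grounded, yet still wrong:
                                       # the success criteria need tightening
```

Counting failures per category across an eval run tells you whether the next change should touch the contract, the retrieval layer, or the spec, instead of rewriting adjectives. (Evaluation failure is the meta-category: it shows up as having no such counts at all.)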

7. Sharp release checklist

  • Write the prompt as a spec with visible sections, not as a paragraph of vibes.
  • Use the simplest model-appropriate strategy first: zero-shot for reasoning models, examples only when needed.
  • Separate reusable instructions, documents, examples, and runtime input cleanly.
  • Define a strict output contract before you ship the prompt into an app or agent.
  • Version prompt changes and compare them against the previous release with linked evals.
  • Measure cost, latency, and failure categories, not just pass rate.
  • Keep a rollback path ready whenever you publish a new prompt version.
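The checklist's release criteria can be made executable as a simple gate that compares a candidate version against the current baseline. The metric names and thresholds below are illustrative defaults, not recommendations:

```python
def release_gate(candidate: dict, baseline: dict,
                 min_pass: float = 0.9,
                 max_cost_regression: float = 1.10,
                 max_latency_regression: float = 1.25) -> bool:
    """Ship a new prompt version only if it clears the eval pass rate
    and does not blow the cost/latency budget relative to the prior release."""
    if candidate["pass_rate"] < min_pass:
        return False
    if candidate["cost_per_call"] > baseline["cost_per_call"] * max_cost_regression:
        return False
    if candidate["p95_latency"] > baseline["p95_latency"] * max_latency_regression:
        return False
    return True
```

Gating on cost and latency alongside pass rate is what the "measure cost, latency, and failure categories" item looks like in practice: a version that wins on quality but doubles spend still fails the release.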
Where to go next
Pair prompt work with evals and agent design
Prompt quality compounds when it is connected to the rest of the stack. Use Evaluation to design graders, AI Agents to structure workflows, and MCP when the prompt needs tools and external context.
Open the evaluation guide
