testing

Evaluation Harness Integration

Inspect the original prompt language first, then copy or adapt it once you know how it fits your workflow.

Linked challenge: Agent for Robotaxi Safety Policy Analysis & Dynamic Procedure Generation

Prompt source

Original prompt text with formatting preserved for inspection.

Describe how you would integrate an evaluation harness (like a custom Python script or LangChain Eval) to automatically test the safety and compliance of the generated procedures. Focus on defining metrics for correctness, specificity, and adherence to regulatory standards. Provide a conceptual outline or pseudo-code for a test function that would check if a generated procedure addresses a specific identified root cause or complies with a new regulation.
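As a rough illustration of the kind of test function this prompt asks for, the sketch below scores a generated procedure on the three metrics it names. Everything in it is an assumption made for illustration: the `score_procedure` helper, the `ProcedureScore` fields, the unit list, the thresholds, and the clause ID `EXAMPLE-REG-1` are hypothetical, and plain keyword matching stands in for whatever grader (a model-based evaluator or human review) a real harness would use.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristics: a step counts as "specific" if it states a quantity with a unit.
UNIT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:seconds|meters|km/h|mph|percent)\b")
# Strips leading list markers such as "1.", "2)", "-" or "*" so they are not read as quantities.
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s*")


@dataclass
class ProcedureScore:
    root_cause_coverage: float  # correctness proxy: share of root-cause terms mentioned
    specificity: float          # share of steps that state a measurable quantity
    missing_clauses: list[str]  # adherence proxy: required clause IDs never cited

    def passed(self, min_coverage: float = 1.0, min_specificity: float = 0.5) -> bool:
        return (
            self.root_cause_coverage >= min_coverage
            and self.specificity >= min_specificity
            and not self.missing_clauses
        )


def score_procedure(
    procedure: str, root_cause_terms: list[str], required_clauses: list[str]
) -> ProcedureScore:
    text = procedure.lower()
    steps = [LIST_MARKER.sub("", line).strip() for line in procedure.splitlines() if line.strip()]

    hits = [term for term in root_cause_terms if term.lower() in text]
    coverage = len(hits) / len(root_cause_terms) if root_cause_terms else 0.0

    specific_steps = [step for step in steps if UNIT_PATTERN.search(step)]
    specificity = len(specific_steps) / len(steps) if steps else 0.0

    missing = [clause for clause in required_clauses if clause.lower() not in text]
    return ProcedureScore(coverage, specificity, missing)


# Made-up procedure text, used only to exercise the scorer.
procedure = """\
1. If lidar occlusion is detected, reduce speed to 15 km/h.
2. Pull over and stop within 30 seconds, per EXAMPLE-REG-1.
3. Notify remote operations.
"""
score = score_procedure(procedure, ["lidar occlusion"], ["EXAMPLE-REG-1"])
vague = score_procedure("Drive carefully.", ["lidar occlusion"], ["EXAMPLE-REG-1"])
```

Here two of the three steps state a quantity, so the example clears a 0.5 specificity threshold, while the vague procedure fails all three checks. Substring matching is deliberately crude: it confirms that a term or clause is mentioned, not that the step actually resolves the root cause.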

Adaptation plan

Keep the source stable, then change the prompt in a predictable order so the next run is easier to evaluate.

Keep stable

Preserve the rubric, target behavior, and pass-fail criteria as the baseline for evaluation.

Tune next

Adjust fixtures, mocks, and thresholds to the system under test instead of weakening the assertions.

Verify after

Make sure the prompt catches regressions instead of just mirroring the happy-path examples.
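One way to check for this is to audit the checker itself against labelled cases that include known-bad procedures, not only the happy path. The sketch below is a minimal illustration under stated assumptions: `audit_checker`, `Case`, the deliberately naive `cites_clause` checker, and the clause ID `EXAMPLE-REG-1` are all hypothetical names invented for this example.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    name: str
    procedure: str
    should_pass: bool  # label assigned by a reviewer, not by the checker


def audit_checker(checker: Callable[[str], bool], cases: list[Case]) -> list[str]:
    """Return the names of cases where the checker's verdict disagrees with the label."""
    return [case.name for case in cases if checker(case.procedure) != case.should_pass]


def cites_clause(procedure: str) -> bool:
    # Deliberately naive checker: passes anything that mentions the clause ID.
    return "EXAMPLE-REG-1" in procedure


CASES = [
    Case("happy_path", "Stop within 30 seconds, per EXAMPLE-REG-1.", True),
    Case("missing_citation", "Stop within 30 seconds.", False),
    Case("citation_without_action", "See EXAMPLE-REG-1.", False),
]

mismatches = audit_checker(cites_clause, CASES)
print(mismatches)  # ['citation_without_action']
```

The naive checker agrees with the labels on the happy path and on the missing citation, but it wrongly passes a procedure that cites the clause without prescribing any action. A non-empty mismatch list is the signal that the assertions need tightening before the harness can be trusted to catch regressions.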

Prompt diagnostics

This prompt already mixes executable detail with instructions, so tune examples and interfaces before rewriting the scaffold.

Linked challenge

Agent for Robotaxi Safety Policy Analysis & Dynamic Procedure Generation

Develop an advanced agent using Anthropic's Claude Agent SDK to analyze real-time robotaxi incident data, cross-reference it with complex safety regulations, and dynamically generate or adapt operational procedures. Inspired by the Waymo incident and upcoming UK regulations, this challenge focuses on building a highly reliable, safety-critical agent that can interpret regulatory documents, learn from incidents, and output actionable safety protocols. The agent will leverage Claude Opus's extended thinking and 'computer use' capabilities to process vast amounts of unstructured text and adapt to evolving regulatory landscapes.

Related prompts

Claude Agent SDK Initialization and Tool Definition

implementation

Incident Analysis Workflow with Claude Opus 4.1

planning

'Computer Use' for Policy Interpretation

implementation