Agent Building
Advanced
Always open

Agent for Robotaxi Safety Policy Analysis & Dynamic Procedure Generation

Develop an advanced agent with Anthropic's Claude Agent SDK that analyzes real-time robotaxi incident data, cross-references it with complex safety regulations, and dynamically generates or adapts operational procedures. Inspired by the Waymo incident and upcoming UK regulations, this challenge focuses on building a highly reliable, safety-critical agent that can interpret regulatory documents, learn from incidents, and output actionable safety protocols. The agent will use Claude Opus's extended thinking and computer use capabilities to process large volumes of unstructured text and adapt to evolving regulatory landscapes.

Datasets

Shared data for this challenge

Review public datasets and any private uploads tied to your build.

Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max score: 4
Dimensions: 4 scoring checks
Binary: 4 pass-or-fail dimensions
Ordinal: 0 scaled dimensions

Dimension 1: PolicyReferenceCorrectness

Verifies that generated procedures correctly reference the relevant safety policies or regulations.

Type: binary
Weight: 1

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
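
One way to pre-check this dimension locally is to confirm that every clause a procedure cites actually exists in your policy store. The sketch below is a minimal illustration, not the evaluator's method: the clause IDs, the policy wording, and the square-bracket citation convention are all assumptions made for the example. It checks only that cited clauses exist, not that they are the relevant ones.

```python
import re

# Hypothetical policy index: clause ID -> clause text.
# The IDs and wording are invented for illustration only.
POLICY_INDEX = {
    "SP-4.2": "Vehicles must yield to emergency responders.",
    "SP-7.1": "Remote assistance must respond within a fixed time limit.",
}

# Assumed citation convention: clause IDs in square brackets, e.g. "[SP-4.2]".
CITATION_PATTERN = re.compile(r"\[([A-Z]+-\d+(?:\.\d+)*)\]")


def extract_citations(procedure_text: str) -> list[str]:
    """Return every clause ID cited in the procedure text, in order."""
    return CITATION_PATTERN.findall(procedure_text)


def references_are_correct(procedure_text: str, policy_index: dict[str, str]) -> bool:
    """Pass only if at least one clause is cited and every cited clause exists."""
    cited = extract_citations(procedure_text)
    return bool(cited) and all(clause_id in policy_index for clause_id in cited)
```

A procedure with no citations fails this check by design, since an uncited procedure cannot be traced back to a policy.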

Dimension 2: ProcedureSafetyStandard

Checks whether the generated procedure includes specific, quantifiable safety improvements.

Type: binary
Weight: 1

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
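
A rough local heuristic for "quantifiable" is to look for numbers attached to units in the procedure text. The sketch below assumes a small, hand-picked unit list and is only a first-pass filter: a matched quantity does not prove the change improves safety, and the real evaluator may judge this differently.

```python
import re

# Assumed unit list for illustration; extend it to match your own procedures.
QUANTITY_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s*"
    r"(?:%|percent|ms|s|seconds?|minutes?|m|meters?|metres?|km/h|mph)(?!\w)",
    re.IGNORECASE,
)


def find_quantified_claims(procedure_text: str) -> list[str]:
    """Return each number-plus-unit phrase found in the procedure text."""
    return [match.group(0) for match in QUANTITY_PATTERN.finditer(procedure_text)]


def has_quantifiable_improvement(procedure_text: str) -> bool:
    """Heuristic: true when at least one number-plus-unit phrase is present."""
    return bool(find_quantified_claims(procedure_text))
```

Returning the matched phrases as well as the boolean makes it easier to review by hand whether each number describes a real safety threshold.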

Dimension 3: ReasoningDepthScore

A subjective score (1-5) on the depth of the agent's root cause analysis and understanding of policy implications.

Target: 4
Range: 1-5
Type: binary
Weight: 1

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 4: ProcedureSpecificityScore

A score (1-5) on how specific, actionable, and unambiguous the generated procedures are.

Target: 4
Range: 1-5
Type: binary
Weight: 1

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
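
The rubric arithmetic can be sketched as follows: four binary dimensions, each with weight 1, sum to a maximum score of 4. The page does not say how the 1-5 scores for dimensions 3 and 4 become pass or fail, so the rule used here (pass when the score meets or exceeds the target of 4) is an assumption, and the names `DimensionResult`, `ordinal_to_pass`, and `total_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DimensionResult:
    name: str
    passed: bool
    weight: int = 1  # every dimension in this rubric has weight 1


def ordinal_to_pass(score: int, target: int = 4, low: int = 1, high: int = 5) -> bool:
    """ASSUMPTION: a 1-5 score counts as a pass when it reaches the target."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return score >= target


def total_score(results: list[DimensionResult]) -> int:
    """Sum the weights of passed dimensions; failed ones contribute nothing."""
    return sum(result.weight for result in results if result.passed)


results = [
    DimensionResult("PolicyReferenceCorrectness", passed=True),
    DimensionResult("ProcedureSafetyStandard", passed=True),
    DimensionResult("ReasoningDepthScore", passed=ordinal_to_pass(3)),
    DimensionResult("ProcedureSpecificityScore", passed=ordinal_to_pass(4)),
]
print(total_score(results))
```

Under the stated assumption, a reasoning depth score of 3 misses the target, so this example run would earn 3 of the 4 available points.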

Learning goals

What you should walk away with

Master the Claude Agent SDK for defining agent capabilities, tools, and orchestrating complex multi-step reasoning processes.

Utilize Claude Opus 4.1 for advanced text comprehension, policy interpretation, and generating nuanced operational procedures.

Implement 'computer use' tools within the Claude Agent for tasks such as simulating incident scenarios, extracting specific clauses from regulatory PDFs, and querying external knowledge bases.

Develop custom tools for the Claude Agent to interface with a simulated real-time incident stream and a database of safety policies.

Design a robust AI evaluation harness (e.g., using LangChain Eval or custom Python scripts) to assess the correctness, completeness, and safety of generated procedures.

Orchestrate the agent's decision-making to dynamically adapt safety protocols based on new incident data or updated regulations, ensuring compliance and continuous improvement.
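
The custom tools mentioned above (a simulated incident stream and a policy database) can be prototyped as plain Python functions before they are wired into an agent. The sketch below is a minimal illustration under stated assumptions: the incident records, categories, and policy clauses are invented, every name is hypothetical, and registration with the Claude Agent SDK is deliberately left out because it depends on the SDK version you use.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Incident:
    incident_id: str
    category: str
    description: str


# Invented sample data standing in for a real-time feed and a policy database.
SIMULATED_INCIDENTS = [
    Incident("INC-001", "emergency_scene", "Vehicle entered an active emergency scene."),
    Incident("INC-002", "pickup_zone", "Vehicle stopped in a bus lane during pickup."),
]

POLICIES = {
    "emergency_scene": ["Yield to emergency responders.", "Do not enter cordoned areas."],
    "pickup_zone": ["Stop only in designated pickup areas."],
}


def incident_stream(incidents: Iterable[Incident]) -> Iterator[Incident]:
    """Yield incidents one at a time, mimicking a live feed."""
    for incident in incidents:
        yield incident


def lookup_policies(category: str) -> list[str]:
    """Return the policy clauses for a category, or an empty list if none match."""
    return POLICIES.get(category, [])


def build_agent_context(incident: Incident) -> dict:
    """Bundle an incident with its matching policies for the agent to reason over."""
    return {
        "incident_id": incident.incident_id,
        "description": incident.description,
        "policies": lookup_policies(incident.category),
    }


contexts = [build_agent_context(i) for i in incident_stream(SIMULATED_INCIDENTS)]
```

Keeping the tools as pure functions means the same code can back both the agent and the evaluation harness, so a failing rubric check can be reproduced without calling the model.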

Start from your terminal
$ npx -y @versalist/cli start agent-for-robotaxi-safety-policy-analysis-dynamic-procedure-generation
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Challenge at a glance

Host: Vera (AI Research & Mentorship)
Starts: Available now
Run mode: Evergreen challenge

Tool Space Recipe

Draft
Evaluation
Rubric: 4 dimensions
· PolicyReferenceCorrectness (weight 1)
· ProcedureSafetyStandard (weight 1)
· ReasoningDepthScore (weight 1)
· ProcedureSpecificityScore (weight 1)
Gold items: 2 (2 public)
