Agent for Robotaxi Safety Policy Analysis & Dynamic Procedure Generation
Develop an agent with Anthropic's Claude Agent SDK that analyzes real-time robotaxi incident data, cross-references it with safety regulations, and generates or adapts operational procedures in response. The challenge is inspired by the Waymo incident and upcoming UK regulations, and it centers on a reliable, safety-critical agent that can interpret regulatory documents, learn from incidents, and output actionable safety protocols. The agent uses Claude Opus's extended thinking and 'computer use' capabilities to process large volumes of unstructured text and to adapt as regulations change.
What you are building
The core problem, expected build, and operating context for this challenge.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
PolicyReferenceCorrectness
Verifies that generated procedures correctly reference the relevant safety policies or regulations.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ProcedureSafetyStandard
Checks whether the generated procedure includes specific, quantifiable safety improvements.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ReasoningDepthScore
A subjective score from 1 to 5 for the depth of the agent's root cause analysis and its understanding of policy implications. Target: 4. Range: 1-5.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ProcedureSpecificityScore
A score from 1 to 5 for how specific, actionable, and unambiguous the generated procedures are. Target: 4. Range: 1-5.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
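The all-or-nothing rule stated for each dimension can be sketched as below. This is a minimal illustration, not the evaluator's actual code: the equal weights of 0.25 are hypothetical (the brief does not publish weights), and treating a 1-5 score as satisfied once it reaches the target of 4 is an assumption about how "satisfies the requirement" is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionResult:
    name: str
    weight: float    # hypothetical: the brief does not publish weights
    satisfied: bool  # True only if the dimension's requirement is met


def meets_target(score: int, target: int = 4, low: int = 1, high: int = 5) -> bool:
    """Assumption: a 1-5 score satisfies its dimension once it reaches the target."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside range {low}-{high}")
    return score >= target


def total_score(results: list[DimensionResult]) -> float:
    """All-or-nothing aggregation: a dimension adds its full weight or nothing."""
    return sum(r.weight for r in results if r.satisfied)


results = [
    DimensionResult("PolicyReferenceCorrectness", 0.25, True),
    DimensionResult("ProcedureSafetyStandard", 0.25, False),
    DimensionResult("ReasoningDepthScore", 0.25, meets_target(4)),
    DimensionResult("ProcedureSpecificityScore", 0.25, meets_target(3)),
]
print(total_score(results))  # 0.5
```

Under these assumptions a specificity score of 3 earns nothing for its dimension, so a run that narrowly misses a target is scored the same as one that misses it badly.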
What you should walk away with
Master the Claude Agent SDK for defining agent capabilities, tools, and orchestrating complex multi-step reasoning processes.
Utilize Claude Opus 4.1 for advanced text comprehension, policy interpretation, and generating nuanced operational procedures.
Implement 'computer use' tools within the Claude Agent for tasks such as simulating incident scenarios, extracting specific clauses from regulatory PDFs, and querying external knowledge bases.
Develop custom tools for the Claude Agent to interface with a simulated real-time incident stream and a database of safety policies.
Design a robust AI evaluation harness (e.g., using LangChain Eval or custom Python scripts) to assess the correctness, completeness, and safety of generated procedures.
Orchestrate the agent's decision-making to dynamically adapt safety protocols based on new incident data or updated regulations, ensuring compliance and continuous improvement.
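The custom tools and evaluation checks named in these outcomes might start from something like the sketch below. It is plain Python with no SDK calls, so registering these functions as agent tools is left out. Everything in it is hypothetical: the policy IDs and clause text, the canned incidents, the `POL-123` citation format, and the small list of units used to detect a quantified measure.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# Hypothetical policy store; IDs and clause text are invented for illustration.
POLICIES = {
    "POL-001": "Vehicles must come to a controlled stop when sensors degrade.",
    "POL-002": "Remote operators must acknowledge an incident alert promptly.",
}


@dataclass(frozen=True)
class Incident:
    incident_id: str
    description: str


def incident_stream() -> Iterator[Incident]:
    """Stand-in for a real-time feed: yields canned incidents one at a time."""
    yield Incident("INC-1", "Vehicle stalled in an intersection after lidar dropout.")
    yield Incident("INC-2", "Remote operator alert went unacknowledged.")


def lookup_policy(policy_id: str) -> str:
    """Tool: return the clause text for a policy ID, or raise if it is unknown."""
    try:
        return POLICIES[policy_id]
    except KeyError:
        raise KeyError(f"unknown policy: {policy_id}") from None


def cited_policies(procedure: str) -> set[str]:
    """Extract policy IDs of the assumed form POL-123 from a generated procedure."""
    return set(re.findall(r"\bPOL-\d{3}\b", procedure))


def check_procedure(procedure: str) -> dict[str, bool]:
    """Two cheap pre-checks loosely mirroring the first two scoring dimensions."""
    cited = cited_policies(procedure)
    quantity = re.search(r"\b\d+(?:\.\d+)?\s*(?:seconds|metres|km/h|%)", procedure)
    return {
        # At least one citation, and every cited ID exists in the policy store.
        "references_valid": bool(cited) and cited.issubset(POLICIES),
        # At least one number followed by a recognised unit.
        "has_quantity": quantity is not None,
    }


good = "Per POL-001, reduce speed to 10 km/h and stop within 5 seconds."
bad = "Per POL-999, drive carefully."
print(check_procedure(good))
print(check_procedure(bad))
```

Checks like these only catch missing or invented citations and procedures with no numbers at all. Whether a cited clause is actually relevant, or whether a quantity is a real safety improvement, still needs a reviewer or a model-graded evaluation.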