Workflow Automation
Advanced
Always open

Agent for Complex Policy & Contract Analysis

Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.

Build a Claude agent with the Claude Agent SDK that dissects and analyzes complex legal or business policy documents, drawing on the startup Ivo's approach of breaking legal reviews into smaller steps. The agent reduces hallucinations through granular task decomposition and contextual retrieval. It uses Claude Opus 4.1 for reasoning, Weaviate for semantic search over a corpus of policy documents (a RAG approach), and Prefect to orchestrate document ingestion workflows. A simple Gradio interface lets users submit documents for analysis.
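
The retrieval side of this brief can be sketched as follows. This is a minimal illustration that assumes an in-memory index as a stand-in for Weaviate, so it runs with the standard library only. The name `retrieve_relevant_clauses` comes from the learning goals; `chunk_document`, `ClauseIndex`, and the bag-of-words scoring are hypothetical and are not Weaviate's API. A real build would replace the scoring with vector search over embeddings.

```python
import math
import re
from collections import Counter


def chunk_document(text: str) -> list[str]:
    """Split a document into clause-sized chunks on blank lines (assumed layout)."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def _tokens(text: str) -> Counter:
    """Lowercased bag of words; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ClauseIndex:
    """Hypothetical in-memory index; a real build would query Weaviate instead."""

    def __init__(self) -> None:
        self._chunks: list[tuple[str, Counter]] = []

    def add_document(self, text: str) -> int:
        """Chunk and index a document; returns the number of chunks stored."""
        chunks = chunk_document(text)
        self._chunks.extend((chunk, _tokens(chunk)) for chunk in chunks)
        return len(chunks)

    def retrieve_relevant_clauses(self, query: str, top_k: int = 2) -> list[str]:
        """Return up to top_k chunks ranked by similarity, dropping zero-score chunks."""
        q = _tokens(query)
        scored = [(_cosine(q, vec), chunk) for chunk, vec in self._chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for score, chunk in scored[:top_k] if score > 0]
```

Returning only chunks with a non-zero score means the agent receives an empty list when nothing matches, rather than unrelated text it might treat as evidence.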

Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max Score: 4
Dimensions: 4 scoring checks
Binary: 4 pass or fail dimensions
Ordinal: 0 scaled dimensions

Dimension 1: correct_clause_identification

Correct Clause Identification

Checks if at least two relevant clauses are identified and summarized.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 2: hallucination_absence

Hallucination Absence

Verifies that the agent self-reports no hallucinations, or an external check confirms veracity.

binary
Weight: 1

Dimension 3: semantic_accuracy_of_summaries

Semantic Accuracy of Summaries

Measures how accurately clause summaries reflect the original text, scored 0-100 with a target of 85. Because the dimension is binary, a submission presumably passes only when it reaches the target.

binary
Weight: 1

Dimension 4: completeness_of_implications

Completeness of Implications

Evaluates how comprehensively the agent identifies implications for each clause, scored 0-100 with a target of 80. Because the dimension is binary, a submission presumably passes only when it reaches the target.

binary
Weight: 1
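
As a worked example of the rubric above, the following sketch shows how four binary dimensions with weight 1 add up to the maximum score of 4. It assumes the two 0-100 dimensions pass when the score meets or exceeds the stated target (85 and 80). The page does not state the comparison rule, and the names `RubricInput` and `score_submission` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricInput:
    clauses_identified: int          # relevant clauses identified and summarized
    hallucination_free: bool         # agent self-report or external check
    semantic_accuracy: float         # 0-100
    implication_completeness: float  # 0-100


def score_submission(r: RubricInput) -> dict[str, int]:
    """Score each binary dimension as 0 or 1 (weight 1) and sum them."""
    results = {
        "correct_clause_identification": int(r.clauses_identified >= 2),
        "hallucination_absence": int(r.hallucination_free),
        # Assumed rule: pass when the score reaches the target.
        "semantic_accuracy_of_summaries": int(r.semantic_accuracy >= 85),
        "completeness_of_implications": int(r.implication_completeness >= 80),
    }
    results["total"] = sum(results.values())
    return results
```

Under these assumptions, a submission with three clauses, no hallucinations, semantic accuracy of 90, and completeness of 79 scores 3 out of 4, because partial credit is not awarded on the last dimension.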

Learning goals

What you should walk away with

Master the **Claude Agent SDK** for defining tools, coordinating multi-step reasoning, and leveraging its 'computer use' capabilities for document interaction.

Use **Claude Opus 4.1** for long-context understanding and complex reasoning in legal and policy interpretation.

Implement a **Weaviate** vector database for storing and semantically searching policy documents, forming the core of the RAG system.

Design and deploy data pipelines with **Prefect** for automated ingestion, chunking, embedding, and indexing of new policy documents into Weaviate.

Build custom tools for the Claude Agent to interact with Weaviate (e.g., `retrieve_relevant_clauses(query: str)`) and perform document summarization.

Develop a simple web interface using **Gradio** for users to upload policy documents and receive structured analysis results.

Implement strategies for granular task decomposition, breaking down complex analysis into smaller, verifiable steps to minimize hallucinations, inspired by the Ivo approach.
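
The decomposition goal above can be sketched as a pipeline of small, checkable steps: split the document into clauses, analyze each clause in isolation, and flag any analysis whose supporting quote does not appear verbatim in the clause. This is an illustration under stated assumptions, not Ivo's method or the Claude Agent SDK API. `analyze_clause` is a hypothetical callable standing in for a model call, the numbered-clause format is assumed, and the verbatim-quote check is one simple grounding heuristic among many.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClauseAnalysis:
    clause: str
    summary: str
    evidence: str           # quote the analyzer claims supports the summary
    grounded: bool = False  # set by verify_grounding


def split_clauses(document: str) -> list[str]:
    """Step 1: split before numbered headings such as '1. ' at the start of a line."""
    parts = re.split(r"(?m)^(?=\d+\.\s)", document)
    return [part.strip() for part in parts if part.strip()]


def verify_grounding(analysis: ClauseAnalysis) -> ClauseAnalysis:
    """Step 3: mark the analysis grounded only if the evidence is a verbatim quote."""
    analysis.grounded = bool(analysis.evidence) and analysis.evidence in analysis.clause
    return analysis


def analyze_document(
    document: str,
    analyze_clause: Callable[[str], ClauseAnalysis],
) -> list[ClauseAnalysis]:
    """Run split, per-clause analysis (step 2), and the grounding check in order."""
    return [verify_grounding(analyze_clause(clause)) for clause in split_clauses(document)]
```

Analyses with `grounded` set to `False` can be retried or surfaced to the user as unverified, which keeps each step individually checkable instead of trusting one long answer.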

Start from your terminal
$ npx -y @versalist/cli start agent-for-complex-policy-contract-analysis
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Challenge at a glance

Host and timing

Host: Vera, AI Research & Mentorship
Starts: Available now
Run mode: Evergreen challenge

Tool Space Recipe

Draft
Evaluation
Rubric: 4 dimensions
· Correct Clause Identification (1%)
· Hallucination Absence (1%)
· Semantic Accuracy of Summaries (1%)
· Completeness of Implications (1%)
Gold items: 1 (1 public)
