Agent for Complex Policy & Contract Analysis
What you are building
The core problem, expected build, and operating context for this challenge.
Build an agent with the Claude Agents SDK that dissects and analyzes complex legal or business policy documents, drawing on the startup Ivo's approach of breaking legal reviews into smaller steps. The agent aims to reduce hallucinations through granular task decomposition and contextual retrieval. It uses Claude Opus 4.1 for reasoning, Weaviate for semantic search over a corpus of policy documents (a RAG approach), and Prefect to orchestrate document ingestion workflows. A simple Gradio interface lets users submit documents for analysis.
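Granular task decomposition starts with splitting a document into units that can be analyzed one at a time. The following is a minimal sketch of that first step, assuming clauses begin with a numeric label such as `1.` or `2.3` at the start of a line. The `Clause` type, the `split_into_clauses` helper, and the numbering pattern are illustrative assumptions, not part of the challenge or of any SDK.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    number: str  # label as written in the document, e.g. "2" or "2.3"
    text: str    # clause body with the label removed


# Assumption: each clause starts at the beginning of a line with a numeric label.
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+", re.MULTILINE)


def split_into_clauses(document: str) -> List[Clause]:
    """Split a policy document into clauses so each can be analyzed on its own."""
    matches = list(CLAUSE_START.finditer(document))
    clauses = []
    for i, match in enumerate(matches):
        # A clause runs until the next label, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        clauses.append(Clause(number=match.group(1), text=document[match.end():end].strip()))
    return clauses


sample = (
    "1. Term. This agreement lasts 12 months.\n"
    "2. Termination. Either party may terminate with 30 days notice.\n"
)
clauses = split_into_clauses(sample)
for clause in clauses:
    print(clause.number, "->", clause.text)
```

Each resulting clause can then be summarized and checked in its own model call, which keeps every step small enough to verify against the source text. Real contracts use many numbering schemes, so the pattern would need adapting.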
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
Correct Clause Identification
Checks if at least two relevant clauses are identified and summarized.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Hallucination Absence
Verifies that the agent self-reports no hallucinations, or that an external check confirms the veracity of its output.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Semantic Accuracy of Summaries
Measures how accurately clause summaries reflect the original text. Range: 0-100; target: 85.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Completeness of Implications
Evaluates how comprehensively the agent identifies implications for each clause. Range: 0-100; target: 80.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
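The all-or-nothing rule can be sketched as a small calculation. This is only an illustration under stated assumptions: the brief does not list the dimension weights, so the equal weights below are hypothetical, and treating "satisfies the requirement" as "score at or above target" for the two 0-100 dimensions is an interpretation, not a documented rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dimension:
    name: str
    weight: float  # hypothetical: the brief does not state the weights
    passed: bool   # True only when the requirement is fully satisfied


def meets_target(score: float, target: float) -> bool:
    """Assumed pass rule for 0-100 dimensions: score at or above the target."""
    return score >= target


def total_score(dimensions: List[Dimension]) -> float:
    """A dimension contributes its full weight or nothing; no partial credit."""
    return sum(d.weight for d in dimensions if d.passed)


dimensions = [
    Dimension("Correct Clause Identification", 0.25, passed=True),
    Dimension("Hallucination Absence", 0.25, passed=True),
    Dimension("Semantic Accuracy of Summaries", 0.25, meets_target(90, 85)),
    Dimension("Completeness of Implications", 0.25, meets_target(75, 80)),
]
print(total_score(dimensions))
```

Under these assumptions, a run that scores 75 on implications earns nothing for that dimension, so missing a target by a small margin costs as much as missing it entirely.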
What you should walk away with
Master the **Claude Agents SDK** for defining tools, coordinating multi-step reasoning, and leveraging its 'computer use' capabilities for document interaction.
Utilize **Claude Opus 4.1** for its superior long-context understanding and complex reasoning in legal and policy interpretation.
Implement a **Weaviate** vector database for storing and semantically searching policy documents, forming the core of the RAG system.
Design and deploy data pipelines with **Prefect** for automated ingestion, chunking, embedding, and indexing of new policy documents into Weaviate.
Build custom tools for the Claude Agent to interact with Weaviate (e.g., `retrieve_relevant_clauses(query: str)`) and perform document summarization.
Develop a simple web interface using **Gradio** for users to upload policy documents and receive structured analysis results.
Implement strategies for granular task decomposition, breaking down complex analysis into smaller, verifiable steps to minimize hallucinations, inspired by the Ivo approach.
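The retrieval tool and the "smaller, verifiable steps" idea can be sketched together. The example below keeps the `retrieve_relevant_clauses(query: str)` signature from the list above but substitutes a plain in-memory list and keyword overlap for Weaviate and semantic search, so it runs without any services. The `CORPUS` contents, the overlap scoring, and the `unsupported_quotes` check are all illustrative assumptions; a real build would query Weaviate and would likely use a stronger grounding check.

```python
import re
from typing import List

# Stand-in corpus (assumption): in the real build these clauses live in Weaviate.
CORPUS = [
    "Either party may terminate this agreement with 30 days written notice.",
    "The supplier shall maintain insurance coverage of at least $1,000,000.",
    "Payment is due within 45 days of invoice receipt.",
]


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_relevant_clauses(query: str) -> List[str]:
    """Return up to two clauses sharing the most words with the query.

    Keyword overlap is a placeholder for semantic (vector) search.
    """
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(clause)), clause) for clause in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep corpus order
    return [clause for _, clause in scored[:2]]


def unsupported_quotes(summary: str, source: str) -> List[str]:
    """Return quoted spans in a summary that do not appear verbatim in the source."""
    quotes = re.findall(r'"([^"]+)"', summary)
    return [quote for quote in quotes if quote not in source]


hits = retrieve_relevant_clauses("terminate agreement notice")
summary = 'The clause allows exit with "30 days written notice" and a "60 day cure period".'
flagged = unsupported_quotes(summary, hits[0])
print(hits)
print(flagged)
```

Requiring the model to quote the clause text it relies on, then checking those quotes mechanically, is one way to make each analysis step verifiable. Here the second quote is flagged because it does not occur in the retrieved clause.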
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json
Requires VERSALIST_API_KEY. Works with any MCP-aware editor.