Multi-Agent System for Automated Audit Evidence Collection
What you are building
The core problem, expected build, and operating context for this challenge.
Develop a multi-agent system using Microsoft's AutoGen framework to automate the collection and initial analysis of financial audit evidence. The challenge is to build a team of specialized AI agents that autonomously navigate public financial documents, extract relevant data, reconcile inconsistencies, and present findings in a structured format. The system should mimic the workflow of junior auditors, but with AI-driven efficiency and consistency, leveraging LLM reasoning and information synthesis. The final output is a summary report highlighting key extracted data points and any identified discrepancies, laying the groundwork for human oversight. The project involves designing conversational agent roles, defining their communication protocols within AutoGen, and integrating external tools for data access and long-term memory. It emphasizes practical application in a business context, showing how generative AI can streamline complex, data-intensive tasks in financial services.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
SchemaValidation
Output JSON adheres to the specified schema.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
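The exact schema is not reproduced on this page, but a pre-submission sanity check is cheap to run locally. A minimal sketch against a hypothetical report schema (the field names below are assumptions, not the real spec):

```python
# Minimal pre-submission check against a HYPOTHETICAL report schema.
# The real schema ships with the challenge assets; field names here are assumptions.
REQUIRED_FIELDS = {
    "company": str,
    "fiscal_year": int,
    "revenue": float,
    "discrepancies": list,
}

def validate_report(report: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the report passes."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in report:
            errors.append(f"missing field: {name}")
        elif not isinstance(report[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
    return errors
```

Because this criterion is all-or-nothing, running a check like this before submitting is the cheapest way to avoid losing the dimension's full weight to a typo in a key name.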
RevenueExtractionAccuracy
Extracted revenue is within 5% of the actual value.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
DiscrepancyDetectionRate
Percentage of actual discrepancies correctly identified. • target: 0.8 • range: 0-1
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ResponseTime
Time taken to generate the full report in seconds. • target: 90 • range: 30-300
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
What you should walk away with
Master AutoGen for defining conversational agent roles, skill sets, and inter-agent communication protocols to simulate an audit team.
Implement data acquisition agents using Bright Data's web scraping APIs to gather financial reports and public disclosures from designated sources.
Design an analysis agent that leverages Mistral Large's advanced reasoning capabilities to extract, categorize, and reconcile financial data points.
Integrate Pinecone as a long-term memory store for agents, allowing them to recall previously processed information and maintain context across tasks.
Build a 'Reviewer Agent' within AutoGen that validates extracted information and flags potential inconsistencies for human review, using structured output generation.
Orchestrate complex, multi-step workflows within AutoGen, including dynamic task assignment and conditional execution based on intermediate results.
Utilize Libretto for monitoring agent interactions and tracing decision-making paths, enabling robust debugging and performance evaluation of the multi-agent system.
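AutoGen's group-chat machinery handles agent turn-taking, tool calls, and LLM invocation for you; the Scraper → Analyst → Reviewer control flow it would orchestrate in this challenge can be sketched framework-free. The handlers below are stand-ins for LLM-backed agents, and all figures are invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable

# Framework-free sketch of the round-robin loop an AutoGen group chat manages.
# Each "agent" is a role name plus a handler; real agents would call an LLM.

@dataclass
class Agent:
    name: str
    handle: Callable[[dict], dict]  # reads the shared state, returns an update

def run_pipeline(agents: list[Agent], state: dict) -> dict:
    """Pass the shared state through each agent in order, like one chat round."""
    for agent in agents:
        state.update(agent.handle(state))
    return state

# Hypothetical handlers standing in for the challenge's LLM-backed agents.
def scraper(state):   # would call Bright Data APIs in the real system
    return {"filings": [{"source": "10-K", "revenue": 120.0},
                        {"source": "press_release", "revenue": 131.0}]}

def analyst(state):   # would call Mistral Large for extraction and reconciliation
    revenues = [f["revenue"] for f in state["filings"]]
    return {"revenue": revenues[0], "spread": max(revenues) - min(revenues)}

def reviewer(state):  # flags inconsistencies above a 5% spread for human review
    return {"needs_human_review": state["spread"] / state["revenue"] > 0.05}

team = [Agent("Scraper", scraper), Agent("Analyst", analyst), Agent("Reviewer", reviewer)]
```

In the real build, each handler becomes an `AssistantAgent` with its own system prompt and registered tools, and the shared-state dict becomes the chat transcript that AutoGen routes between them.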
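Pinecone's role in the stack is an upsert/query interface over embedding vectors. A toy in-memory stand-in (this is not the Pinecone client, and the two-dimensional vectors below are fake embeddings) illustrates the recall pattern agents use to retrieve previously processed context:

```python
import math

# Toy in-memory stand-in for Pinecone's upsert/query pattern.
# Real code would use the Pinecone client and embeddings from an LLM provider.

class MemoryStore:
    def __init__(self):
        self._items: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, item_id: str, vector: list[float], metadata: dict) -> None:
        """Insert or overwrite a vector and its metadata under item_id."""
        self._items[item_id] = (vector, metadata)

    def query(self, vector: list[float], top_k: int = 1) -> list[dict]:
        """Return metadata for the top_k stored vectors most similar to `vector`."""
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0
        ranked = sorted(self._items.values(),
                        key=lambda item: cosine(vector, item[0]), reverse=True)
        return [meta for _, meta in ranked[:top_k]]
```

The design point is that agents query by meaning rather than by key: an analysis agent embeds its current question, retrieves the nearest stored findings, and injects them into its prompt, which is how context survives across tasks and sessions.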
Scaffolding writes CHALLENGE.md, .versalist.json, and eval/examples.json. Requires VERSALIST_API_KEY; works with any MCP-aware editor.
DocsAI Research & Mentorship
Operating window
Key dates and the organization behind this challenge.