'ARC-AGI' Research Agents for Complex Problem Solving
What you are building
The core problem, expected build, and operating context for this challenge.
This challenge tasks you with building a collaborative multi-agent system in CrewAI to tackle a complex, abstract research problem: an 'ARC-AGI-like' scenario in which agents must demonstrate advanced reasoning, synthesis, and problem-solving to produce an expert-level research report. Your crew includes specialized agents such as a 'Problem Decomposer' (Gemini 2.5 Pro), a 'Knowledge Synthesizer' (Gemini 2.5 Pro), and a 'Critical Evaluator' (GPT-4o). The agents collaborate using a Pinecone vector database for persistent memory and context management, and Ray Serve for efficient deployment of custom analysis tools or specialized models. The emphasis is on orchestrating agent workflows that go beyond simple information retrieval and engage in deep, structured reasoning to reach novel insights and solutions; the final report is evaluated for comprehensiveness and conceptual depth.
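The decompose → synthesize → evaluate loop described above can be sketched in plain Python. Every function below is a stub standing in for a CrewAI task backed by Gemini 2.5 Pro or GPT-4o; all names and return shapes here are hypothetical, not part of the challenge scaffold.

```python
# Minimal stand-in for the CrewAI workflow: each "agent" is a plain
# function; in the real build these would be crewai.Agent instances
# backed by Gemini 2.5 Pro (decompose, synthesize) and GPT-4o (evaluate).

def decompose(problem: str) -> list[str]:
    # Problem Decomposer: split the problem into facets (stubbed).
    return [f"{problem}: facet {i}" for i in range(1, 4)]

def synthesize(facets: list[str]) -> str:
    # Knowledge Synthesizer: merge facet findings into a draft report.
    return "\n".join(f"Finding for {f}" for f in facets)

def evaluate(report: str) -> tuple[bool, str]:
    # Critical Evaluator: accept once every facet appears (stubbed).
    return ("facet 3" in report, "add coverage for facet 3")

def research_loop(problem: str, max_rounds: int = 3) -> str:
    # Iterate until the evaluator accepts or the round budget runs out.
    facets = decompose(problem)
    report = synthesize(facets)
    for _ in range(max_rounds):
        ok, critique = evaluate(report)
        if ok:
            break
        report += f"\nRevision addressing: {critique}"
    return report
```

In the real system, the loop structure lives in CrewAI's task orchestration rather than an explicit `for` loop, but the control flow is the same: draft, critique, revise, converge.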
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
ReportStructureCompliance
Ensures the generated report contains all required sections.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
SolutionPlausibility
Checks if the core solution presented is logically sound given the problem constraints.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Conceptual_Depth_Score
A human-rated score (1-5) assessing the profundity and insightfulness of the analysis. • target: 4 • range: 1-5
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Logical_Consistency_Score
A human-rated score (1-5) evaluating the coherence and logical flow throughout the report. • target: 4 • range: 1-5
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Problem_Coverage_Ratio
Automated metric: fraction (0–1) of problem facets explicitly addressed in the report. • target: 0.85 • range: 0.6-1
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
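The automated Problem_Coverage_Ratio amounts to facets-addressed divided by facets-total, gated at the target. The sketch below assumes a simple keyword match; the real evaluator's matching logic is not specified, so treat this only as a rough stand-in.

```python
def coverage_ratio(report: str, facets: list[str]) -> float:
    """Fraction of problem facets explicitly mentioned in the report."""
    text = report.lower()
    hits = sum(1 for facet in facets if facet.lower() in text)
    return hits / len(facets) if facets else 0.0

def meets_target(report: str, facets: list[str], target: float = 0.85) -> bool:
    # No partial credit: the dimension scores only at or above target.
    return coverage_ratio(report, facets) >= target
```

Because partial credit is not awarded, a report covering 80% of facets scores the same on this dimension as one covering 60%; budget agent effort accordingly.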
What you should walk away with
Master CrewAI for defining roles, tasks, and hierarchical collaboration among specialized AI agents to solve complex, multi-faceted problems.
Utilize Gemini 2.5 Pro for advanced reasoning, problem decomposition, and knowledge synthesis tasks, leveraging its deep thinking and multi-modal capabilities where applicable.
Integrate GPT-4o as a specialized 'Critical Evaluator' agent within the CrewAI framework, focusing on its nuanced understanding for critique and refinement of research outputs.
Design and implement a long-term memory system for agents using Pinecone vector database, enabling persistent context, concept retrieval, and structured knowledge management.
Develop and deploy custom tool-using agents or specialized model endpoints using Ray Serve, allowing agents to access domain-specific functions or advanced analytical capabilities.
Orchestrate iterative research and refinement cycles within CrewAI, demonstrating how agents can collaborate to explore hypotheses, synthesize findings, and critically evaluate their own outputs to converge on high-quality solutions.
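For the Pinecone-backed memory named above, the retrieval pattern is embed, upsert, then query by vector similarity. The sketch below is a dependency-free stand-in (Pinecone's actual client API differs, and the class and method names here are illustrative), useful for understanding what the memory layer must do before wiring in the real index.

```python
import math

class VectorMemory:
    """Toy in-process stand-in for a Pinecone index (cosine similarity)."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], str]] = {}

    def upsert(self, key: str, vector: list[float], text: str) -> None:
        # Store or overwrite an embedding alongside its source text.
        self._items[key] = (vector, text)

    def query(self, vector: list[float], top_k: int = 1) -> list[str]:
        # Return the texts of the top_k most similar stored vectors.
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self._items.values(),
                        key=lambda item: cosine(vector, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]
```

In production the embeddings would come from a real embedding model and the index would live in Pinecone, so agents across the crew share one persistent memory instead of an in-process dict.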
The challenge scaffold writes CHALLENGE.md, .versalist.json, and eval/examples.json. It requires a VERSALIST_API_KEY and works with any MCP-aware editor.
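Exporting the key for the current session might look like the following; the key value is a placeholder, not a real credential format.

```shell
# Hypothetical: make the Versalist API key available to the editor session.
export VERSALIST_API_KEY="your-key-here"
```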
DocsAI Research & Mentorship