Agent Building
Advanced
Always open

'ARC-AGI' Research Agents for Complex Problem Solving

Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.

This challenge tasks you with building a collaborative multi-agent system using CrewAI. The goal is to tackle a complex, abstract research problem, simulating an 'ARC-AGI-like' scenario where agents must demonstrate advanced reasoning, synthesis, and problem-solving capabilities to produce an expert-level research report. Your CrewAI setup will include specialized agents such as a 'Problem Decomposer' (Gemini 2.5 Pro), a 'Knowledge Synthesizer' (Gemini 2.5 Pro), and a 'Critical Evaluator' (GPT-4o). These agents will collaborate, utilizing tools like a Pinecone vector database for persistent memory and context management, and leveraging Ray Serve for efficient deployment of custom analysis tools or specialized models. The challenge emphasizes orchestrating sophisticated agent workflows to go beyond simple information retrieval and truly engage in deep, structured reasoning to arrive at novel insights and solutions, with the final output evaluated for its comprehensiveness and conceptual depth.
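The three-agent setup described above can be sketched as follows. This is a minimal illustration, not the challenge's reference solution. The role names and model pairing come from the brief. The goals, backstories, task wording, the model id strings ("gemini/gemini-2.5-pro", "gpt-4o"), and the choice of a sequential process are assumptions to verify against your own CrewAI and provider configuration. The crewai import is deferred so the role table can be inspected without the package installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    role: str
    goal: str
    backstory: str
    model: str  # assumed LiteLLM-style model id; check your provider setup


# Roles and models follow the brief; goals and backstories are illustrative.
AGENT_SPECS = [
    AgentSpec("Problem Decomposer", "Split the problem into tractable sub-questions.",
              "A methodical analyst.", "gemini/gemini-2.5-pro"),
    AgentSpec("Knowledge Synthesizer", "Combine findings into a coherent draft report.",
              "A cross-disciplinary researcher.", "gemini/gemini-2.5-pro"),
    AgentSpec("Critical Evaluator", "Find gaps and logical flaws in the draft.",
              "A skeptical reviewer.", "gpt-4o"),
]


def build_crew():
    """Assemble a sequential crew (requires the crewai package and API keys)."""
    from crewai import Agent, Crew, Process, Task  # deferred import

    agents = [Agent(role=s.role, goal=s.goal, backstory=s.backstory, llm=s.model)
              for s in AGENT_SPECS]
    decomposer, synthesizer, evaluator = agents

    # "{problem}" is filled in from the inputs passed to kickoff().
    decompose = Task(description="Break this problem into sub-questions: {problem}",
                     expected_output="A numbered list of sub-questions.",
                     agent=decomposer)
    synthesize = Task(description="Answer each sub-question and draft a report.",
                      expected_output="A structured draft report.",
                      agent=synthesizer, context=[decompose])
    critique = Task(description="Critique the draft and list required revisions.",
                    expected_output="A list of issues with suggested fixes.",
                    agent=evaluator, context=[synthesize])

    return Crew(agents=agents, tasks=[decompose, synthesize, critique],
                process=Process.sequential)
```

With the package installed and keys configured, a run would look like build_crew().kickoff(inputs={"problem": "..."}). A hierarchical process is another option, but it needs a manager model or manager agent to be configured.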

Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max Score: 5
Dimensions: 5 scoring checks
Binary: 5 pass or fail dimensions
Ordinal: 0 scaled dimensions

Dimension 1: ReportStructureCompliance

Ensures the generated report contains all required sections.

Type: binary
Weight: 1

Binary check: this dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
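A local pre-check for this dimension might look like the sketch below. The page does not list the required sections, so the names in REQUIRED_SECTIONS are hypothetical placeholders, and the heading detection (a line whose text, after stripping leading '#', digits, and dots, equals a section name) is an assumption about report layout rather than the evaluator's actual rule.

```python
import re

# Hypothetical section names: replace with those given in CHALLENGE.md.
REQUIRED_SECTIONS = (
    "Problem Decomposition",
    "Synthesis",
    "Critical Evaluation",
    "Conclusion",
)


def missing_sections(report: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section titles that never appear as a heading line."""
    headings = {
        # Drop leading markers such as "## " or "2. " before comparing.
        re.sub(r"^[#\d.\s]+", "", line).strip().lower()
        for line in report.splitlines()
        if line.strip()
    }
    return [name for name in required if name.lower() not in headings]


def structure_ok(report: str) -> bool:
    """Binary result: True only when no required section is missing."""
    return not missing_sections(report)
```

Because the check is all-or-nothing, it is worth running on every draft before submission: a single missing heading forfeits the whole dimension.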

Dimension 2: SolutionPlausibility

Checks if the core solution presented is logically sound given the problem constraints.

Type: binary (pass or fail, no partial credit)
Weight: 1

Dimension 3: Conceptual_Depth_Score

A human-rated score (1-5) assessing the profundity and insightfulness of the analysis.
Target: 4
Range: 1-5

Type: binary (pass or fail, no partial credit)
Weight: 1

Dimension 4: Logical_Consistency_Score

A human-rated score (1-5) evaluating the coherence and logical flow throughout the report.
Target: 4
Range: 1-5

Type: binary (pass or fail, no partial credit)
Weight: 1

Dimension 5: Problem_Coverage_Ratio

Automated metric: fraction of problem facets explicitly addressed in the report.
Target: 0.85
Range: 0.6-1

Type: binary (pass or fail, no partial credit)
Weight: 1
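To see how the five binary dimensions add up to the maximum score of 5, here is a rough sketch. It assumes that a rated or measured dimension passes when its value meets or exceeds the stated target (4, 4, and 0.85); the page does not state that rule explicitly. The substring-based coverage_ratio is a naive stand-in for the automated metric, whose real implementation is not described.

```python
def coverage_ratio(facets: list[str], report: str) -> float:
    """Fraction of facets mentioned verbatim (case-insensitive) in the report."""
    if not facets:
        return 0.0
    text = report.lower()
    hits = sum(1 for facet in facets if facet.lower() in text)
    return hits / len(facets)


def total_score(structure_ok: bool, plausible: bool, depth: int,
                consistency: int, coverage: float) -> tuple[int, dict[str, bool]]:
    """Sum the weight-1 binary checks; thresholds are the rubric targets (assumed rule)."""
    checks = {
        "ReportStructureCompliance": structure_ok,
        "SolutionPlausibility": plausible,
        "Conceptual_Depth_Score": depth >= 4,
        "Logical_Consistency_Score": consistency >= 4,
        "Problem_Coverage_Ratio": coverage >= 0.85,
    }
    # Each dimension has weight 1 and awards no partial credit.
    return sum(1 for passed in checks.values() if passed), checks
```

Under this reading, a report rated 3 for logical consistency scores 4 out of 5 however strong the other dimensions are, so the near-miss dimensions are where revision effort pays off.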

Learning goals

What you should walk away with

Master CrewAI for defining roles, tasks, and hierarchical collaboration among specialized AI agents to solve complex, multi-faceted problems.

Utilize Gemini 2.5 Pro for advanced reasoning, problem decomposition, and knowledge synthesis tasks, leveraging its deep thinking and multi-modal capabilities if applicable.

Integrate GPT-4o as a specialized 'Critical Evaluator' agent within the CrewAI framework, focusing on its nuanced understanding for critique and refinement of research outputs.

Design and implement a long-term memory system for agents using Pinecone vector database, enabling persistent context, concept retrieval, and structured knowledge management.

Develop and deploy custom tool-using agents or specialized model endpoints using Ray Serve, allowing agents to access domain-specific functions or advanced analytical capabilities.

Orchestrate iterative research and refinement cycles within CrewAI, demonstrating how agents can collaborate to explore hypotheses, synthesize findings, and critically evaluate their own outputs to converge on high-quality solutions.
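The iterative research and refinement cycle in the last goal can be sketched independently of any framework. In this illustration, draft_fn and critique_fn are hypothetical callables standing in for the synthesizer and evaluator agents, and the fixed round limit is an assumed stopping rule; none of these names come from CrewAI.

```python
from typing import Callable


def refine(
    draft_fn: Callable[[str, str], str],
    critique_fn: Callable[[str], tuple[bool, str]],
    problem: str,
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Draft, critique, and redraft until accepted or the round limit is hit.

    draft_fn(problem, feedback) returns a new draft.
    critique_fn(draft) returns (accepted, feedback).
    Returns the last draft and the number of rounds used.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    feedback = ""  # no feedback exists before the first critique
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = draft_fn(problem, feedback)
        accepted, feedback = critique_fn(draft)
        if accepted:
            return draft, round_no
    return draft, max_rounds
```

In a real build, each callable would wrap a model call, and the feedback string could also be written to the vector store so that later runs can retrieve earlier critiques.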

Start from your terminal
$ npx -y @versalist/cli start arc-agi-research-agents-for-complex-problem-solving
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Challenge at a glance
Host: Vera (AI Research & Mentorship)
Starts: Available now
Run mode: Evergreen challenge

Tool Space Recipe

Draft
Evaluation
Rubric: 5 dimensions
· ReportStructureCompliance (weight 1)
· SolutionPlausibility (weight 1)
· Conceptual_Depth_Score (weight 1)
· Logical_Consistency_Score (weight 1)
· Problem_Coverage_Ratio (weight 1)
Gold items: 1 (1 public)
