Automated Generative AI Model Evaluation
What you are building
The core problem, expected build, and operating context for this challenge.
This challenge focuses on building a sophisticated, automated evaluation harness for generative AI models. Instead of relying solely on quantitative metrics, the system employs a team of AI agents, powered by CrewAI and Claude Opus 4.5, to qualitatively assess diverse generative outputs such as creative writing, complex code snippets, and synthetic media descriptions. The system must generate comprehensive evaluation reports, track experiments, and orchestrate evaluation workflows for continuous benchmarking.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
What you should walk away with
Master the `CrewAI` framework for creating role-based agent teams, defining tasks, and managing agent collaboration to perform complex generative AI evaluations; a minimal crew is sketched after this list.
Utilize `Claude Opus 4.5` as the 'Expert Critic' agent, leveraging its advanced reasoning and comprehension to provide nuanced, human-like qualitative assessments of generated content (e.g., creativity, coherence, factual accuracy, safety); a direct critique call is sketched below.
Implement persistent memory and context management for evaluation tasks using `ChromaDB`, storing and retrieving historical test cases, model outputs, and expert agent feedback for continuous learning and contextual evaluation; see the `ChromaDB` sketch below.
Integrate `MLflow` for comprehensive experiment tracking, logging evaluation metrics (both quantitative scores and qualitative scores from agents), model versions under test, and artifacts such as generated content and agent reports; see the `MLflow` sketch below.
Orchestrate end-to-end generative AI evaluation workflows using `Prefect`, defining flows that chain test case generation, model inference, agentic evaluation, metric logging, and report generation, ensuring scalability and reproducibility; see the `Prefect` sketch below.
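A minimal sketch of such a role-based crew, assuming the `crewai` package is installed and an Anthropic API key is configured. The agent roles, task wording, and the `anthropic/claude-opus-4-5` model string are illustrative assumptions, not values prescribed by the challenge:

```python
from crewai import Agent, Crew, Process, Task

# Assumed LiteLLM-style identifier for Claude Opus 4.5; verify before use.
CRITIC_MODEL = "anthropic/claude-opus-4-5"

# The 'Expert Critic' reviews generated content against a qualitative rubric.
critic = Agent(
    role="Expert Critic",
    goal="Qualitatively assess generated content for creativity, coherence, "
         "factual accuracy, and safety",
    backstory="A senior reviewer of generative AI outputs.",
    llm=CRITIC_MODEL,
)

# A second agent turns the critique into a structured report.
reporter = Agent(
    role="Report Writer",
    goal="Summarize critic feedback into a structured evaluation report",
    backstory="A technical writer who compiles evaluation findings.",
    llm=CRITIC_MODEL,
)

critique_task = Task(
    description="Review the following output for creativity, coherence, "
                "factual accuracy, and safety: {generated_output}",
    expected_output="A rubric-style critique with per-axis scores and justifications",
    agent=critic,
)

report_task = Task(
    description="Compile the critique into a final evaluation report.",
    expected_output="A markdown evaluation report",
    agent=reporter,
)

crew = Crew(
    agents=[critic, reporter],
    tasks=[critique_task, report_task],
    process=Process.sequential,  # tasks run in order, passing context forward
)

result = crew.kickoff(inputs={"generated_output": "<text under evaluation>"})
```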
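Where the critic needs a direct model call outside the crew, the official `anthropic` SDK can be used. This is a hedged sketch: the rubric wording is illustrative, and the `claude-opus-4-5` model id is an assumption inferred from the challenge's naming, to be checked against the current model list:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "Assess the text on four axes, scoring each 1-5 with a one-sentence "
    "justification: creativity, coherence, factual accuracy, safety."
)

def critique(generated_text: str) -> str:
    """Return a rubric-based qualitative critique of one generated output."""
    message = client.messages.create(
        model="claude-opus-4-5",  # assumed identifier for Claude Opus 4.5
        max_tokens=1024,
        system=RUBRIC,
        messages=[{"role": "user", "content": generated_text}],
    )
    return message.content[0].text
```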
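For the memory layer, a minimal `ChromaDB` sketch using a local on-disk store; the collection name and metadata fields are illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="./eval_memory")  # persists across runs
collection = client.get_or_create_collection("evaluation_history")

# Store a model output together with the critic's verdict as metadata.
collection.add(
    ids=["case-001"],
    documents=["<generated output text>"],
    metadatas=[{"model": "model-v1", "critic_score": 4, "verdict": "coherent"}],
)

# Later, retrieve the most similar past cases to give the critic
# historical context when evaluating a new output.
similar = collection.query(query_texts=["<new generated output>"], n_results=3)
print(similar["documents"], similar["metadatas"])
```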
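Experiment tracking can then hang off each evaluation run. A hedged `MLflow` sketch, where the experiment, parameter, metric, and artifact names are all assumptions:

```python
import mlflow

mlflow.set_experiment("genai-evaluation")

with mlflow.start_run(run_name="model-v1-eval"):
    # The model/version under test.
    mlflow.log_param("model_under_test", "model-v1")
    # A quantitative metric alongside the agent's qualitative score.
    mlflow.log_metric("bleu", 0.42)
    mlflow.log_metric("critic_score", 4.0)
    # The generated content and the agent's report, stored as run artifacts.
    mlflow.log_text("<generated output text>", "generated_output.txt")
    mlflow.log_text("<full critic report>", "agent_report.md")
```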
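Finally, the stages can be tied together as a `Prefect` flow. Prefect 2.x infers the dependency graph from task calls inside a flow rather than from an explicitly declared DAG; all function bodies below are placeholders:

```python
from prefect import flow, task

@task
def generate_test_cases() -> list[str]:
    return ["prompt-1", "prompt-2"]  # placeholder test prompts

@task
def run_inference(prompt: str) -> str:
    return f"<output for {prompt}>"  # placeholder call to the model under test

@task
def agentic_evaluation(output: str) -> dict:
    return {"critic_score": 4, "verdict": "coherent"}  # placeholder crew call

@task
def log_and_report(results: list[dict]) -> None:
    print(f"Logged {len(results)} evaluations")  # placeholder MLflow/report step

@flow
def evaluation_pipeline() -> None:
    prompts = generate_test_cases()
    outputs = [run_inference(p) for p in prompts]
    results = [agentic_evaluation(o) for o in outputs]
    log_and_report(results)

if __name__ == "__main__":
    evaluation_pipeline()
```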
Challenge setup writes CHALLENGE.md, .versalist.json, and eval/examples.json. Requires VERSALIST_API_KEY. Works with any MCP-aware editor.
Hosted by DocsAI Research & Mentorship.