AI Code Audit & Optimization Agent
This challenge focuses on building an advanced multi-agent system using the OpenAI Agents SDK to analyze and optimize AI-generated code. Inspired by the need for better management of AI-written code, participants will design and implement a team of specialized agents capable of reviewing code for quality, security vulnerabilities, performance bottlenecks, and adherence to best practices. The system will leverage tool-calling capabilities to interact with simulated code environments and external analysis tools. The core task involves orchestrating agents to perform a comprehensive audit of a given Python codebase, identify areas for improvement, and suggest optimized code snippets. This requires careful agent role definition, state management within the OpenAI Assistants API, and robust error handling. The solution should demonstrate sophisticated multi-turn conversation flows and autonomous decision-making in the context of code review.
What you are building
The core problem, expected build, and operating context for this challenge.
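The audit described above — specialized reviewers fanned out over the same code, with their findings merged into one report — can be sketched in plain Python. This is a stub of the orchestration shape only, not the OpenAI Agents SDK itself: the agent names, the `Finding` record, and the string-match "analysis" are illustrative assumptions; real agents would call the model with role-specific instructions.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    agent: str    # which specialist reviewer produced the finding
    kind: str     # "bug" | "security" | "optimization"
    message: str

def security_agent(code: str) -> list[Finding]:
    # Stub: a real agent would send the code to the model with
    # security-review instructions and parse its structured reply.
    findings = []
    if "eval(" in code:
        findings.append(Finding("security-analyst", "security",
                                "eval() on possibly untrusted input"))
    return findings

def performance_agent(code: str) -> list[Finding]:
    # Stub for a performance-focused reviewer.
    findings = []
    if "for " in code and "+= " in code:
        findings.append(Finding("performance-engineer", "optimization",
                                "accumulation inside a loop; consider join() or a comprehension"))
    return findings

def audit(code: str) -> list[Finding]:
    # The orchestrator runs every specialist on the same code and merges results.
    results: list[Finding] = []
    for agent in (security_agent, performance_agent):
        results.extend(agent(code))
    return results
```

Running `audit("for x in xs:\n    s += eval(x)")` yields one security and one optimization finding, which is exactly the aggregated-report shape the challenge asks the agent team to produce.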
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
JsonFormatCheck
Verify the output is valid JSON matching the specified schema.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
RelevantFindingsCount
Ensure at least 2 relevant findings (bug, security, or optimization) are identified.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
FindingAccuracy
Percentage of correctly identified issues (bugs, security, optimization) relative to a ground truth. • target: 85 • range: 0-100
This dimension contributes its full weight only when the score meets or exceeds the target. Partial credit is not awarded.
SuggestionRelevance
Score based on the actionable and correct nature of suggested code changes (human-evaluated or via automated checks for simple cases). • target: 4 • range: 1-5
This dimension contributes its full weight only when the score meets or exceeds the target. Partial credit is not awarded.
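Because JsonFormatCheck and RelevantFindingsCount are all-or-nothing, it is worth validating your own output before submitting. The exact schema is specified elsewhere and not reproduced here, so the field names below (`findings`, `type`, `location`, `description`, `suggestion`) are assumptions for illustration — swap in the real schema's keys:

```python
import json

# Assumed field names; replace with the schema the challenge actually specifies.
REQUIRED_KEYS = {"type", "location", "description", "suggestion"}
VALID_TYPES = {"bug", "security", "optimization"}

def self_check(raw: str) -> bool:
    """Return True if `raw` parses as JSON and contains at least 2
    well-formed findings (mirroring JsonFormatCheck + RelevantFindingsCount)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    findings = data.get("findings")
    if not isinstance(findings, list):
        return False
    valid = [f for f in findings
             if isinstance(f, dict)
             and REQUIRED_KEYS <= f.keys()
             and f["type"] in VALID_TYPES]
    return len(valid) >= 2
```

A wrapper like this catches the two binary failure modes (invalid JSON, fewer than two findings) before the evaluator does.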
What you should walk away with
Master the OpenAI Agents SDK for defining agent roles, states, and conversational flows using the Assistants API
Implement advanced tool-calling functions within the OpenAI ecosystem to interface with code linters, security scanners, and performance profilers
Design hierarchical multi-agent workflows where specialized agents (e.g., 'Security Analyst Agent', 'Performance Engineer Agent') collaborate to review and suggest code improvements
Build a code generation and refactoring agent that uses OpenAI o3 to propose optimized code snippets based on audit findings
Integrate Hugging Face Transformers for semantic code search, vulnerability pattern recognition, or code summarization as a specialized agent tool
Utilize Skyvern for automated browsing of external documentation (e.g., language specs, library docs) to provide contextual advice to code agents
Orchestrate a pipeline with StackAI to manage the overall agent workflow, trigger code analyses, and present aggregated reports
Develop effective strategies for managing agent memory and context over long-running code review sessions using OpenAI's state management features
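The tool-calling bullet above — interfacing agents with code linters — can be sketched with a stdlib-only tool function plus the JSON spec you would register with the model. The spec shape follows OpenAI function-calling conventions, and the two lint rules here are toy assumptions standing in for a real linter:

```python
import ast

def run_linter(source: str) -> list[dict]:
    """Toy lint tool: walks the AST and flags bare `except:` clauses
    and calls to eval()."""
    issues = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # `except:` with no exception type swallows everything, including SystemExit.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            issues.append({"line": node.lineno, "rule": "bare-except"})
        # eval() on dynamic input is a classic injection vector.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            issues.append({"line": node.lineno, "rule": "eval-call"})
    return issues

# Tool schema in the shape OpenAI-style function calling expects; the agent
# framework invokes run_linter when the model requests this tool by name.
LINTER_TOOL = {
    "type": "function",
    "name": "run_linter",
    "description": "Statically lint a Python snippet and return a list of issues.",
    "parameters": {
        "type": "object",
        "properties": {"source": {"type": "string"}},
        "required": ["source"],
    },
}
```

Keeping the tool a pure function of its input makes it easy to unit-test outside the agent loop, then hand to whichever specialist agent owns static analysis.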
DocsAI Research & Mentorship
Operating window
Key dates and the organization behind this challenge.