Agent Building · Advanced · Always open

AI Code Audit & Optimization Agent

This challenge focuses on building an advanced multi-agent system using the OpenAI Agents SDK to analyze and optimize AI-generated code. Inspired by the need for better management of AI-written code, participants will design and implement a team of specialized agents capable of reviewing code for quality, security vulnerabilities, performance bottlenecks, and adherence to best practices. The system will leverage tool-calling capabilities to interact with simulated code environments and external analysis tools. The core task involves orchestrating agents to perform a comprehensive audit of a given Python codebase, identify areas for improvement, and suggest optimized code snippets. This requires careful agent role definition, state management within the OpenAI Assistants API, and robust error handling. The solution should demonstrate sophisticated multi-turn conversation flows and autonomous decision-making in the context of code review.
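The orchestration described above, specialist reviewers feeding one aggregator, can be sketched in plain Python, independent of any particular SDK. The class names and the hard-coded-credential heuristic below are illustrative only, not part of the challenge spec:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str    # "bug" | "security" | "optimization"
    line: int
    description: str


class AuditAgent:
    """One specialist reviewer; subclasses implement review()."""
    name = "base"

    def review(self, code: str) -> list[Finding]:
        raise NotImplementedError


class SecurityAnalystAgent(AuditAgent):
    """Toy specialist: flags likely hard-coded credentials by substring match."""
    name = "security-analyst"

    def review(self, code: str) -> list[Finding]:
        findings = []
        for lineno, line in enumerate(code.splitlines(), start=1):
            if "password =" in line.lower():
                findings.append(
                    Finding("security", lineno, "possible hard-coded credential"))
        return findings


def run_audit(code: str, agents: list[AuditAgent]) -> list[Finding]:
    """Orchestrator: fan the code out to each specialist and aggregate findings."""
    results: list[Finding] = []
    for agent in agents:
        results.extend(agent.review(code))
    return results
```

In a full build, each `AuditAgent` would wrap an LLM-backed assistant rather than a regex heuristic, but the fan-out/aggregate shape stays the same.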

Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.


Datasets

Shared data for this challenge

Review public datasets and any private uploads tied to your build.

Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max score: 4
Dimensions: 4 scoring checks
Binary: 4 pass-or-fail dimensions
Ordinal: 0 scaled dimensions
Dimension 1: JsonFormatCheck

Verify the output is a valid JSON matching the specified schema.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
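The brief does not publish the required schema, so the report shape below is an assumption for illustration. A minimal structural check using only the standard library might look like:

```python
import json

# Hypothetical report shape -- field names are assumptions, not the official schema.
sample = json.dumps({
    "findings": [
        {"category": "security", "line": 12,
         "description": "Use of eval() on untrusted input",
         "suggestion": "Replace eval() with ast.literal_eval()"}
    ]
})


def is_valid_report(raw: str) -> bool:
    """Return True if raw parses as JSON and each finding has the assumed fields."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    findings = doc.get("findings")
    if not isinstance(findings, list):
        return False
    required = {"category", "line", "description", "suggestion"}
    return all(isinstance(f, dict) and required <= f.keys() for f in findings)
```

Validating against the real schema (once published) with a library such as `jsonschema` would replace this hand-rolled check.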

Dimension 2: RelevantFindingsCount

Ensure at least 2 relevant findings (bug, security, or optimization) are identified.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
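A self-check for this dimension is straightforward; the category labels below are taken from the rubric text, while the field name `category` is an assumption about the output format:

```python
ALLOWED_CATEGORIES = {"bug", "security", "optimization"}
MIN_RELEVANT_FINDINGS = 2


def meets_findings_threshold(findings: list[dict]) -> bool:
    """Pass only if at least two findings fall in an allowed category."""
    relevant = [f for f in findings if f.get("category") in ALLOWED_CATEGORIES]
    return len(relevant) >= MIN_RELEVANT_FINDINGS
```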

Dimension 3: FindingAccuracy

Percentage of correctly identified issues (bugs, security, optimization) relative to a ground truth. Target: 85; range: 0-100.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the accuracy score meets the 85 target. Partial credit is not awarded.
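The brief does not define how a predicted finding is matched to the ground truth. One plausible rule, matching on (category, line) pairs and scoring the share of predictions that are correct, could be computed as:

```python
def finding_accuracy(predicted: list[dict], ground_truth: list[dict]) -> float:
    """Percentage of predicted findings that match a ground-truth (category, line) pair.

    The matching rule is an assumption; the official evaluator may also credit
    partial matches or score recall rather than precision.
    """
    truth = {(g["category"], g["line"]) for g in ground_truth}
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if (p["category"], p["line"]) in truth)
    return 100.0 * hits / len(predicted)
```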

Dimension 4: SuggestionRelevance

Score based on how actionable and correct the suggested code changes are (human-evaluated, or via automated checks for simple cases). Target: 4; range: 1-5.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the relevance score meets the target of 4. Partial credit is not awarded.

Learning goals

What you should walk away with

Master the OpenAI Agents SDK for defining agent roles, states, and conversational flows using the Assistants API

Implement advanced tool-calling functions within the OpenAI ecosystem to interface with code linters, security scanners, and performance profilers

Design hierarchical multi-agent workflows where specialized agents (e.g., 'Security Analyst Agent', 'Performance Engineer Agent') collaborate to review and suggest code improvements

Build a code generation and refactoring agent that uses OpenAI o3 to propose optimized code snippets based on audit findings

Integrate Hugging Face Transformers for semantic code search, vulnerability pattern recognition, or code summarization as a specialized agent tool

Utilize Skyvern for automated browsing of external documentation (e.g., language specs, library docs) to provide contextual advice to code agents

Orchestrate a pipeline with StackAI to manage the overall agent workflow, trigger code analyses, and present aggregated reports

Develop effective strategies for managing agent memory and context over long-running code review sessions using OpenAI's state management features
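As a concrete instance of the tool-calling goal above, the standard-library function below acts as a stand-in for an external security scanner, using `ast` to flag `eval`/`exec` calls. The commented wiring shows how such a function could be exposed to an agent via the OpenAI Agents SDK; that wiring is a sketch and should be checked against the SDK documentation:

```python
import ast


def scan_for_vulnerabilities(source: str) -> list[str]:
    """Toy stand-in for a security-scanner tool: flags calls to eval()/exec().

    Sketch of registering it as an agent tool with the OpenAI Agents SDK
    (untested here; names should be verified against the SDK docs):
        from agents import Agent, function_tool
        scanner_tool = function_tool(scan_for_vulnerabilities)
        auditor = Agent(name="Security Analyst Agent",
                        instructions="Audit Python code for security issues.",
                        tools=[scanner_tool])
    """
    issues = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            issues.append(f"line {node.lineno}: call to {node.func.id}()")
    return issues
```

Because the tool's logic is a plain function, it can be unit-tested in isolation before being handed to any agent.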

Start from your terminal
$ npx -y @versalist/cli start ai-code-audit-optimization-agent

[ok] Wrote CHALLENGE.md

[ok] Wrote .versalist.json

[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Docs
Manage API keys
Challenge at a glance

Host and timing

Host: Vera (AI Research & Mentorship)
Starts: Available now
Run mode: Evergreen challenge
Your progress

Participation status

You haven't started this challenge yet

Timeline and host

Operating window

Key dates and the organization behind this challenge.

Start date
Available now
Run mode
Evergreen challenge

Evaluation
Rubric: 4 dimensions
· JsonFormatCheck (weight 1)
· RelevantFindingsCount (weight 1)
· FindingAccuracy (weight 1)
· SuggestionRelevance (weight 1)
Gold items: 1 (1 public)

Frequently Asked Questions about AI Code Audit & Optimization Agent