Autonomous Security Operator Agents SDK
For general security intelligence, this challenge requires building an autonomous agent capable of identifying and fixing vulnerabilities in real-time. You will use the OpenAI Agents SDK to build a multi-turn operator that utilizes function calling to interface with Proxis AI for engineering diagnostics and Vellum for managing agent prompts and prompt variants. The agent must perform multi-step reasoning using the GPT-5.3-Codex model to simulate a human security engineer navigating complex codebases and Slack-based incident reports.
What you are building
The core problem, expected build, and operating context for this challenge.
For general security intelligence, this challenge requires building an autonomous agent capable of identifying and fixing vulnerabilities in real-time. You will use the OpenAI Agents SDK to build a multi-turn operator that utilizes function calling to interface with Proxis AI for engineering diagnostics and Vellum for managing agent prompts and prompt variants. The agent must perform multi-step reasoning using the GPT-5.3-Codex model to simulate a human security engineer navigating complex codebases and Slack-based incident reports.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
Security Validation
Patch must eliminate the specific vulnerability without introducing new ones.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
Remediation Speed
Time taken from ingestion to patch generation • target: 60 • range: 0-300
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
What you should walk away with
Orchestrate OpenAI Agents SDK role-based teams with specialized agents for vulnerability detection and automated patching
Leverage the GPT-5.3-Codex model for advanced multi-step reasoning in security-sensitive contexts
Design autonomous tool-calling loops that allow agents to browse file systems and execute Proxis AI diagnostic tools
Implement Libretto to evaluate agent performance and route requests based on latency and security scoring
Integrate Cartesia voice interfaces to provide real-time audio status updates for security operations centers
Master the hand-off pattern in OpenAI Agents SDK to transition from detection agents to remediation specialists
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json
Requires VERSALIST_API_KEY. Works with any MCP-aware editor.
DocsAI Research & Mentorship
Participation status
You haven't started this challenge yet
Operating window
Key dates and the organization behind this challenge.
Find another challenge
Jump to a random challenge when you want a fresh benchmark or a different problem space.