Challenge

Autonomous Security Operator Agents SDK

For general security intelligence, this challenge requires building an autonomous agent capable of identifying and fixing vulnerabilities in real-time. You will use the OpenAI Agents SDK to build a multi-turn operator that utilizes function calling to interface with Proxis AI for engineering diagnostics and Vellum for managing agent prompts and prompt variants. The agent must perform multi-step reasoning using the GPT-5.3-Codex model to simulate a human security engineer navigating complex codebases and Slack-based incident reports.

Agent BuildingHosted by Vera
Status
Always open
Difficulty
Advanced
Points
500
Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.

For general security intelligence, this challenge requires building an autonomous agent capable of identifying and fixing vulnerabilities in real-time. You will use the OpenAI Agents SDK to build a multi-turn operator that utilizes function calling to interface with Proxis AI for engineering diagnostics and Vellum for managing agent prompts and prompt variants. The agent must perform multi-step reasoning using the GPT-5.3-Codex model to simulate a human security engineer navigating complex codebases and Slack-based incident reports.

Datasets

Shared data for this challenge

Review public datasets and any private uploads tied to your build.

Loading datasets...
Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max Score: 2
Dimensions
2 scoring checks
Binary
2 pass or fail dimensions
Ordinal
0 scaled dimensions
Dimension 1security_validation

Security Validation

Patch must eliminate the specific vulnerability without introducing new ones.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 2remediation_speed

Remediation Speed

Time taken from ingestion to patch generation • target: 60 • range: 0-300

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Learning goals

What you should walk away with

  • Orchestrate OpenAI Agents SDK role-based teams with specialized agents for vulnerability detection and automated patching

  • Leverage the GPT-5.3-Codex model for advanced multi-step reasoning in security-sensitive contexts

  • Design autonomous tool-calling loops that allow agents to browse file systems and execute Proxis AI diagnostic tools

  • Implement Libretto to evaluate agent performance and route requests based on latency and security scoring

  • Integrate Cartesia voice interfaces to provide real-time audio status updates for security operations centers

  • Master the hand-off pattern in OpenAI Agents SDK to transition from detection agents to remediation specialists

Start from your terminal
$npx -y @versalist/cli start autonomous-security-operator-agents-sdk

[ok] Wrote CHALLENGE.md

[ok] Wrote .versalist.json

[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Docs
Manage API keys
Host and timing
Vera

AI Research & Mentorship

Starts Available now
Evergreen challenge
Your progress

Participation status

You haven't started this challenge yet

Timeline and host

Operating window

Key dates and the organization behind this challenge.

Start date
Available now
Run mode
Evergreen challenge
Explore

Find another challenge

Jump to a random challenge when you want a fresh benchmark or a different problem space.

Useful when you want to pressure-test your workflow on a new dataset, new constraints, or a new evaluation rubric.

Tool Space Recipe

Draft
Action Space
OpenAIOpenAI AI model provider
required
VellumLLM development and testing platform
Policy Serving
GPT-5
Evaluation
Rubric: 2 dimensions
·Security Validation(1%)
·Remediation Speed(1%)
Gold items: 1 (1 public)

Frequently Asked Questions about Autonomous Security Operator Agents SDK