AI Development

Build Multimodal Adversarial Benchmarking Agents

Status: Always open
Difficulty: Advanced
Points: 500
Challenge at a glance

Host: Vera (AI Research & Mentorship)
Starts: Available now
Run mode: Evergreen challenge
Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.

This challenge focuses on developing an advanced agent system to perform rigorous, adversarial evaluation of cutting-edge multimodal Large Language Models (LLMs). Participants will design and implement a multi-agent system capable of generating complex, multimodal prompts (text, image, audio) and then evaluating the responses from different LLMs for accuracy, coherence, and robustness against adversarial attacks. The system will leverage the A2A (Agent-to-Agent) Protocol for seamless communication between an adversarial prompt generation agent and an evaluation agent. DSPy will be used to programmatically optimize the adversarial prompt generation and response analysis, ensuring the benchmark is adaptive and thorough. This challenge emphasizes the practical application of agentic AI for quality assurance and objective comparison of next-generation generative models.
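
To make the DSPy side concrete, here is a minimal sketch of what the adversarial prompt-generation agent could look like, assuming a recent DSPy release with typed signatures. The signature fields, the attack taxonomy, and the `openai/gpt-4o-mini` model string are illustrative placeholders, not part of the challenge specification.

```python
import dspy

# Placeholder backend; substitute whichever LM you use for generation.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AdversarialPrompt(dspy.Signature):
    """Generate an adversarial multimodal test prompt for a target capability."""

    capability: str = dspy.InputField(desc="capability under test, e.g. chart reading")
    modality: str = dspy.InputField(desc="input modality to target: text, image, or audio")
    attack_style: str = dspy.InputField(desc="adversarial strategy, e.g. misleading context")
    prompt: str = dspy.OutputField(desc="adversarial prompt to send to the target model")
    expected_failure: str = dspy.OutputField(desc="failure mode the prompt should elicit")

# ChainOfThought inserts an intermediate reasoning step before the outputs,
# which tends to help with deliberately tricky test-case construction.
generate = dspy.ChainOfThought(AdversarialPrompt)

case = generate(
    capability="reading values off a bar chart",
    modality="image",
    attack_style="misleading axis labels",
)
print(case.prompt)
print(case.expected_failure)
```

Because the generator is declared rather than hand-prompted, a DSPy optimizer such as `BootstrapFewShot` can later tune it against an evaluation metric, which is what makes the benchmark adaptive.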

Datasets

Shared data for this challenge

Review public datasets and any private uploads tied to your build.

Learning goals

What you should walk away with

Master the A2A Protocol for secure, efficient agent-to-agent communication in a distributed system (a minimal message-passing sketch follows this list).

Implement advanced multimodal prompt generation strategies that use LLMs' capabilities across diverse input types.

Design DSPy programs to optimize prompt engineering and model reasoning for adversarial test case creation.

Build automated evaluation metrics that assess multimodal LLM responses for factual consistency, creativity, and safety (see the scoring sketch after this list).

Orchestrate a comparison framework between GPT 5.1 Pro and Gemini 2.5 Pro, focusing on their performance under stress tests.

Develop hybrid reasoning components to analyze and interpret both numerical scores and qualitative aspects of model outputs.
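
The A2A learning goal is easiest to picture as a generator agent handing a task to an evaluator agent and getting a scored result back on the same task ID. The sketch below is a deliberately simplified, in-process stand-in: the real A2A Protocol exchanges JSON-RPC messages between HTTP endpoints via its own SDK types, and every class, field, and stub here is invented for illustration.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class TaskMessage:
    """Simplified stand-in for an A2A task envelope (not the real SDK type)."""
    role: str       # which agent produced the message
    content: dict   # prompt payload or evaluation result
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def run_target_model(payload: dict) -> str:
    """Stub for the multimodal LLM under test; replace with a real API call."""
    return "The tallest bar is roughly 40 units."

class GeneratorAgent:
    def create_task(self) -> TaskMessage:
        # In the full system this would call the DSPy generator sketched earlier.
        payload = {"prompt": "Read the tallest bar's value.", "modality": "image"}
        return TaskMessage(role="generator", content=payload)

class EvaluatorAgent:
    def handle(self, msg: TaskMessage) -> TaskMessage:
        response = run_target_model(msg.content)
        # Placeholder scoring; the next sketch shows a more realistic metric.
        result = {"response": response, "consistent": True, "safe": True}
        return TaskMessage(role="evaluator", content=result, task_id=msg.task_id)

task = GeneratorAgent().create_task()
verdict = EvaluatorAgent().handle(task)
assert verdict.task_id == task.task_id   # results correlate back to the task
print(verdict.content)
```

Keeping the envelope explicit, even in toy form, makes it straightforward to swap the in-process handoff for real A2A HTTP transport later without touching the agents' logic.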
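For the automated-metrics goal, one common pattern is an LLM-as-judge wrapped in a DSPy metric function, so the same metric can drive both leaderboard scoring and generator optimization. This again assumes a recent DSPy release with typed signatures; the judge's criteria and field names are illustrative choices, not the required rubric.

```python
import dspy

class ResponseJudge(dspy.Signature):
    """Rate a model response to an adversarial multimodal prompt."""

    prompt: str = dspy.InputField()
    response: str = dspy.InputField()
    reference: str = dspy.InputField(desc="ground-truth answer or grading rubric")
    consistent: bool = dspy.OutputField(desc="factually consistent with the reference")
    safe: bool = dspy.OutputField(desc="free of unsafe or policy-violating content")

judge = dspy.Predict(ResponseJudge)

def adversarial_metric(example, pred, trace=None) -> float:
    """Metric in the shape dspy.Evaluate and DSPy optimizers expect; higher is better."""
    verdict = judge(
        prompt=example.prompt,
        response=pred.response,
        reference=example.reference,
    )
    return float(verdict.consistent and verdict.safe)
```

A boolean judge keeps the score interpretable; the hybrid-reasoning goal above would layer qualitative analysis (for example, the judge's rationale text) on top of this numeric signal.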

Timeline and host

Operating window

Key dates and the organization behind this challenge.

Start date: Available now
Run mode: Evergreen challenge