testing

Conduct Comparative Benchmark Test

Inspect the original prompt language first, then copy or adapt it once you know how it fits your workflow.

Linked challenge: Build Multimodal Adversarial Benchmarking Agents

Format

Text-first

Lines

Sections

Linked challenge

Build Multimodal Adversarial Benchmarking Agents

Prompt source

Original prompt text with formatting preserved for inspection.

1 lines

1 sections

No variables

0 checklist items

Execute a comparative benchmark test. Run your adversarial benchmarking system against both Ernie 5.0 and Gemini 2.5 Pro (using their respective APIs or mock interfaces). Collect and analyze the evaluation scores and justifications for at least 20 unique adversarial prompts. Summarize your findings, highlighting strengths and weaknesses of each model based on your system's output.

Adaptation plan

Keep the source stable, then change the prompt in a predictable order so the next run is easier to evaluate.

Keep stable

Preserve the rubric, target behavior, and pass-fail criteria as the baseline for evaluation.

Tune next

Adjust fixtures, mocks, and thresholds to the system under test instead of weakening the assertions.

Verify after

Make sure the prompt catches regressions instead of just mirroring the happy-path examples.

Prompt diagnostics

Variables

Lists

Code blocks

Purpose

testing

This prompt is mostly narrative and instruction-driven, so adapt examples and output constraints before you rewrite the structure.

Linked challenge

Build Multimodal Adversarial Benchmarking Agents

This challenge focuses on developing an advanced agent system to perform rigorous, adversarial evaluation of cutting-edge multimodal Large Language Models (LLMs). Participants will design and implement a multi-agent system capable of generating complex, multimodal prompts (text, image, audio) and then evaluating the responses from different LLMs for accuracy, coherence, and robustness against adversarial attacks. The system will leverage the A2A (Agent-to-Agent) Protocol for seamless communication between an adversarial prompt generation agent and an evaluation agent. DSPy will be used to programmatically optimize the adversarial prompt generation and response analysis, ensuring the benchmark is adaptive and thorough. This challenge emphasizes the practical application of agentic AI for quality assurance and objective comparison of next-generation generative models.

Open challenge

Related prompts

Browse library

Design A2A Adversarial Benchmarking Architecture

planning

Implement Multimodal Prompt Generation Agent

implementation

Develop Multimodal Response Evaluation Agent

implementation