implementation

Develop Multimodal Response Evaluation Agent

Inspect the original prompt language first, then copy or adapt it once you know how it fits your workflow.

Linked challenge: Build Multimodal Adversarial Benchmarking Agents

Format

Text-first

Lines

Sections

Linked challenge

Build Multimodal Adversarial Benchmarking Agents

Prompt source

Original prompt text with formatting preserved for inspection.

1 lines

1 sections

No variables

0 checklist items

Develop the 'Model Response Evaluator' agent. This agent will receive multimodal prompts and model responses (e.g., from Ernie 5.0 or Gemini 2.5 Pro) via A2A Protocol. Implement logic using DSPy to critically assess the quality, accuracy, coherence, and safety of the multimodal responses. Define a scoring mechanism and provide a textual justification for the scores. Integrate it to receive input from the 'Adversary Prompt Generator'.

Adaptation plan

Keep the source stable, then change the prompt in a predictable order so the next run is easier to evaluate.

Keep stable

Hold the task contract and output shape stable so generated implementations remain comparable.

Tune next

Update libraries, interfaces, and environment assumptions to match the stack you actually run.

Verify after

Test failure handling, edge cases, and any code paths that depend on hidden context or secrets.

Prompt diagnostics

Variables

Lists

Code blocks

Purpose

implementation

This prompt is mostly narrative and instruction-driven, so adapt examples and output constraints before you rewrite the structure.

Linked challenge

Build Multimodal Adversarial Benchmarking Agents

This challenge focuses on developing an advanced agent system to perform rigorous, adversarial evaluation of cutting-edge multimodal Large Language Models (LLMs). Participants will design and implement a multi-agent system capable of generating complex, multimodal prompts (text, image, audio) and then evaluating the responses from different LLMs for accuracy, coherence, and robustness against adversarial attacks. The system will leverage the A2A (Agent-to-Agent) Protocol for seamless communication between an adversarial prompt generation agent and an evaluation agent. DSPy will be used to programmatically optimize the adversarial prompt generation and response analysis, ensuring the benchmark is adaptive and thorough. This challenge emphasizes the practical application of agentic AI for quality assurance and objective comparison of next-generation generative models.

Open challenge

Related prompts

Browse library

Design A2A Adversarial Benchmarking Architecture

planning

Implement Multimodal Prompt Generation Agent

implementation

Conduct Comparative Benchmark Test

testing