AI Safety Guardrail System for Generative Content
What you are building
The core problem, expected build, and operating context for this challenge.
Motivated by recent concerns about large language models generating unsafe content, this challenge asks developers to build a robust, real-time AI safety guardrail system. The system must actively monitor, evaluate, and, when necessary, block or rephrase outputs from a generative model so that harmful or nonconsensual content never reaches users, with particular attention to multi-modal inputs and outputs. It should use an agentic architecture to apply policy-driven moderation.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
What you should walk away with
Master CrewAI for building specialized, role-based agent teams dedicated to content analysis, policy enforcement, and output remediation; a team sketch follows this list.
Implement multi-modal input processing with OpenAI GPT-5 to analyze both text prompts and generated images for safety violations; see the moderation-call sketch below.
Design and integrate `Guardrails AI` to define and enforce explicit content policies, ensuring structured, compliant generative outputs; see the validator sketch below.
Build a prompt-engineering pipeline with `Weights & Biases (W&B Prompts)` for experimenting with and tracing moderation prompts and model behavior; see the logging sketch below.
Orchestrate a real-time monitoring and feedback loop in which `Claude Opus 4.5` acts as a meta-moderator, evaluating the effectiveness of the first-line guardrail agents; see the meta-review sketch below.
Develop strategies for handling edge cases, prompt-injection attempts, and adversarial inputs designed to bypass safety mechanisms; see the heuristic screen sketch below.
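The sketches below illustrate each objective in Python. First, the role-based team: a minimal sketch using CrewAI's standard `Agent`/`Task`/`Crew` API. The roles, goals, and task wording here are illustrative choices, not requirements of the challenge.

```python
from crewai import Agent, Task, Crew

# Three specialized roles: analyze, enforce, remediate.
analyst = Agent(
    role="Content Analyst",
    goal="Flag unsafe or nonconsensual material in model outputs",
    backstory="A trained reviewer for multi-modal safety triage.",
)
enforcer = Agent(
    role="Policy Enforcer",
    goal="Decide allow / block / rephrase against the written policy",
    backstory="Applies the content policy deterministically.",
)
remediator = Agent(
    role="Output Remediator",
    goal="Rewrite borderline outputs so they comply with policy",
    backstory="A specialist in minimal, meaning-preserving edits.",
)

review = Task(
    description="Review the candidate output for safety issues: {output}",
    expected_output="A verdict (allow/block/rephrase) with a short rationale",
    agent=analyst,
)
enforce = Task(
    description="Apply the content policy to the analyst's verdict",
    expected_output="A final moderation decision",
    agent=enforcer,
)
rewrite = Task(
    description="If the decision is 'rephrase', produce a compliant rewrite",
    expected_output="Policy-compliant text, or 'BLOCKED'",
    agent=remediator,
)

crew = Crew(agents=[analyst, enforcer, remediator], tasks=[review, enforce, rewrite])
result = crew.kickoff(inputs={"output": "candidate model output here"})
```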
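For the multi-modal check, a sketch of a single moderation call using the OpenAI chat-completions API with an `image_url` content part. The `gpt-5` model identifier is taken from the brief; the grading instructions and verdict labels are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderate(prompt_text: str, image_url: str) -> str:
    """Ask the model to grade a text prompt plus a generated image together."""
    response = client.chat.completions.create(
        model="gpt-5",  # model name as given in the challenge brief
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a strict content-safety grader. Reply with SAFE, "
                    "UNSAFE, or BORDERLINE, plus one sentence of rationale."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"User prompt: {prompt_text}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return response.choices[0].message.content
```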
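For policy enforcement, a sketch assuming the Guardrails AI hub's `ToxicLanguage` validator is installed; the threshold, sentence-level validation, and fail-closed behavior are illustrative choices.

```python
from guardrails import Guard
from guardrails.hub import ToxicLanguage  # installed via the Guardrails hub

# Fail closed: raise on any sentence the validator scores as toxic.
guard = Guard().use(
    ToxicLanguage,
    threshold=0.5,
    validation_method="sentence",
    on_fail="exception",
)

def enforce_policy(candidate: str) -> str:
    """Pass compliant text through; replace violations with a block notice."""
    try:
        guard.validate(candidate)
        return candidate
    except Exception:
        return "[blocked by content policy]"
```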
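For experiment tracking, a minimal sketch using plain `wandb` runs and tables rather than the full W&B Prompts trace views; the project name, config keys, metrics, and evaluation cases are placeholders.

```python
import wandb

# One W&B run per moderation-prompt variant.
run = wandb.init(project="guardrail-prompts", config={"prompt_version": "v3"})

# Placeholder evaluation cases; a real pipeline would load a labeled set.
eval_cases = [
    {"prompt": "benign request", "expected": "SAFE", "got": "SAFE"},
    {"prompt": "injection attempt", "expected": "UNSAFE", "got": "UNSAFE"},
]

hits = sum(c["expected"] == c["got"] for c in eval_cases)
run.log({"accuracy": hits / len(eval_cases), "n_cases": len(eval_cases)})

# Keep the raw verdicts browsable next to the aggregate metrics.
table = wandb.Table(columns=["prompt", "expected", "got"])
for c in eval_cases:
    table.add_data(c["prompt"], c["expected"], c["got"])
run.log({"verdicts": table})
run.finish()
```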
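For the meta-moderator loop, a sketch of a second-opinion call via the Anthropic `messages` API; the `claude-opus-4-5` identifier is a placeholder matching the brief, and the audit instructions are assumptions.

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def meta_review(original_output: str, guardrail_verdict: str) -> str:
    """Second-opinion pass: grade the first-line guardrail's decision."""
    message = client.messages.create(
        model="claude-opus-4-5",  # placeholder identifier from the brief
        max_tokens=300,
        system="You audit another moderator. Reply AGREE or OVERRIDE with a reason.",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Model output:\n{original_output}\n\n"
                    f"Guardrail verdict:\n{guardrail_verdict}"
                ),
            }
        ],
    )
    return message.content[0].text
```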
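Finally, for adversarial inputs, a purely hypothetical first-pass screen: cheap lexical heuristics that run before any model call. The patterns are illustrative, not exhaustive.

```python
import re

# Hypothetical lexical screen; real deployments pair this with a classifier.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now (dan|unfiltered|jailbroken)",
    r"system prompt",
    r"base64:",  # crude flag for encoded payload smuggling
]

def looks_like_injection(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

A screen like this only catches the crudest attempts; treat it as the cheapest layer in a defense-in-depth stack, ahead of the model-based moderators sketched above.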
```text
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json
```
Requires VERSALIST_API_KEY. Works with any MCP-aware editor.
Operating window
Key dates and the organization behind this challenge.
Hosted by DocsAI Research & Mentorship.