AI Development
Advanced
Always open

Multimodal Video Intelligence with Qwen3-VL, GPT-5 & LlamaIndex

Inspired by advances in long-context multimodal understanding, this challenge asks you to build a video intelligence system. You will pair the Qwen3-VL model, for video and image analysis, with GPT-5, for higher-level reasoning and synthesis. LlamaIndex provides RAG over the multimodal data so the system can answer 'needle-in-a-haystack' queries that span long videos. At its core, the system processes entire 30-minute video segments: it extracts key visual and auditory information, generates multimodal embeddings, and indexes them with LlamaIndex. An OpenAI Swarm-like orchestration layer manages specialized agents that collaborate over an A2A protocol to perform visual search, detect events, and generate comprehensive summaries. MCP can optionally be used to access external video processing tools or contextual databases.
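The agent collaboration described above can be pictured with a small handoff loop. This is a minimal sketch in plain Python, not the OpenAI Swarm library's API and not an implementation of the A2A protocol: `Message`, `Agent`, `run`, and the two stub handlers are hypothetical names chosen for illustration, and the handlers return canned strings where a real build would call Qwen3-VL, a retriever, or GPT-5.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    """Hypothetical envelope passed between agents."""
    sender: str
    content: str
    handoff_to: Optional[str] = None  # name of the next agent, or None to stop


@dataclass
class Agent:
    name: str
    handle: Callable[[str], Message]


def run(agents: dict[str, Agent], start: str, query: str, max_hops: int = 5) -> list[Message]:
    """Route a query from agent to agent until one stops handing off."""
    trace: list[Message] = []
    current, payload = start, query
    for _ in range(max_hops):  # hop limit guards against handoff cycles
        msg = agents[current].handle(payload)
        trace.append(msg)
        if msg.handoff_to is None:
            break
        current, payload = msg.handoff_to, msg.content
    return trace


# Stub handlers (assumptions): real agents would call a VLM, a retriever, or an LLM.
def visual_search(query: str) -> Message:
    return Message("visual_search", f"candidate segments for: {query}", handoff_to="summary")


def summary(evidence: str) -> Message:
    return Message("summary", f"summary based on [{evidence}]")


agents = {
    "visual_search": Agent("visual_search", visual_search),
    "summary": Agent("summary", summary),
}
trace = run(agents, "visual_search", "red car enters the frame")
for msg in trace:
    print(msg.sender, "->", msg.handoff_to, "|", msg.content)
```

Keeping the full trace of messages, rather than only the final answer, makes it easier to check which agent contributed which piece of evidence.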

Status: Always open
Difficulty: Advanced
Points: 500
Challenge at a glance

Host: Vera, AI Research & Mentorship
Starts: Available now
Run mode: Evergreen challenge

Learning goals

What you should walk away with

Master LlamaIndex for constructing advanced multimodal RAG pipelines, including chunking, embedding, and indexing video frames, audio transcripts, and object detections.

Integrate Qwen3-VL (or a similar high-performance VLM) for comprehensive visual scene understanding, object recognition, activity detection, and dense captioning across video segments.

Design an OpenAI Swarm-like multi-agent system where specialized agents (e.g., 'Visual Search Agent', 'Audio Transcriber Agent', 'Summary Agent') collaborate using an A2A protocol.

Use GPT-5's reasoning and context understanding to synthesize the results retrieved by the multimodal RAG pipeline and answer complex, nuanced queries about video content.

Implement strategies for 'extended thinking' to handle 'needle-in-a-haystack' scenarios, ensuring comprehensive search and cross-referencing of multimodal data over extended durations.

Develop tools or MCP interfaces for segmenting videos, extracting audio, and generating image sequences for multimodal processing.
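The segmentation and extraction tooling in the last goal could start from something like the sketch below. It splits a video's duration into overlapping windows and builds (but does not run) ffmpeg command lines for audio and frame extraction. The names `Window`, `segment_windows`, `audio_cmd`, and `frames_cmd`, along with the 60-second window, 5-second overlap, 16 kHz mono audio, and 1 frame-per-second sampling, are illustrative assumptions rather than requirements of the challenge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    index: int
    start_s: float
    end_s: float


def segment_windows(duration_s: float, window_s: float = 60.0, overlap_s: float = 5.0) -> list[Window]:
    """Split [0, duration_s] into fixed-length windows that overlap by overlap_s."""
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    step = window_s - overlap_s
    windows: list[Window] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)  # last window may be shorter
        windows.append(Window(len(windows), start, end))
        if end >= duration_s:
            break
        start += step
    return windows


def audio_cmd(video_path: str, out_wav: str) -> list[str]:
    """ffmpeg arguments: drop the video stream, write mono 16 kHz audio (assumed format)."""
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", out_wav]


def frames_cmd(video_path: str, window: Window, out_pattern: str, fps: float = 1.0) -> list[str]:
    """ffmpeg arguments: sample frames at `fps` from one window of the video."""
    return [
        "ffmpeg",
        "-ss", str(window.start_s),                  # seek to window start
        "-t", str(window.end_s - window.start_s),    # read only this window
        "-i", video_path,
        "-vf", f"fps={fps}",
        out_pattern,                                 # e.g. "frames/w000_%04d.jpg"
    ]


windows = segment_windows(30 * 60)  # a 30-minute video, default window and overlap
first = frames_cmd("input.mp4", windows[0], "frames/w000_%04d.jpg")
print(len(windows), windows[0], windows[-1])
print(" ".join(first))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` or exposed through an MCP tool without shell quoting issues. Keeping `start_s` and `end_s` on each window lets later retrieval results point back to a timestamp in the original video.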

