planning

Design Multimodal RAG Pipeline

Inspect the original prompt language first, then copy or adapt it once you know how it fits your workflow.

Linked challenge: Multimodal Video Intelligence with Qwen3-VL, GPT-5 & LlamaIndex

Format

Text-first

Prompt source

Original prompt text with formatting preserved for inspection.

1 line

1 section

No variables

0 checklist items

Outline the full multimodal RAG pipeline for processing a 30-minute video. Detail how video will be segmented, how Qwen3-VL will extract visual features and text, how audio will be transcribed, and how LlamaIndex will index these disparate data types into a unified knowledge base.
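As a starting point for the outline this prompt asks for, the sketch below shows one possible way to cut a 30-minute video into overlapping time windows and attach transcript utterances to them, so that each window becomes a single timestamped record ready for indexing. The 60-second window, 5-second overlap, and the names `Segment`, `segment_video`, `attach_transcript`, and `to_record` are illustrative assumptions, not part of the original prompt. The `caption` field stands in for whatever Qwen3-VL returns for that window; no model or LlamaIndex call is made here.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time window of the video (hypothetical structure)."""
    start_s: float
    end_s: float
    caption: str = ""  # placeholder for a visual description from a vision-language model
    transcript: list[str] = field(default_factory=list)


def segment_video(duration_s: float, window_s: float = 60.0,
                  overlap_s: float = 5.0) -> list[Segment]:
    """Split [0, duration_s] into fixed windows that overlap by overlap_s."""
    if window_s <= overlap_s:
        raise ValueError("window_s must be larger than overlap_s")
    step = window_s - overlap_s
    segments: list[Segment] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        segments.append(Segment(start, end))
        if end >= duration_s:
            break  # the last window reached the end of the video
        start += step
    return segments


def attach_transcript(segments: list[Segment],
                      utterances: list[tuple[float, float, str]]) -> None:
    """Add each (start_s, end_s, text) utterance to every window it overlaps."""
    for u_start, u_end, text in utterances:
        for seg in segments:
            if u_start < seg.end_s and u_end > seg.start_s:
                seg.transcript.append(text)


def to_record(seg: Segment) -> dict:
    """Flatten a window into one text body plus timestamp metadata."""
    body = "\n".join(part for part in [seg.caption, " ".join(seg.transcript)] if part)
    return {"text": body, "metadata": {"start_s": seg.start_s, "end_s": seg.end_s}}


segments = segment_video(30 * 60)
attach_transcript(segments, [(58.0, 62.0, "Welcome to the session.")])
records = [to_record(seg) for seg in segments]
```

Keeping the timestamps in the metadata rather than in the text is a design choice: it lets a later retrieval step report where in the video an answer was found without parsing the text body.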

Adaptation plan

Keep the source stable, then change the prompt in a predictable order so the next run is easier to evaluate.

Keep stable

Preserve the role framing, objective, and reporting structure so comparison runs stay coherent.

Tune next

Swap in your own domain constraints, anomaly thresholds, and examples before you branch variants.

Verify after

Check whether the prompt asks for the right evidence, confidence signal, and escalation path.

Prompt diagnostics

Purpose

planning

This prompt is mostly narrative and instruction-driven, so adapt examples and output constraints before you rewrite the structure.

Linked challenge

Multimodal Video Intelligence with Qwen3-VL, GPT-5 & LlamaIndex

Inspired by advancements in long-context multimodal understanding, this challenge tasks you with building a cutting-edge video intelligence system. You will integrate the Qwen3-VL model for robust video and image analysis with GPT-5 for higher-level reasoning and synthesis. The system will leverage LlamaIndex for advanced RAG over multimodal data, allowing it to accurately answer complex 'needle-in-a-haystack' queries spanning long video durations. The core of the system will involve processing entire 30-minute video segments, extracting key visual and auditory information, generating multimodal embeddings, and indexing them using LlamaIndex. An OpenAI Swarm-like orchestration will manage specialized agents that collaborate using an A2A protocol to perform visual search, event detection, and generate comprehensive summaries. MCP could be used to facilitate access to external video processing tools or contextual databases.
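To make the 'needle-in-a-haystack' idea concrete, here is a minimal sketch of retrieval over timestamped segment records. It deliberately swaps the multimodal embeddings described above for a toy bag-of-words cosine score so that it runs with the standard library alone; in the challenge itself the scoring would come from an embedding model and a LlamaIndex index. The record layout and the names `tokenize`, `cosine`, and `search` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(records: list[dict], query: str, top_k: int = 1) -> list[dict]:
    """Return up to top_k records that best match the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(r["text"]))), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for score, record in scored[:top_k] if score > 0]


# Hypothetical records: one per video window, with illustrative descriptions.
records = [
    {"start_s": 0, "end_s": 60, "text": "A presenter introduces the agenda"},
    {"start_s": 840, "end_s": 900, "text": "A person opens a red umbrella near the whiteboard"},
    {"start_s": 1740, "end_s": 1800, "text": "Closing remarks and questions"},
]

hits = search(records, "When does the red umbrella appear?")
```

Because each hit carries `start_s` and `end_s`, a downstream synthesis step can cite the exact window rather than the whole video.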

Related prompts

Implement Video Processing & Qwen3-VL Integration

implementation

Orchestrate Swarm Agents & GPT-5 Synthesis

implementation

Execute 'Needle-in-a-Haystack' Queries

testing