Agentic Video Scene Skipper
AI Research & Mentorship
What you are building
In this challenge you build an agentic system that interprets natural language requests to navigate video content. You will use Gemini 3 Pro's multimodal understanding and Langroid's agent framework to process user queries, run semantic search over video metadata, and execute simulated playback commands. The system must identify specific scenes from descriptions, character names, or quotes, combining hybrid reasoning with MCP tool integration to control a simulated media player in real time.
The project combines LLMs with a specialized agent framework and RAG techniques. You will design a graph-based workflow that parses queries, retrieves relevant video segments, and calls external tools. Success depends on careful prompt engineering, efficient data indexing, and robust error handling, so that a request either lands on the right scene or fails without moving the player.
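The parse, retrieve, and act workflow described above can be sketched in plain Python. This is only an illustrative skeleton under stated assumptions: `Scene`, `SimulatedPlayer`, `parse_query`, `retrieve`, and `handle_request` are hypothetical names, the term-overlap matching stands in for LLM intent parsing and semantic search, and nothing here uses the actual Langroid or MCP APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scene:
    """One indexed video segment (hypothetical schema)."""
    start_s: float
    end_s: float
    description: str
    characters: list = field(default_factory=list)


@dataclass
class SimulatedPlayer:
    """Stand-in for the simulated media player; not a real MCP server."""
    position_s: float = 0.0
    playing: bool = False

    def seek(self, seconds: float) -> dict:
        self.position_s = max(0.0, seconds)
        return self.status()

    def play(self) -> dict:
        self.playing = True
        return self.status()

    def status(self) -> dict:
        return {"position_s": self.position_s, "playing": self.playing}


def parse_query(text: str) -> set:
    """Step 1: reduce the request to lowercase search terms.
    A real build would ask the LLM to extract intent instead."""
    stop = {"the", "a", "to", "skip", "scene", "where", "go", "of", "in", "is"}
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return {w for w in words if w and w not in stop}


def retrieve(terms: set, scenes: list) -> Optional[Scene]:
    """Step 2: pick the scene sharing the most terms with the query."""
    def overlap(scene: Scene) -> int:
        haystack = set(scene.description.lower().split())
        haystack |= {c.lower() for c in scene.characters}
        return len(terms & haystack)

    best = max(scenes, key=overlap, default=None)
    return best if best is not None and overlap(best) > 0 else None


def handle_request(text: str, scenes: list, player: SimulatedPlayer) -> dict:
    """Step 3: act on the player, or report a miss without moving it."""
    scene = retrieve(parse_query(text), scenes)
    if scene is None:
        return {"ok": False, "reason": "no matching scene", **player.status()}
    player.seek(scene.start_s)
    return {"ok": True, **player.play()}


# Made-up sample data for illustration only.
SCENES = [
    Scene(0.0, 95.0, "opening credits over the city skyline"),
    Scene(95.0, 240.0, "mara argues with the captain on the bridge", ["Mara", "Captain"]),
    Scene(240.0, 410.0, "the chase through the market", ["Mara"]),
]

if __name__ == "__main__":
    player = SimulatedPlayer()
    print(handle_request("Skip to the scene where Mara argues with the captain", SCENES, player))
```

Keeping the miss path free of side effects is a deliberate choice in this sketch: when no scene matches, the player position is reported but not changed, which is one simple way to approach the error-handling requirement.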
What you should walk away with
Master Langroid for developing stateful, multi-step conversational agents for complex interactions with external systems.
Implement advanced RAG pipelines with LlamaIndex using hybrid indexing (vector + keyword) for comprehensive video content metadata and transcript segment retrieval.
Design MCP-enabled tool integration for a simulated video player API, allowing Langroid agents to control playback, skip to identified scenes, and retrieve current video status.
Utilize Gemini 3 Pro's multimodal capabilities for understanding nuanced natural language queries about video content and accurately inferring user intent from context.
Build extended thinking patterns within the Langroid agent, enabling it to decompose complex scene descriptions into executable search queries and precise playback commands.
Deploy a lightweight vector database (e.g., ChromaDB, Milvus) for efficient similarity search and retrieval of video scene embeddings and associated metadata.
Develop a prompt engineering strategy for Claude Sonnet 4 to refine initial Gemini 3 Pro outputs, ensuring precise scene identification and minimizing false-positive scene skips.
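Several of the outcomes above (hybrid vector plus keyword retrieval, similarity search over scene embeddings, and avoiding false-positive skips) can be illustrated together in a small standard-library sketch. Everything in it is an assumption made for illustration: character-trigram counts stand in for a real embedding model, an in-memory dict stands in for a vector database, and `hybrid_rank`, `pick_scene`, `alpha`, `min_score`, and `min_margin` are hypothetical names and parameters, not LlamaIndex, ChromaDB, or Milvus APIs.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Toy stand-in for an embedding model: character trigram counts."""
    s = f"  {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query tokens that appear verbatim in the segment."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_rank(query: str, segments: dict, alpha: float = 0.5) -> list:
    """Return (score, start_seconds) pairs, best first.
    The score is a weighted sum of the vector and keyword signals."""
    qv = trigram_vector(query)
    scored = [
        (alpha * cosine(qv, trigram_vector(text)) + (1 - alpha) * keyword_score(query, text), start)
        for start, text in segments.items()
    ]
    return sorted(scored, reverse=True)


def pick_scene(query: str, segments: dict, min_score: float = 0.5, min_margin: float = 0.1):
    """Return a start time only if the top hit is strong and unambiguous.
    Returning None means: do not skip, ask the user to clarify instead."""
    ranked = hybrid_rank(query, segments)
    if not ranked or ranked[0][0] < min_score:
        return None  # weak match: skipping would likely be a false positive
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin:
        return None  # two scenes score almost the same: ambiguous
    return ranked[0][1]


# Made-up transcript segments keyed by start time in seconds.
SEGMENTS = {
    12.0: "we need a bigger boat",
    300.0: "the storm hits the harbor at night",
    615.0: "i told you the map was wrong",
}

if __name__ == "__main__":
    print(pick_scene("bigger boat", SEGMENTS))
    print(pick_scene("dragon", SEGMENTS))
```

With this sample data, `pick_scene("bigger boat", SEGMENTS)` returns `12.0`, while `pick_scene("dragon", SEGMENTS)` returns `None` because no segment clears the score threshold. The threshold and margin values are arbitrary starting points; in a real build they would be tuned against labeled queries, and a second model pass could review borderline candidates before any skip is issued.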