AI Development
Advanced
Always open

Multi-Modal Image Editing Agent

In this challenge you build a multi-modal AI agent system that interprets complex text and voice prompts and performs the corresponding image editing operations. The system uses a graph-based workflow to break each high-level request into a sequence of atomic image manipulation tasks and executes them through integrated tools. A key focus is enabling 'extended thinking' in the agent so that it can handle nuanced instructions, refine its edits iteratively, and adapt its approach during the editing process.

The project requires a robust LangGraph state machine that manages the image editing pipeline. The core agent, powered by Gemini 2.5 Pro, interprets the multi-modal input, generates intermediate editing plans, and dynamically selects and invokes image manipulation tools. The goal is a seamless user experience in which text or voice commands become refined image outputs through agent orchestration and tool integration.
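The graph-based workflow described above can be sketched in plain Python to show the control flow before any framework is involved. This is a minimal illustration under stated assumptions, not LangGraph's API: the node names, the `EditState` fields, the keyword-based planner (a stand-in for the Gemini 2.5 Pro call), and the dictionary "image" are all hypothetical.

```python
from typing import Callable, Dict, List, TypedDict


class EditState(TypedDict):
    request: str            # the user's high-level instruction
    plan: List[str]         # atomic steps still waiting to run
    image: Dict[str, int]   # simulated image: a few numeric properties only
    log: List[str]          # atomic steps already executed


def plan_node(state: EditState) -> str:
    # Stand-in for the model call: keyword lookup instead of real reasoning.
    steps: List[str] = []
    if "brighter" in state["request"]:
        steps.append("brighten")
    if "smaller" in state["request"]:
        steps.append("halve_size")
    state["plan"] = steps
    return "review"


def execute_node(state: EditState) -> str:
    # Run exactly one atomic step, then hand control back to the decision node.
    step = state["plan"].pop(0)
    img = state["image"]
    if step == "brighten":
        img["brightness"] = min(255, img["brightness"] + 40)
    elif step == "halve_size":
        img["width"] //= 2
        img["height"] //= 2
    state["log"].append(step)
    return "review"


def review_node(state: EditState) -> str:
    # Decision node: loop back while work remains, otherwise stop.
    return "execute" if state["plan"] else "END"


NODES: Dict[str, Callable[[EditState], str]] = {
    "plan": plan_node,
    "execute": execute_node,
    "review": review_node,
}


def run_graph(state: EditState, entry: str = "plan", max_steps: int = 20) -> EditState:
    node = entry
    for _ in range(max_steps):
        node = NODES[node](state)  # each node returns the name of the next node
        if node == "END":
            return state
    raise RuntimeError("graph did not terminate within max_steps")


initial: EditState = {
    "request": "make it brighter and smaller",
    "plan": [],
    "image": {"brightness": 100, "width": 640, "height": 480},
    "log": [],
}
final = run_graph(initial)
```

The `review` node is the only place that decides whether to loop, which keeps the cycle easy to bound with `max_steps`. In an actual build, each function would become a LangGraph node and the returned names would become conditional edges.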

Status: Always open
Difficulty: Advanced
Points: 500
Challenge at a glance
Host and timing
Vera

AI Research & Mentorship

Starts: Available now
Evergreen challenge

Learning goals

What you should walk away with

Master LangGraph for building stateful, cyclic agent workflows with dynamic tool invocation and decision nodes.

Implement multi-modal input processing with Gemini 2.5 Pro so that text and voice prompts are interpreted as image editing directives.

Design and develop robust tool integration via function calling, connecting the agent to external image processing libraries (e.g., OpenCV, Pillow) or simulated APIs.

Build 'extended thinking' pipelines using Gemini 2.5 Pro's capabilities for iterative self-correction and adaptive reasoning in complex generative tasks.

Orchestrate agent-to-tool communication patterns for efficient and reliable execution of image manipulation commands.

Deploy the agent system in a local environment, demonstrating multi-modal interaction and visual output generation.
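The tool-integration and agent-to-tool communication goals above can be illustrated with a small function-calling dispatcher. This is a sketch under assumptions rather than a reference implementation: the tool names, the JSON call shape (`{"name": ..., "args": {...}}`), and the nested-list "image" are hypothetical stand-ins. In an actual build the tools would wrap a library such as Pillow or OpenCV, and the calls would come from the model.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

Image = List[List[int]]  # simulated grayscale image: rows of 0-255 values

TOOLS: Dict[str, Callable[..., Image]] = {}


def tool(fn: Callable[..., Image]) -> Callable[..., Image]:
    """Register a function under its own name so the agent can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def brighten(image: Image, amount: int) -> Image:
    # Clamp every pixel to the valid 0-255 range after adding the offset.
    return [[max(0, min(255, px + amount)) for px in row] for row in image]


@tool
def flip_horizontal(image: Image) -> Image:
    return [list(reversed(row)) for row in image]


def dispatch(image: Image, call_json: str) -> Dict[str, Any]:
    """Run one tool call; report failures as data instead of raising."""
    try:
        call = json.loads(call_json)
        fn = TOOLS[call["name"]]
        return {"ok": True, "image": fn(image, **call.get("args", {}))}
    except KeyError as exc:
        # Unknown tool name, or the call has no "name" field.
        return {"ok": False, "image": image, "error": f"unknown tool or field: {exc}"}
    except (TypeError, ValueError) as exc:
        # Wrong arguments for the tool, or malformed JSON.
        return {"ok": False, "image": image, "error": str(exc)}


def run_calls(image: Image, calls: List[str]) -> Tuple[Image, List[str]]:
    """Apply calls in order; collect errors to feed back to the agent."""
    errors: List[str] = []
    for call_json in calls:
        result = dispatch(image, call_json)
        image = result["image"]  # unchanged when the call failed
        if not result["ok"]:
            errors.append(result["error"])
    return image, errors


edited, feedback = run_calls(
    [[10, 20], [30, 250]],
    [
        '{"name": "brighten", "args": {"amount": 10}}',
        '{"name": "sharpen"}',  # not registered, so it is reported, not fatal
        '{"name": "flip_horizontal"}',
    ],
)
```

Returning errors as data rather than raising lets the agent see what went wrong and issue a corrected call on the next turn, which is one way to support the iterative self-correction the goals describe.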


