Multimodal Asset Generation
What you are building
The core problem, expected build, and operating context for this challenge.
In this challenge you build a generative AI system that produces creative marketing assets, including images with embedded text, from complex briefs and brand guidelines. Using the multimodal capabilities of Gemini 3 and Nano Banana Pro, you orchestrate a workflow that generates visually compelling images and renders accurate, contextually relevant text directly within them.
The core of the challenge is combining prompt optimization in DSPy with knowledge retrieval in LlamaIndex. Brand style guides and creative objectives are fetched through RAG and used to adapt the prompts sent to Gemini 3 and Nano Banana Pro, and the system self-corrects to improve text fidelity and image quality. The result simulates a creative agency assistant that turns abstract marketing concepts into concrete visual outputs.
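The workflow described above (retrieve brand context, build a prompt, generate, check the rendered text, retry with a correction) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the challenge's reference design: `retrieve`, `generate`, and `read_text` are hypothetical callables standing in for a LlamaIndex query, a Gemini 3 / Nano Banana Pro image call, and an OCR or vision step, and `build_prompt`, `run_pipeline`, and `GenerationResult` are illustrative names.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class GenerationResult:
    prompt: str         # last prompt sent to the (hypothetical) image model
    rendered_text: str  # text read back from the last generated image
    attempts: int       # number of generation attempts used
    accepted: bool      # True if the rendered text matched the required text


def build_prompt(brief: str, guidelines: List[str], required_text: str,
                 feedback: str = "") -> str:
    """Combine the brief, retrieved guidelines, and any correction into one prompt."""
    lines = [f"Brief: {brief}", "Brand guidelines:"]
    lines += [f"- {g}" for g in guidelines]
    lines.append(f'Render exactly this text in the image: "{required_text}"')
    if feedback:
        lines.append(f"Correction from previous attempt: {feedback}")
    return "\n".join(lines)


def run_pipeline(brief: str, required_text: str,
                 retrieve: Callable[[str], List[str]],  # assumed RAG lookup
                 generate: Callable[[str], Any],        # assumed image model call
                 read_text: Callable[[Any], str],       # assumed OCR / vision step
                 max_attempts: int = 3) -> GenerationResult:
    """Generate, validate the embedded text, and retry with feedback on mismatch."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    guidelines = retrieve(brief)
    feedback, prompt, rendered = "", "", ""
    for attempt in range(1, max_attempts + 1):
        prompt = build_prompt(brief, guidelines, required_text, feedback)
        rendered = read_text(generate(prompt))
        if rendered.strip() == required_text.strip():
            return GenerationResult(prompt, rendered, attempt, True)
        # Feed the observed error back into the next prompt.
        feedback = f'previous image showed "{rendered}", expected "{required_text}"'
    return GenerationResult(prompt, rendered, max_attempts, False)
```

Passing the three external steps in as callables keeps the loop testable with stubs before any API key or vector store is wired in. The exact-match check is deliberately strict and could be swapped for a similarity score.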
What you should walk away with
Master multimodal prompt engineering for Gemini 3 and Nano Banana Pro to control image composition, style, and embedded text attributes.
Use DSPy's `Signature` and `Predict` to design a declarative pipeline that generates images and optimizes text rendering quality.
Integrate LlamaIndex with vector databases to perform RAG on a corpus of brand guidelines, marketing assets, and style examples, feeding context into DSPy prompts.
Build a feedback loop using DSPy's `BootstrapFewShot` optimizer or custom metrics to iteratively refine prompts and improve the text accuracy and aesthetic quality of generated images.
Develop a mechanism for parsing and validating text content within generated images, ensuring consistency with input requirements and brand messaging.
Design a scalable architecture for deploying multimodal generative agents, considering API rate limits and computational resources.
Explore advanced techniques for zero-shot and few-shot multimodal generation using Gemini 3 and Nano Banana Pro within the DSPy framework.
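The validation and feedback-loop items above need a way to score how closely the text in a generated image matches the required copy. The following is a minimal sketch of such a score, written in the `metric(example, pred, trace=None)` shape that DSPy metrics conventionally take. The field names `required_text` and `rendered_text`, the 0.95 threshold, and the choice of `difflib` similarity are assumptions made for illustration; reading the text out of the image (OCR or a vision model) is assumed to happen before this function is called.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so layout differences are not penalized."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def text_fidelity(required: str, rendered: str) -> float:
    """Return a similarity ratio in [0, 1] between required and rendered text."""
    matcher = difflib.SequenceMatcher(None, normalize(required), normalize(rendered))
    return matcher.ratio()


def text_fidelity_metric(example, pred, trace=None, threshold: float = 0.95):
    """DSPy-style metric: a float score for evaluation, a bool when a trace is given.

    `example.required_text` and `pred.rendered_text` are hypothetical field names.
    """
    score = text_fidelity(example.required_text, pred.rendered_text)
    if trace is not None:
        # During bootstrapping, only keep demonstrations that clear the threshold.
        return score >= threshold
    return score
```

A function of this shape could be handed to an optimizer such as `BootstrapFewShot` as its metric, or called directly inside a custom retry loop. A character-level ratio is forgiving of small OCR noise but does not judge typography or placement, so aesthetic quality would need a separate check.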