Edge Multimodal AI for AR Glasses: Real-time Assistant
What you are building
The core problem, expected build, and operating context for this challenge.
This challenge asks you to build an on-device, multimodal AI assistant for AR glasses. The system must process real-time voice and camera input, combined with simulated EMG handwriting (as a gesture proxy), and deliver context-aware, low-latency assistance. The assistant leverages the multimodal capabilities of Gemini 3 Pro for advanced reasoning and LangGraph for robust state management, with a strong focus on edge inference optimization via TFLite.
Shared data for this challenge
Review public datasets and any private uploads tied to your build.
What you should walk away with
Master the integration of `Gemini 3 Pro` for multimodal understanding and generation, combining voice, vision, and contextual data in a single request for real-time problem-solving (see the Gemini sketch after this list).
Implement a robust, stateful conversational workflow with `LangGraph` to manage user interactions, context switching, and multi-turn dialogue for the AR assistant (see the LangGraph sketch below).
Use `Fixie` to build a responsive, natural-language voice interface for the AR device, focused on low latency and natural turn-taking.
Optimize and deploy generative AI components for on-device inference with `TFLite`, including model quantization and compilation for resource-constrained edge hardware (see the quantization sketch below).
Design and implement a unified input pipeline that fuses real-time audio streams (voice), camera frames (vision), and simulated gesture inputs (e.g., from an EMG sensor proxy) into a coherent multimodal context for the assistant (see the fusion sketch below).
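A minimal sketch of a fused voice-and-vision request, assuming the `google-genai` Python SDK; the model identifier is taken from the challenge brief and may need adjusting, and the frame, transcript, and gesture inputs are hypothetical placeholders for your device streams.

```python
# Sketch: send one camera frame plus a voice transcript and gesture label
# to Gemini in a single multimodal call.
# Assumes the google-genai SDK (pip install google-genai) and GEMINI_API_KEY
# in the environment; the model id below is an assumption from the brief.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

def assist(frame_jpeg: bytes, transcript: str, gesture: str) -> str:
    """Fuse one visual frame, the user's utterance, and a gesture label."""
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed identifier; check your SDK's model list
        contents=[
            types.Part.from_bytes(data=frame_jpeg, mime_type="image/jpeg"),
            f"User said: {transcript!r}. Detected gesture: {gesture}. "
            "Answer concisely for a heads-up display.",
        ],
    )
    return response.text
```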
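A minimal `LangGraph` sketch of the stateful turn loop, assuming the `langgraph` package; the state schema, node names, and routing rule are illustrative, not prescribed by the challenge.

```python
# Sketch: a small LangGraph workflow that routes each turn either to the
# assistant or to a clarification step, carrying state through the graph.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AssistantState(TypedDict):
    transcript: str  # latest user utterance
    context: str     # rolling scene/task context
    reply: str       # assistant output for this turn

def answer(state: AssistantState) -> dict:
    # Hypothetical hook into the Gemini call from the previous sketch.
    return {"reply": f"(answer grounded in context: {state['context']})"}

def clarify(state: AssistantState) -> dict:
    return {"reply": "Could you repeat that?"}

def route(state: AssistantState) -> str:
    # Illustrative rule: an empty transcript needs clarification.
    return "clarify" if not state["transcript"].strip() else "answer"

graph = StateGraph(AssistantState)
graph.add_node("answer", answer)
graph.add_node("clarify", clarify)
graph.add_conditional_edges(START, route)
graph.add_edge("answer", END)
graph.add_edge("clarify", END)
app = graph.compile()

print(app.invoke({"transcript": "What am I looking at?", "context": "workbench", "reply": ""}))
```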
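A post-training quantization sketch using the stock TensorFlow Lite converter; the SavedModel path, input shape, and representative-data generator are hypothetical and should match your exported model.

```python
# Sketch: post-training int8 quantization of a SavedModel for edge inference.
# "saved_model/" and rep_data() are placeholders; feed a few hundred realistic
# samples so the converter can calibrate integer ranges.
import numpy as np
import tensorflow as tf

def rep_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # fully integer I/O suits NPU/DSP delegates
converter.inference_output_type = tf.int8

with open("assistant_vision.tflite", "wb") as f:
    f.write(converter.convert())
```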
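One possible shape for the fusion pipeline: drain each modality's queue once per tick and emit a time-aligned snapshot. Every name here is an illustrative assumption, not part of the challenge scaffolding.

```python
# Sketch: merge asynchronous voice, vision, and gesture events into one
# timestamped context snapshot per tick, reusing the previous value when
# a modality produced nothing new.
import queue
import time
from dataclasses import dataclass

@dataclass
class MultimodalContext:
    timestamp: float
    transcript: str = ""     # latest final ASR result
    frame_jpeg: bytes = b""  # most recent camera frame
    gesture: str = "none"    # label from the EMG handwriting proxy

class FusionPipeline:
    """Drains per-modality queues and emits a coherent snapshot."""

    def __init__(self) -> None:
        self.audio_q: "queue.Queue[str]" = queue.Queue()
        self.video_q: "queue.Queue[bytes]" = queue.Queue()
        self.gesture_q: "queue.Queue[str]" = queue.Queue()
        self._last = MultimodalContext(timestamp=time.time())

    @staticmethod
    def _drain(q: "queue.Queue"):
        item = None
        while not q.empty():  # keep only the newest event per tick
            item = q.get_nowait()
        return item

    def tick(self) -> MultimodalContext:
        self._last = MultimodalContext(
            timestamp=time.time(),
            transcript=self._drain(self.audio_q) or self._last.transcript,
            frame_jpeg=self._drain(self.video_q) or self._last.frame_jpeg,
            gesture=self._drain(self.gesture_q) or self._last.gesture,
        )
        return self._last
```

A snapshot from `tick()` maps directly onto the `assist()` arguments in the Gemini sketch above.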
```text
[ok] Wrote CHALLENGE.md
[ok] Wrote .versalist.json
[ok] Wrote eval/examples.json
```
Requires `VERSALIST_API_KEY`. Works with any MCP-aware editor.
Operating window
Key dates and the organization behind this challenge.
Hosted by DocsAI Research & Mentorship.