Build a Proactive Executive Assistant Agent with OpenAI Agents SDK
What you are building
The core problem, expected build, and operating context for this challenge.
Inspired by the recent discussion around advanced call-screening features, this challenge asks you to build a personalized, proactive AI executive assistant. The agent autonomously manages communications, prioritizes tasks, and synthesizes information, acting as a digital chief of staff. It should demonstrate multi-step reasoning, dynamic tool usage, and an understanding of user preferences across a range of professional scenarios.
You will use the OpenAI Agents SDK to orchestrate the agent workflow and implement function calling to integrate external tools for schedules, email, and information retrieval. The agent must adapt its behavior to the context of each incoming communication and to the user's current priorities, filtering and managing the flow of information much as a human executive assistant would.
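The function-calling integration described above can be pictured as a registry of named tools plus a dispatcher that runs whichever tool the model asks for. The sketch below is a minimal, standard-library-only illustration and not the SDK's own mechanism: the tool names `schedule_meeting` and `summarize_email`, the in-memory `CALENDAR`, and the `{"name": ..., "arguments": "<JSON string>"}` call shape are all assumptions made for the example. In a real build the OpenAI Agents SDK would handle tool registration and dispatch for you.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory calendar standing in for a real calendar API.
CALENDAR: list[dict[str, str]] = []


def schedule_meeting(title: str, start: str, end: str) -> dict[str, Any]:
    """Add an event to the in-memory calendar and echo it back."""
    event = {"title": title, "start": start, "end": end}
    CALENDAR.append(event)
    return {"status": "scheduled", "event": event}


def summarize_email(body: str) -> dict[str, Any]:
    """Placeholder summarizer: returns only the first sentence."""
    first_sentence = body.split(".")[0].strip()
    return {"summary": first_sentence}


# Registry mapping tool names (as a model would emit them) to callables.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "schedule_meeting": schedule_meeting,
    "summarize_email": summarize_email,
}


def dispatch(tool_call: dict[str, str]) -> dict[str, Any]:
    """Run one tool call shaped like {"name": ..., "arguments": "<JSON string>"}."""
    name = tool_call["name"]
    if name not in TOOLS:
        return {"status": "error", "error": f"unknown tool: {name}"}
    try:
        kwargs = json.loads(tool_call["arguments"])
        return TOOLS[name](**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or mismatched argument names are reported, not raised.
        return {"status": "error", "error": str(exc)}
```

Returning an error dictionary instead of raising lets the agent see the failure and retry with corrected arguments, which matters for the tool-related scoring dimensions.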
How submissions are scored
These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.
CorrectToolInvocation
Agent correctly identifies and invokes the appropriate tool (e.g., calendar for scheduling).
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
AccurateScheduling
Scheduled event details (time, duration, attendees) match user request and calendar availability.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
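One way to keep scheduled events consistent with calendar availability is to check every proposed slot against the user's busy intervals before booking. The sketch below assumes busy time arrives as a list of `(start, end)` datetime pairs. The helper names `is_free` and `first_free_slot` and the 15-minute search step are illustrative choices, not part of any calendar API.

```python
from datetime import datetime, timedelta
from typing import Optional

Interval = tuple[datetime, datetime]


def is_free(start: datetime, end: datetime, busy: list[Interval]) -> bool:
    """True if the half-open slot [start, end) overlaps no busy interval."""
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def first_free_slot(
    earliest: datetime,
    latest: datetime,
    duration: timedelta,
    busy: list[Interval],
    step: timedelta = timedelta(minutes=15),
) -> Optional[datetime]:
    """Return the first start time in [earliest, latest] that fits, else None."""
    candidate = earliest
    while candidate + duration <= latest:
        if is_free(candidate, candidate + duration, busy):
            return candidate
        candidate += step
    return None
```

Treating intervals as half-open means a meeting may start at the exact minute another one ends, which matches how most calendars display back-to-back events.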
KeyInfoExtraction
All critical action items are extracted from the email summary task.
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ResponseRelevance
Semantic similarity of the agent's conversational response to the expected answer. Target: 0.9 (range: 0-1).
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
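The challenge does not say how the evaluator computes semantic similarity. A common approach is cosine similarity between embedding vectors of the two texts, and the sketch below assumes that approach purely for illustration. The vectors are taken as plain lists of floats from whatever embedding model you choose, and `meets_target` is a hypothetical helper for local checks against the 0.9 target.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def meets_target(score: float, target: float = 0.9) -> bool:
    """Check a similarity score against the stated target."""
    return score >= target
```

Cosine similarity can in principle be negative, so a score reported on a 0-1 range implies the evaluator clips or rescales it in some way that is not documented here.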
PersonalizationAdherence
Degree to which the agent's actions and responses align with specified user preferences. Target: 0.85 (range: 0-1).
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
ToolExecutionSuccessRate
Fraction of tool invocations that complete without errors. Target: 0.95 (range: 0-1).
This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
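To track the tool execution success rate locally before submitting, you could route every tool call through a small wrapper that counts failures. The `ToolStats` class below is a hypothetical helper written for this sketch. It treats any raised exception as a failed invocation, which may differ from how the evaluator defines an error.

```python
from typing import Any, Callable


class ToolStats:
    """Counts tool invocations and how many of them raised an exception."""

    def __init__(self) -> None:
        self.calls = 0
        self.failures = 0

    def run(self, tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Invoke the tool, record the outcome, and return None on failure."""
        self.calls += 1
        try:
            return tool(*args, **kwargs)
        except Exception:
            self.failures += 1
            return None

    @property
    def success_rate(self) -> float:
        """Fraction of calls that completed; 1.0 when nothing has run yet."""
        if self.calls == 0:
            return 1.0
        return (self.calls - self.failures) / self.calls
```

With a 0.95 target, a single failure in twenty calls is the most a run can absorb, so validating arguments before invoking a tool is usually worth the extra step.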
What you should walk away with
Master the OpenAI Agents SDK for defining agent behavior, memory management, and tool integration patterns with GPT-4o
Implement robust function calling mechanisms to enable agents to interact with external APIs like calendar (Google Calendar API) and email (e.g., Gmail API via Zapier NLA)
Design personalized prompting strategies and few-shot examples within the OpenAI Agents SDK to tailor agent responses and actions to individual user preferences and historical interactions
Build a browser automation tool using E2B to enable the agent to access web-based information or dashboards as part of its executive assistant duties
Orchestrate complex agent workflows that involve dynamic tool selection and sequential reasoning for tasks such as meeting scheduling, email summarization, and task delegation
Deploy and manage agent instances, considering aspects like state persistence and secure API key management within the OpenAI ecosystem
Integrate Vellum for continuous evaluation and prompt experimentation, using A/B testing and trace analysis to refine agent performance and reduce hallucinations
Design an interactive user interface using Ellipsis to provide a seamless conversational experience for the executive assistant agent
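The personalized prompting and few-shot goal above can be approached by assembling the agent's instructions from stored user preferences and a few examples of past handling. The sketch below assumes preferences are a simple key-value mapping and examples are `(request, handling)` pairs. The function name `build_instructions` and the wording of the prompt are illustrative, not a prescribed format.

```python
def build_instructions(
    preferences: dict[str, str],
    examples: list[tuple[str, str]],
) -> str:
    """Assemble an instruction string from user preferences and few-shot examples."""
    lines = ["You are an executive assistant. Follow these user preferences:"]
    # Sort keys so the prompt is stable across runs, which eases A/B comparison.
    for key in sorted(preferences):
        lines.append(f"- {key}: {preferences[key]}")
    if examples:
        lines.append("Examples of past requests and the preferred handling:")
        for request, handling in examples:
            lines.append(f"Request: {request}")
            lines.append(f"Handling: {handling}")
    return "\n".join(lines)
```

The resulting string would be passed as the agent's instructions. Keeping preferences as data rather than hard-coded prompt text makes it easier to test personalization adherence against different user profiles.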
Requires VERSALIST_API_KEY. Works with any MCP-aware editor.