Agent Building · Advanced · Always open

Build a Proactive Executive Assistant Agent with OpenAI Agents SDK


Challenge brief

What you are building

The core problem, expected build, and operating context for this challenge.

Inspired by the recent discussion around advanced call-screening features, this challenge tasks you with developing a personalized, proactive AI executive assistant. The agent will autonomously manage communications, prioritize tasks, and synthesize information, effectively acting as a digital chief of staff. It should demonstrate complex reasoning, dynamic tool usage, and an understanding of user preferences across a range of professional scenarios.

The core of the challenge is orchestrating a sophisticated agent workflow with the OpenAI Agents SDK. You will implement function calling to integrate external tools for managing schedules, email, and information retrieval. The agent needs to exhibit nuanced decision-making, adapting its behavior to the context of incoming communications and the user's current priorities, much as a human executive assistant filters and manages information flow.
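As a minimal sketch of the expected shape, assuming the openai-agents Python package (from agents import Agent, Runner, function_tool) and an OPENAI_API_KEY in the environment; the tool body, prompt, and names below are illustrative placeholders, not a prescribed implementation:

# Minimal executive-assistant sketch using the OpenAI Agents SDK (openai-agents).
# The scheduling tool is a stub; a real build would call a calendar backend here.
from agents import Agent, Runner, function_tool


@function_tool
def schedule_meeting(title: str, start_iso: str, duration_minutes: int, attendees: list[str]) -> str:
    """Create a calendar event and return a confirmation string (stub)."""
    # Placeholder: swap in e.g. a Google Calendar API call for a real integration.
    return f"Scheduled '{title}' at {start_iso} for {duration_minutes} min with {', '.join(attendees)}."


assistant = Agent(
    name="Executive Assistant",
    instructions=(
        "You are a proactive executive assistant. Use the scheduling tool for meeting "
        "requests; otherwise answer directly and concisely."
    ),
    tools=[schedule_meeting],
)

result = Runner.run_sync(assistant, "Book a 30-minute sync with dana@example.com tomorrow at 10:00.")
print(result.final_output)

The same pattern extends to email and information-retrieval tools: each external capability becomes a typed, documented function the agent can decide to call.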

Datasets

Shared data for this challenge

Review public datasets and any private uploads tied to your build.

Evaluation rubric

How submissions are scored

These dimensions define what the evaluator checks, how much each dimension matters, and which criteria separate a passable run from a strong one.

Max Score: 6
Dimensions: 6 scoring checks
Binary: 6 pass or fail dimensions
Ordinal: 0 scaled dimensions
Dimension 1: CorrectToolInvocation

Agent correctly identifies and invokes the appropriate tool (e.g., calendar for scheduling).

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 2: AccurateScheduling

Scheduled event details (time, duration, attendees) match user request and calendar availability.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 3: KeyInfoExtraction

All critical action items are extracted from the email summary task.

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 4: ResponseRelevance

Semantic similarity of the agent's conversational response to the expected answer, scored pass/fail against the target (see the sketch after this rubric). • target: 0.9 • range: 0-1

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 5: PersonalizationAdherence

Degree to which the agent's actions and responses align with specified user preferences, scored pass/fail against the target. • target: 0.85 • range: 0-1

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.

Dimension 6: ToolExecutionSuccessRate

Percentage of tool invocations that complete successfully without errors, scored pass/fail against the target. • target: 0.95 • range: 0-1

binary
Weight: 1
Binary check

This dimension contributes its full weight only when the submission satisfies the requirement. Partial credit is not awarded.
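The three 0-1 dimensions above (ResponseRelevance, PersonalizationAdherence, ToolExecutionSuccessRate) are still binary checks: the continuous score is compared against the stated target. The challenge does not publish the evaluator's implementation; purely as an illustration, a relevance check could be approximated with embedding cosine similarity thresholded at the 0.9 target (model choice and helper name are assumptions):

# Illustrative only: approximate a ResponseRelevance-style check with OpenAI embeddings.
# Assumes OPENAI_API_KEY is set; the actual challenge evaluator may work differently.
import math
from openai import OpenAI

client = OpenAI()

def relevance_passes(agent_reply: str, expected_answer: str, target: float = 0.9) -> bool:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[agent_reply, expected_answer],
    )
    a, b = resp.data[0].embedding, resp.data[1].embedding
    # Cosine similarity between the two embedding vectors, thresholded at the target.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm >= target

print(relevance_passes("Your 1:1 with Dana is booked for 10:00 tomorrow.",
                       "Meeting with Dana scheduled tomorrow at 10am."))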

Learning goals

What you should walk away with

Master the OpenAI Agents SDK for defining agent behavior, memory management, and tool integration patterns with GPT-4o

Implement robust function calling mechanisms to enable agents to interact with external APIs like calendar (Google Calendar API) and email (e.g., Gmail API via Zapier NLA)

Design personalized prompting strategies and few-shot examples within the OpenAI Agents SDK to tailor agent responses and actions to individual user preferences and historical interactions

Build a browser automation tool using E2B to enable the agent to access web-based information or dashboards as part of its executive assistant duties

Orchestrate complex agent workflows that involve dynamic tool selection and sequential reasoning for tasks such as meeting scheduling, email summarization, and task delegation (see the handoff sketch after this list)

Deploy and manage agent instances, considering aspects like state persistence and secure API key management within the OpenAI ecosystem

Integrate Vellum for continuous evaluation and prompt experimentation, using A/B testing and trace analysis to refine agent performance and reduce hallucinations

Design an interactive user interface using Ellipsis to provide a seamless conversational experience for the executive assistant agent
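One way to approach the orchestration goal is the Agents SDK's handoff mechanism: a triage agent routes each request to a specialist sub-agent. This is a sketch under the assumption that the sub-agents and the stub tool fetch_unread_emails are placeholders for real calendar/email integrations, not the prescribed architecture:

# Workflow-orchestration sketch with the OpenAI Agents SDK (openai-agents).
# Sub-agents and tool bodies are illustrative stubs.
from agents import Agent, Runner, function_tool


@function_tool
def fetch_unread_emails(max_items: int = 10) -> list[str]:
    """Return recent unread email bodies (stub; replace with a Gmail or Zapier NLA call)."""
    return ["Budget review moved to Friday; please confirm attendance and send the Q3 deck."]


scheduler = Agent(
    name="Scheduler",
    instructions="Handle meeting requests: check availability, then book the event.",
    # tools=[schedule_meeting],  # e.g. the calendar tool from the earlier sketch
)

summarizer = Agent(
    name="Email Summarizer",
    instructions="Summarize unread email and extract every action item as a bullet list.",
    tools=[fetch_unread_emails],
)

triage = Agent(
    name="Chief of Staff",
    instructions=(
        "Route each request: hand off scheduling to Scheduler and inbox questions to "
        "Email Summarizer; answer everything else yourself, following user preferences."
    ),
    handoffs=[scheduler, summarizer],
)

result = Runner.run_sync(triage, "What do I need to act on from today's email?")
print(result.final_output)

State persistence, secure key management, and evaluation tooling (Vellum) and UI (Ellipsis) from the goals above would be layered around this core loop.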

Start from your terminal
$ npx -y @versalist/cli start build-a-proactive-executive-assistant-agent-with-openai-agents-sdk

[ok] Wrote CHALLENGE.md

[ok] Wrote .versalist.json

[ok] Wrote eval/examples.json

Requires VERSALIST_API_KEY. Works with any MCP-aware editor.

Timeline and host

Operating window

Key dates and the organization behind this challenge.

Host
Vera (AI Research & Mentorship)
Start date
Available now
Run mode
Evergreen challenge

Tool Space Recipe

Draft
Evaluation
Rubric: 6 dimensions
· CorrectToolInvocation (weight 1)
· AccurateScheduling (weight 1)
· KeyInfoExtraction (weight 1)
· ResponseRelevance (weight 1)
· PersonalizationAdherence (weight 1)
· ToolExecutionSuccessRate (weight 1)
Gold items: 2 (2 public)
