Writing about evals, agents, prompt systems, and the feedback loops that make AI products sharper.
Product notes, engineering essays, and platform thinking from inside Versalist. The focus stays on work that changes outcomes: evaluation design, tool ergonomics, agent-native workflows, and disciplined iteration.
Featured
The current thread running through the product.
autoresearcher
How Versalist turns rubrics, gold items, and prompt skills into an autonomous experimentation loop.
Latest writing
Recent product, platform, and engineering notes tied back to the rest of the site.
Challenges Should Live Where Agents Work
We shipped a CLI and MCP server so AI agents can browse, start, and submit Versalist challenges without leaving the terminal or editor.
We've been building an RL platform. We just didn't say it.
Challenges are environments. Skills are policies. Scores are reward signals. Episodes make the loop real.
Beyond Pass/Fail: Why We Added Structured Rubrics to Evaluate Multi-Agent Systems
Binary pass/fail tests don't capture what matters in multi-agent systems. We added Rubric as a first-class primitive: structured, weighted dimensions that score nuanced behaviors.
Meta-Reasoning: Why Your LLM Needs to Think About Thinking
Most AI systems are black boxes. Meta-reasoning changes that by adding observability, evaluation, and self-improvement to production AI.
Beyond the Leaderboard: Defining the Meaningful AI Challenge
Versalist's philosophy for challenges that push AI toward discovery, responsibility, and world-changing engineering.