autoresearcher
How Versalist turns rubrics, gold items, and prompt skills into an autonomous experimentation loop.
I'm not a machine learning researcher. I've never published a paper. I couldn't explain the Muon optimizer at a whiteboard.
But recently I pointed an AI agent at a query-expansion model, told it what metric to optimize, and went to sleep. Eight hours later I woke up to dozens of completed experiments, a smaller model outperforming the previous baseline, and an experiment log I couldn't stop reading.
Each entry read like a tiny research paper. Hypothesis, method, result, decision. The agent tried things I wouldn't have considered: reordering few-shot examples, inserting chain-of-thought preambles, targeting the weakest rubric dimension. Some ideas bombed. The good ones compounded.
I learned more about prompt optimization from that log than from months of following ML researchers on Twitter.
That is why we built this into Versalist.
What is autoresearch?
The concept comes from Andrej Karpathy's autoresearch. Give an AI agent a training setup, a metric to optimize, and a time budget. Let it experiment autonomously. It modifies code, runs the experiment, checks if the result improved, keeps or discards, and repeats. Hundreds of experiments. Additive improvements that compound.
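The propose-evaluate-keep loop described above can be sketched in a few lines. This is a minimal sketch, not Karpathy's actual code: `propose_change`, `run_experiment`, and `baseline` are hypothetical stand-ins for the agent's hypothesis step, the training/eval run, and the starting setup.

```python
def autoresearch(baseline, run_experiment, propose_change, budget):
    """Generic autoresearch loop: propose a change, evaluate it,
    keep it only if the metric improved, and log everything."""
    best, best_score = baseline, run_experiment(baseline)
    history = []
    for _ in range(budget):
        candidate = propose_change(best, history)  # agent hypothesizes a change
        score = run_experiment(candidate)          # run the experiment
        kept = score > best_score                  # keep only improvements
        if kept:
            best, best_score = candidate, score
        history.append((candidate, score, kept))   # feeds the next hypothesis
    return best, best_score, history
```

Because discarded experiments never overwrite `best`, improvements are additive: each kept change builds on every previous win.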
The key insight: you don't write the code. You write the research program. A markdown file that tells the agent what to explore, what constraints to follow, and what strategies to try. You're a research director, not a coder.
Karpathy's version targets model training on GPUs. We realized the same loop applies to any challenge on Versalist, and it doesn't need a GPU to start.
How autoresearcher works on Versalist
Every Versalist challenge already has the building blocks:
- A skill (your prompt/instructions) that gets evaluated.
- Gold items (test cases) the skill is tested against.
- A rubric with weighted dimensions that produce a score.
- An episode runner that executes the evaluation and returns results.
autoresearcher wraps these into an autonomous loop. You write a research program. The agent reads it plus the experiment history, hypothesizes a change, modifies the skill, evaluates it through the existing episode runner, compares the score to best-so-far, and keeps or discards. Repeat until the budget is exhausted.
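The score the loop compares is a weighted combination over rubric dimensions, averaged across gold items. Here is a minimal sketch under assumed names (`RubricDimension`, `evaluate_skill`, `run_episode` are illustrative, not Versalist's actual API):

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    weight: float  # relative importance of this dimension

def weighted_score(dimension_scores: dict, rubric: list) -> float:
    """Combine per-dimension scores (0..1) into one weighted score."""
    total_weight = sum(d.weight for d in rubric)
    return sum(d.weight * dimension_scores[d.name] for d in rubric) / total_weight

def evaluate_skill(skill: str, gold_items: list, rubric: list, run_episode) -> float:
    """Run the skill against every gold item and average the rubric scores.
    `run_episode` returns a dict of per-dimension scores for one gold item."""
    scores = [weighted_score(run_episode(skill, item), rubric) for item in gold_items]
    return sum(scores) / len(scores)
```

A single number per skill version is what makes the keep-or-discard comparison mechanical, and the per-dimension breakdown is what lets the agent target the weakest dimension.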
You set a budget, write your research program, pick a skill to optimize, and hit Start. Then you go to lunch. Or to sleep. Or to your actual job.
What you get back
- Score progression chart: Visual improvement over time so you can see where the biggest gains landed.
- Complete experiment log: Every hypothesis, change, score delta, and keep/discard decision with reasoning.
- Best-performing skill version: Ready to submit to the leaderboard or deploy in production.
- Top Discoveries summary: The changes that moved the needle most, highlighted for quick review.
The research program
The interesting part is not the automation. It's the research program. This is where the human directs the research.
A basic research program sets a goal ("maximize overall score on the query expansion challenge"), lists strategies to explore (chain-of-thought decomposition, few-shot example selection, output format optimization), and defines constraints (keep the prompt under 2000 tokens, make one focused change per experiment).
A more sophisticated one encodes methodology: broad structural changes first, then focus on the weakest rubric dimension, then fine-tune the best approach. Every fifth experiment, try something radically different to escape local optima. If three consecutive experiments are discarded, change strategy entirely.
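Put together, a research program of this shape might look like the following markdown file. The section names and structure are illustrative, not a required schema:

```markdown
# Research program: query expansion

## Goal
Maximize overall score on the query expansion challenge.

## Strategies to explore
- Chain-of-thought decomposition
- Few-shot example selection and ordering
- Output format optimization

## Constraints
- Keep the prompt under 2000 tokens.
- Make one focused change per experiment.

## Methodology
1. Broad structural changes first.
2. Then focus on the weakest rubric dimension.
3. Then fine-tune the best approach.

- Every fifth experiment, try something radically different to escape local optima.
- If three consecutive experiments are discarded, change strategy entirely.
```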
The research program is the product. It is what separates a marginal improvement from a large one. The human sets the research direction. The agent handles the experimentation.
What we learned building this
- The agent is a better researcher than you expect: It doesn't just try random changes. By experiment 10, it identifies patterns across its own history: which rubric dimensions have headroom, which strategies keep producing wins, which directions are exhausted. Real research methodology, emerging from the loop itself.
- The logs are the curriculum: We thought the experiment log was a debugging tool. It turned out to be the most valuable output. Reading how an agent systematically improves a prompt (forming hypotheses, testing them, reasoning about failures) is the best AI engineering education we've found.
- Iteration beats intuition: A prompt that has been through 25 focused experiments will beat a brilliant first draft every time. Not because the agent is smarter, but because it is more patient. It tries the boring changes you would skip.
- The barrier to entry drops: Karpathy's autoresearch needs a GPU and a training script. Ours needs a prompt and a challenge rubric. Same loop, different surface area. No ML expertise or compute infrastructure required.
The bigger picture
The gap between "I have an idea" and "I have results" is shrinking from weeks to hours. The bottleneck is no longer intelligence or tenacity. It is knowing which direction to point the agent.
Versalist challenges provide the direction. The rubric. The test cases. The evaluation framework. autoresearcher provides the engine.
You provide the research program. The agent does the rest while you sleep.
autoresearcher is available now on select Versalist challenges in private beta. Join a challenge to try it.