Agent Training Stack

Versalist is built around the full loop an agent needs to improve: challenges, rollouts, judges, rewards, and skill updates.

Starting point: Challenge
The task, environment rules, allowed tools, and success criteria.

Execution unit: Rollout
One or more agent attempts against the challenge under controlled conditions.

Scoring layer: Judge + reward
Tests, rubrics, traces, and outcome signals turn behavior into feedback.

Improvement unit: Skill update
Reusable knowledge is promoted only after evidence from the run supports it.
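As a rough sketch of how these four pieces fit together, they can be modeled as plain records. The names and fields below (Challenge, Rollout, RewardSignal, SkillUpdate) are illustrative assumptions, not the Versalist schema.

```python
# Illustrative data model only; names and fields are assumptions, not Versalist's schema.
from dataclasses import dataclass, field


@dataclass
class Challenge:
    """Starting point: the task and the rules of the environment."""
    task: str
    environment_rules: list[str]
    allowed_tools: list[str]
    success_criteria: list[str]


@dataclass
class Rollout:
    """Execution unit: one attempt against a challenge, with its evidence."""
    challenge: Challenge
    transcript: list[dict] = field(default_factory=list)  # model calls, tool calls, logs
    artifacts: list[str] = field(default_factory=list)    # paths to produced files


@dataclass
class RewardSignal:
    """Scoring layer: the judged outcome of one rollout."""
    score: float
    passed: bool
    failure_mode: str | None = None
    notes: str = ""


@dataclass
class SkillUpdate:
    """Improvement unit: reusable knowledge promoted from run evidence."""
    pattern: str
    supporting_rollouts: list[Rollout] = field(default_factory=list)
```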

Product architecture

The important product surface is the loop, not any one provider.

Inference and compute providers matter because agents need models and runtime capacity. They plug into the training loop as execution surfaces, but they are not the core story by themselves.

Read integration boundaries

The training loop

A Versalist challenge is not just a prompt or leaderboard entry. It is a repeatable environment where an agent can attempt a task, produce evidence, receive a score, and turn the successful parts of the run into something reusable.

In Versalist, the training loop refers to skill iteration against reward signals, not weight updates. The loop improves the agent-facing operating layer: challenge definitions, rollout traces, judge feedback, reward interpretation, and reusable skills that guide the next attempt.

1. Define the challenge
Capture the task, constraints, input data, expected artifacts, allowed tools, and acceptance criteria.
Read challenge docs

2. Run the rollout
Execute the agent against the challenge. The rollout is the episode where model calls, tool calls, logs, and artifacts are produced.

3. Judge the result
Use deterministic checks, rubrics, baseline comparisons, or reviewer feedback to decide whether the attempt worked.

4. Convert judgment into reward
Turn the outcome into structured signal: score, pass/fail state, failure mode, trace evidence, and improvement notes.

5. Update the skill
Promote useful patterns into a reusable skill only when the rollout evidence supports the change. A sketch of one pass through these steps follows below.
Read skills docs
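Put together, one pass through the loop might look like the self-contained sketch below. Every function, field, and threshold in it is a stand-in invented for illustration; none of it is a Versalist API.

```python
# Minimal, self-contained sketch of one loop pass. All names are illustrative.

def run_rollout(challenge: dict, agent) -> dict:
    """Step 2: execute the agent and collect the trace (stubbed)."""
    output = agent(challenge["task"])
    return {"output": output, "transcript": [{"call": "agent", "result": output}]}


def judge_rollout(challenge: dict, rollout: dict) -> dict:
    """Step 3: a deterministic check against the acceptance criterion (stubbed)."""
    passed = challenge["expected"] in rollout["output"]
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "failure_mode": None if passed else "missing expected artifact",
    }


def loop_pass(challenge: dict, agent, skills: list) -> dict:
    rollout = run_rollout(challenge, agent)                    # step 2: rollout
    verdict = judge_rollout(challenge, rollout)                # step 3: judge
    reward = {**verdict, "evidence": rollout["transcript"]}    # step 4: reward signal
    if reward["passed"]:                                       # step 5: gated skill update
        skills.append({"pattern": rollout["output"], "evidence": reward["evidence"]})
    return reward


# Step 1 (define the challenge) happens before any of this runs.
challenge = {"task": "echo the word done", "expected": "done"}
skills: list = []
print(loop_pass(challenge, lambda task: "done", skills))
```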

What each stage produces

The value of the stack comes from carrying evidence forward. Each stage should produce an artifact that the next stage can inspect, replay, or improve.

Environment: Task and operating boundary
The challenge defines the world the agent is allowed to act in: inputs, tools, data, constraints, and success criteria.
- Keeps runs comparable across agents and model choices
- Makes task quality inspectable before execution starts

Rollout: Execution trace
A rollout produces the transcript of what happened: decisions, model calls, tool calls, logs, intermediate files, and final artifacts.
- Shows where the agent succeeded or drifted
- Gives judges and skill updates concrete evidence to inspect

Reward: Outcome signal
Judges and reward logic turn raw behavior into structured feedback that can be compared across attempts.
- Supports deterministic tests when they exist
- Keeps rubric-based judgment separate from marketing claims

Skill update: Reusable improvement
A successful pattern becomes durable only after it is backed by rollout evidence and scoped to the task it actually improved.
- Prevents one-off hacks from becoming general guidance
- Creates a feedback path from production runs back into skills
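The promotion gate in the last row can be sketched as a small check that refuses to generalize a pattern until more than one passing rollout cites it. The record shape and the two-win threshold below are assumptions made for illustration.

```python
# Illustrative promotion gate; the evidence threshold and record shape are assumptions.

def should_promote(pattern: str, rewards: list[dict], min_wins: int = 2) -> bool:
    """Promote a pattern only when enough independent passing rollouts cite it."""
    supporting = [r for r in rewards if r["passed"] and pattern in r.get("notes", "")]
    return len(supporting) >= min_wins


rewards = [
    {"passed": True, "notes": "retry-on-timeout recovered the tool call"},
    {"passed": True, "notes": "retry-on-timeout recovered the tool call"},
    {"passed": False, "notes": "retry-on-timeout did not apply"},
]
print(should_promote("retry-on-timeout", rewards))  # True: two passing rollouts back it
```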

Where providers plug in

Inference clouds and compute clouds are integration surfaces inside the loop. They become important when a challenge needs a model endpoint, a custom runtime, a GPU job, or a durable artifact path.

Live surface: BYOK inference
Provider keys can be stored through Integrations when a workflow needs Versalist to route model calls through a user-managed provider account.
- Best fit for policy calls, judges, and small rollout workloads
- Provider credentials stay separate from Versalist platform API keys
Open integrations docs

RFC stage: Custom inference endpoints
The D1 roadmap covers a bring-your-own-endpoint path for OpenAI-compatible or provider-hosted model surfaces.
- Useful when the team already operates a model endpoint
- Not treated as live runtime support until the adapter is shipped

RFC stage: Compute runtime adapters
The D2+ roadmap covers cluster and runtime adapters for container-shaped rollouts, GPU jobs, and training workloads.
- Relevant for long-running episodes and RL-style training loops
- Provider directory entries are not proof that these adapters exist

Artifact layer: Logs, scores, and reproducibility
Provider-backed work still needs Versalist-owned evidence: run logs, traces, outputs, scores, and the skill change that resulted.
- Keeps the challenge result inspectable after execution
- Turns provider capacity into product evidence rather than a badge
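One way to keep the live-versus-RFC distinction straight is to hold it as data. The structure below only restates the four surfaces above; the field names are illustrative, not a Versalist configuration format.

```python
# Illustrative summary of the provider surfaces described above.
# Not a Versalist configuration format; keys and fields are assumptions.
provider_surfaces = {
    "byok_inference": {
        "status": "live",
        "credential": "user-managed provider key stored via Integrations",
        "good_for": ["policy calls", "judges", "small rollout workloads"],
    },
    "custom_inference_endpoints": {
        "status": "rfc (D1)",
        "note": "bring-your-own endpoint for OpenAI-compatible or hosted models",
    },
    "compute_runtime_adapters": {
        "status": "rfc (D2+)",
        "note": "cluster/runtime adapters for container rollouts, GPU jobs, training",
    },
    "artifact_layer": {
        "status": "always required",
        "evidence": ["run logs", "traces", "outputs", "scores", "skill change"],
    },
}

# A planning check an operator might make before building a challenge:
assert provider_surfaces["byok_inference"]["status"] == "live"
```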

One rollout walkthrough

A normal end-to-end run should be easy to explain without naming a partner or benchmark. The system either moves evidence through the loop or it does not.

1. A challenge is selected
The operator chooses a task with a clear environment, inputs, and judging strategy.

2. The agent runs against it
The configured model and tools produce an attempt. Runtime choice is an implementation detail, not the headline.

3. Judges score the attempt
Tests, rubrics, and review signals produce a reward or failure mode that can be compared to prior runs.

4. The useful lesson is promoted
If the evidence is strong enough, the pattern becomes a skill update. If not, the trace remains useful debugging data.
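Most of the ambiguity in a walkthrough like this shows up at the judging step. A minimal sketch of combining deterministic checks with a weighted rubric into one comparable signal is shown below; the tests, rubric items, and pass threshold are invented for illustration.

```python
# Illustrative judge: deterministic checks first, rubric scoring as a fallback.
# The checks, rubric items, and 0.7 threshold are assumptions for this example.

def judge_attempt(output: str, tests: list, rubric: dict[str, tuple]) -> dict:
    # Deterministic layer: every test is a named predicate over the attempt's output.
    failures = [name for name, check in tests if not check(output)]
    if failures:
        return {"passed": False, "score": 0.0, "failure_mode": f"failed: {failures}"}

    # Rubric layer: weighted criteria, each graded 0..1 by a reviewer or grading function.
    score = sum(weight * grade(output) for weight, grade in rubric.values())
    return {"passed": score >= 0.7, "score": round(score, 2), "failure_mode": None}


tests = [
    ("non_empty", lambda out: bool(out.strip())),
    ("mentions_result", lambda out: "result" in out),
]
rubric = {
    "clarity":      (0.5, lambda out: 1.0 if len(out) < 200 else 0.5),
    "completeness": (0.5, lambda out: 1.0),
}
print(judge_attempt("result: 42", tests, rubric))
```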