Agent Training Stack

Inference and compute providers plug into that loop as execution surfaces. They are not the product. The training loop is.

The training loop

A Versalist challenge is not a prompt or a leaderboard entry. It is a repeatable environment where an agent attempts a task, produces evidence, receives a score, and turns the working parts of the run into something reusable.

In Versalist, “training” refers to skill iteration against reward signals, not weight updates. The loop improves the agent-facing operating layer: challenge definitions, rollout traces, judge feedback, reward interpretation, and the reusable skills that guide the next attempt.

A normal run moves through five stages:

Define the challenge. Capture the task, constraints, input data, expected artifacts, allowed tools, and acceptance criteria. Read challenge docs.
Run the rollout. Execute the agent against the challenge. The rollout is the episode where model calls, tool calls, logs, and artifacts are produced.
Judge the result. Use deterministic checks, rubrics, baseline comparisons, or reviewer feedback to decide whether the attempt worked.
Convert judgment into reward. Turn the outcome into structured signal: score, pass/fail, failure mode, trace evidence, and improvement notes.
Update the skill. Promote a pattern only when the rollout evidence supports it. Read skills docs.

What each stage produces

The value of the stack comes from carrying evidence forward. Each stage should produce an artifact the next stage can inspect, replay, or improve.

Challenge

Defines the world the agent can act in — inputs, tools, data, constraints, success criteria. Runs stay comparable across agents and model choices because the boundary is fixed before execution starts, and task quality is inspectable before any run begins.

Rollout

Produces the transcript of what happened: decisions, model calls, tool calls, logs, intermediate files, and final artifacts. This is the evidence judges and skill updates inspect to find where the agent succeeded or drifted.

Reward

Judges and reward logic turn raw behavior into structured feedback that can be compared across attempts. Deterministic tests run where they exist. Rubric-based judgment is kept separate from marketing claims.

Skill update

A successful pattern becomes durable only when the rollout evidence backs it, and only at the scope of the task it actually improved. This is what prevents one-off hacks from being promoted to general guidance and what creates a feedback path from production runs back into skills.

Where providers plug in

Inference clouds and compute clouds become important when a challenge needs a model endpoint, a custom runtime, a GPU job, or a durable artifact path. They are integration surfaces inside the loop, not the headline.

BYOK inference (live)

Provider keys can be stored through Integrations when a workflow needs Versalist to route model calls through a user-managed provider account. Best fit for policy calls, judges, and small rollout workloads. Provider credentials stay separate from Versalist platform API keys.

Custom inference endpoints (planned)

Point Versalist at an OpenAI-compatible or provider-hosted endpoint your team already runs. Intended for teams already operating their own model surface. This page will mark it live once the adapter ships.

Compute runtime adapters (planned)

Cluster and runtime adapters for container-shaped rollouts, GPU jobs, and training workloads. Aimed at long-running episodes and RL-style training loops. A provider appearing in the directory is not proof these adapters exist yet.

Logs, scores, and reproducibility

Even provider-backed work still produces Versalist-owned evidence: run logs, traces, outputs, scores, and the skill change that resulted. The result has to stay inspectable after execution, or the provider integration is just a badge.

One rollout, end to end

A normal run should be easy to explain without naming a partner or benchmark:

The operator selects a challenge with a clear environment, inputs, and judging strategy.
The configured model and tools produce an attempt. Runtime choice is an implementation detail, not the headline.
Judges score the attempt against tests, rubrics, or review signals, producing a reward or failure mode comparable to prior runs.
If the evidence is strong enough, the pattern becomes a skill update. If not, the trace stays useful debugging data.

The system either moves evidence through the loop, or it does not.

Documentation