Versalist Research, May 2026

An Inspectable Substrate for AI Skill Workflows

An episode-grounded trace exposes the behavior behind a skill score to any layer that needs to read it, without storing raw prompts, completions, or transcripts.

Figure 1. The trace-inspection loop. A skill is authored, run against a challenge, and produces a scored episode. On request, a trace (a read-only projection over that episode) lets the evaluator walk back through the calls that produced the score.

Abstract

When a platform reports that an AI skill scored 82 percent, the honest follow-up is operational: based on what? When some other layer proposes an edit to that skill, the question becomes: based on what does this edit improve it? Both questions are asked of the same episode. Versalist’s answer is a trace, a read-only projection of the ordered behavior events that produced the result.

A trace exposes bounded behavior metadata, not raw payloads. It gives any layer with reason to inspect a skill enough structure to walk the episode, compare runs through hashes, and identify incomplete capture, without turning every prompt and completion into stored liability. The substrate is consumer-agnostic by design.

The problem

Scores are useful only when they can be challenged. A rubric breakdown can explain what was judged, but it does not show how the system moved through the work. The score still arrives as a verdict from a pipeline the evaluator has to trust.

Trace inspection changes the artifact under review. The evaluator can walk through the episode step by step: which call fired, whether it completed or failed, which model handled it, how long it took, and how many tokens moved through the call.

Architecture

Externally, the system reduces to two nouns: episode and trace. Run is a verb and a classifier of episode type, not an object. Trajectory is a forward-looking view across episodes rather than a stored entity. The glossary below states each precisely.

Episode (primitive): One scored execution of a skill bundle against a challenge. It owns identity, ownership, step scores, outcome, and reproducibility hashes.
Trace (projection): A read-only view assembled over one episode. Its identity is the episode identifier; there is no separate trace table.
Run (verb only): Useful as a verb and as a classification of episode type, but not a durable object. External language should use episode rather than invent a run entity.
Trajectory (concept only): A future-facing way to describe movement across episodes and events. A view to render at the application layer, not a stored entity.
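
The glossary above can be read as a small data model. The sketch below is illustrative only: the field names (`episode_id`, `step_scores`, `bundle_hash`, and so on) and the `project_trace` helper are assumptions made for this example, not Versalist's schema. It shows the one structural point the glossary insists on: a trace is assembled from an episode on request and carries the episode's identifier rather than one of its own.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Episode:
    """One scored execution of a skill bundle against a challenge (hypothetical fields)."""
    episode_id: str
    owner: str
    step_scores: Tuple[float, ...]
    outcome: str
    bundle_hash: str      # reproducibility hash of the skill bundle
    challenge_hash: str   # reproducibility hash of the challenge
    events: Tuple[Mapping[str, object], ...] = ()  # ordered behavior events, if captured


@dataclass(frozen=True)
class Trace:
    """Read-only projection over one episode; it has no identifier of its own."""
    episode_id: str
    events: Tuple[Mapping[str, object], ...]


def project_trace(episode: Episode) -> Trace:
    """Assemble the trace view on request instead of reading a separate trace table."""
    read_only = tuple(MappingProxyType(dict(e)) for e in episode.events)
    return Trace(episode_id=episode.episode_id, events=read_only)


episode = Episode(
    episode_id="ep-001",
    owner="team-a",
    step_scores=(0.9, 0.74),
    outcome="scored",
    bundle_hash="b" * 64,
    challenge_hash="c" * 64,
    events=({"seq": 1, "type": "model_call", "status": "completed"},),
)
trace = project_trace(episode)
print(trace.episode_id)  # ep-001
```

Wrapping each event in `MappingProxyType` is one way to make the view read-only in Python; any attempt to assign into an event through the trace raises `TypeError`.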

Figure 2. Simplified architecture. The challenge and skill bundle flow into runEpisode, which writes the durable episode and its steps. When capture is enabled, the trace projection is appended and served on request by the read-only viewer. The amber-dashed region indicates default-off, flag-gated behavior.
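
A minimal sketch of the default-off gate described in Figure 2, assuming a boolean capture flag. The names `run_episode`, `capture_trace`, and the in-memory `EpisodeRecord` are hypothetical stand-ins for the `runEpisode` path, whose real signature is not given in this note. The point is only that the durable episode and its steps are written either way, and trace events are recorded only when capture is switched on.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EpisodeRecord:
    episode_id: str
    step_scores: List[float] = field(default_factory=list)
    # None means capture was off; a list (possibly empty) means it was on.
    trace_events: Optional[List[dict]] = None


def run_episode(
    episode_id: str,
    steps: List[Callable[[], float]],
    capture_trace: bool = False,  # default-off, mirroring the flag-gated region
) -> EpisodeRecord:
    record = EpisodeRecord(episode_id=episode_id)
    if capture_trace:
        record.trace_events = []
    for seq, step in enumerate(steps, start=1):
        score = step()
        record.step_scores.append(score)  # always written
        if record.trace_events is not None:
            # Only behavior metadata is recorded, never the step's payload.
            record.trace_events.append(
                {"seq": seq, "type": "model_call", "status": "completed"}
            )
    return record


steps = [lambda: 0.9, lambda: 0.74]
plain = run_episode("ep-001", steps)
traced = run_episode("ep-002", steps, capture_trace=True)
print(plain.trace_events)        # None
print(len(traced.trace_events))  # 2
```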

A trace says nothing about what should be done with it; that is by design. The same substrate can be read by any layer that has reason to ask what an episode did, whether a human reviewer diagnosing a failure, an automated optimizer proposing edits to a skill, an agent harness verifying its own execution against a recorded score, or an evaluation pipeline comparing runs across models. The substrate accommodates them by exposing structure (ordered events, status, reproducibility hashes, and dropped-event counts) and treats what a particular layer does with that structure as that layer’s concern.

Trace events

An episode trace is an ordered list of behavior events. Each event records its type, terminal status, the model that handled it, latency, and input and output token counts. The visible surface is intentionally narrow: enough to reconstruct the shape of scoring behavior, not enough to reconstruct the underlying conversation.

Seq	Event type	Status	Model	Latency	Tokens in / out
01	model_call	completed	gpt-4o	842 ms	1,820 / 420
02	judge_call	completed	rubric evaluator	304 ms	620 / 110
03	model_call	failed	fallback model	1,204 ms	900 / 0

Table 1. Illustrative trace events for one scored episode. The captured surface today is limited to model and judge calls; additional event types exist in the schema but are not yet emitted.
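
Walking a trace of this shape is simple enough to sketch. The example below encodes the three illustrative rows of Table 1; the `TraceEvent` field names and the `summarize` helper are assumptions for illustration, not a published API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceEvent:
    seq: int
    event_type: str
    status: str
    model: str
    latency_ms: int
    tokens_in: int
    tokens_out: int


# The illustrative rows from Table 1.
EVENTS = [
    TraceEvent(1, "model_call", "completed", "gpt-4o", 842, 1820, 420),
    TraceEvent(2, "judge_call", "completed", "rubric evaluator", 304, 620, 110),
    TraceEvent(3, "model_call", "failed", "fallback model", 1204, 900, 0),
]


def summarize(events: List[TraceEvent]) -> dict:
    """Walk the events in sequence order and total what the trace exposes."""
    ordered = sorted(events, key=lambda e: e.seq)
    return {
        "calls": len(ordered),
        "failed_seqs": [e.seq for e in ordered if e.status == "failed"],
        "latency_ms": sum(e.latency_ms for e in ordered),
        "tokens_in": sum(e.tokens_in for e in ordered),
        "tokens_out": sum(e.tokens_out for e in ordered),
    }


summary = summarize(EVENTS)
print(summary["failed_seqs"])  # [3]
print(summary["latency_ms"])   # 2350
```

Everything the summary reports comes from behavior metadata alone; nothing in it depends on the content of a prompt or completion.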

Claim boundaries

To keep claims honest, the trace surface is defined as much by what it refuses to record as by what it shows.

01. Behavior inspection only: the trace shows what the system did, not everything it saw.
02. No raw prompts, raw completions, or full transcripts are stored for this trace surface.
03. Hashes support comparison and reproducibility without turning payloads into warehouse data.
04. Dropped-event counts are visible, because an incomplete trace should say it is incomplete.
05. This is not a tamper-proof audit log, a payload replay system, or a full reward-integrity claim.
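
The hash and dropped-event points can be illustrated together. The sketch below is one possible reading, assuming SHA-256 digests over payloads and a per-trace `dropped_events` counter; the names `payload_hash`, `TraceHeader`, and `compare` are hypothetical. The payload is hashed and then discarded, so two runs can be checked for identical inputs without either payload being stored.

```python
import hashlib
from dataclasses import dataclass


def payload_hash(payload: str) -> str:
    """Return a hex digest; the caller keeps the digest and drops the payload."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TraceHeader:
    episode_id: str
    input_hash: str
    dropped_events: int = 0

    @property
    def complete(self) -> bool:
        # A trace with any dropped events reports itself as incomplete.
        return self.dropped_events == 0


def compare(a: TraceHeader, b: TraceHeader) -> dict:
    return {
        "same_input": a.input_hash == b.input_hash,
        "both_complete": a.complete and b.complete,
    }


run_a = TraceHeader("ep-001", payload_hash("challenge text v1"))
run_b = TraceHeader("ep-002", payload_hash("challenge text v1"), dropped_events=2)
print(compare(run_a, run_b))  # {'same_input': True, 'both_complete': False}
```

A matching digest indicates the two runs saw the same input. It does not make the trace tamper-proof, which is consistent with the last boundary in the list.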

Cite this note

Versalist Research (2026). Inspectable Scores: Episode Traces for AI Skill Evaluation.
