Versalist Research · May 2026

An Inspectable Substrate for AI Skill Workflows

An episode-grounded trace exposes the behavior behind a skill score to any layer that needs to read it, without storing raw prompts, completions, or transcripts.

[Figure 1 diagram: four labeled boxes in a row, connected by arrows. 1. Author skill (policy + instructions); 2. Run on challenge (graded task); 3. Scored episode (durable entity); 4. Read trace (calls behind score). A dashed return arrow underneath: edit the skill and run again.]
Figure 1. The trace-inspection loop. A skill is authored, run against a challenge, and produces a scored episode. On request, a trace (a read-only projection over that episode) lets the evaluator walk back through the calls that produced the score.

1 Abstract

When a platform reports that an AI skill scored 82 percent, the honest follow-up is operational: based on what? When some other layer proposes an edit to that skill, the question becomes: on what evidence does this edit improve it? Both questions are asked of the same episode. Versalist’s answer is a trace, a read-only projection of the ordered behavior events that produced the result.

A trace exposes bounded behavior metadata, not raw payloads. It gives any layer with reason to inspect a skill enough structure to walk the episode, compare runs through hashes, and identify incomplete capture, without turning every prompt and completion into stored liability. The substrate is consumer-agnostic by design.

2 The problem

Scores are useful only when they can be challenged. A rubric breakdown can explain what was judged, but it does not show how the system moved through the work. The score still arrives as a verdict from a pipeline the evaluator has to trust.

Trace inspection changes the artifact under review. The evaluator can walk through the episode step by step: what call fired, whether it completed or failed, which model handled it, how long it took, and how many tokens moved through the call.

3 Architecture

Externally, the system reduces to two nouns: episode and trace. Run is a verb and a classifier of episode type, not an object. Trajectory is a forward-looking view across episodes rather than a stored entity. The glossary below states each precisely.

Episode (primitive)
One scored execution of a skill bundle against a challenge. It carries identity, ownership, step scores, outcome, and reproducibility hashes.

Trace (projection)
A read-only view assembled over one episode. Its identity is the episode identifier; there is no separate trace table.

Run (verb only)
Useful as a verb and as a classification of episode type, but not a durable object. External language should use episode rather than invent a run entity.

Trajectory (concept only)
A future-facing way to describe movement across episodes and events. It is a view to render at the application layer, not a stored entity.

[Figure 2 diagram: Challenge (env · gold · rubric) and Skill bundle (the policy · versioned) flow into runEpisode() in episode-executor.ts, which writes the Episode (run_type · status · score%) and its EpisodeStep[] (dimension scores) in a 1 : N relation. When the capture flag is on, the EpisodeTraceEmitter appends trace_events and an episode_trace_summary. GET /api/v1/traces/[id] assembles an EpisodeTraceDetail, which the EpisodeTracePanel viewer renders, redacted by access mode. trace.id ≡ episode.id.]
Figure 2. Simplified architecture. The challenge and skill bundle flow into runEpisode, which writes the durable episode and its steps. When capture is enabled, the trace projection is appended and served on request by the read-only viewer. The amber-dashed region indicates default-off, flag-gated behavior.
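
The projection in Figure 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped schema: only `EpisodeTraceDetail` and `run_type` appear in the note, and every other type, field, and function name here (including `assembleTrace`) is hypothetical. The point it shows is that the trace has no identity of its own; it is assembled over an episode and reuses the episode id.

```typescript
// Hypothetical shapes; names beyond those shown in Figure 2 are assumptions.
interface EpisodeStep {
  dimension: string;
  score: number;
}

interface Episode {
  id: string;
  run_type: string;
  status: string;
  scorePercent: number;
  steps: EpisodeStep[];
}

interface StoredTraceEvent {
  episodeId: string; // events are keyed by episode; there is no trace table
  seq: number;
  eventType: string;
  status: string;
}

interface EpisodeTraceDetail {
  id: string; // always equal to the episode id
  scorePercent: number;
  events: readonly StoredTraceEvent[];
}

// Assemble a read-only trace view over one episode, on request.
function assembleTrace(
  episode: Episode,
  allEvents: readonly StoredTraceEvent[],
): EpisodeTraceDetail {
  const events = allEvents
    .filter((e) => e.episodeId === episode.id) // scope to this episode
    .sort((a, b) => a.seq - b.seq); // restore capture order
  return Object.freeze({
    id: episode.id,
    scorePercent: episode.scorePercent,
    events: Object.freeze(events),
  });
}

const episode: Episode = {
  id: "ep_123",
  run_type: "evaluation",
  status: "completed",
  scorePercent: 82,
  steps: [{ dimension: "correctness", score: 0.82 }],
};

const stored: StoredTraceEvent[] = [
  { episodeId: "ep_123", seq: 2, eventType: "judge_call", status: "completed" },
  { episodeId: "ep_999", seq: 1, eventType: "model_call", status: "completed" },
  { episodeId: "ep_123", seq: 1, eventType: "model_call", status: "completed" },
];

const trace = assembleTrace(episode, stored);
console.log(trace.id === episode.id, trace.events.map((e) => e.seq));
```

Freezing the returned object is one way to make the read-only intent explicit at runtime; whether the real viewer does this is not stated in the note.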

A trace says nothing about what should be done with it; that is by design. The same substrate can be read by any layer that has reason to ask what an episode did — whether a human reviewer diagnosing a failure, an automated optimizer proposing edits to a skill, an agent harness verifying its own execution against a recorded score, or an evaluation pipeline comparing runs across models. The substrate accommodates them by exposing structure — ordered events, status, reproducibility hashes, dropped-event counts — and treats what a particular layer does with that structure as that layer’s concern.

4 Trace events

An episode trace is an ordered list of behavior events. Each event records its type, terminal status, the model that handled it, latency, and input and output token counts. The visible surface is intentionally narrow: enough to reconstruct the shape of scoring behavior, not enough to reconstruct the underlying conversation.

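| Seq | Event type | Status | Model | Latency | Tokens in / out |
|-----|------------|-----------|------------------|----------|-----------------|
| 01 | model_call | completed | gpt-4o | 842 ms | 1,820 / 420 |
| 02 | judge_call | completed | rubric evaluator | 304 ms | 620 / 110 |
| 03 | model_call | failed | fallback model | 1,204 ms | 900 / 0 |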
Table 1. Illustrative trace events for one scored episode. The captured surface today is limited to model and judge calls; additional event types exist in the schema but are not yet emitted.
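
To make the narrowness of the event surface concrete, the sketch below models one event with exactly the columns of Table 1 and folds the three illustrative rows into totals. The field names (`latencyMs`, `tokensIn`, `tokensOut`) and the `summarize` helper are assumptions for illustration; the note does not publish the real event schema.

```typescript
// Hypothetical event shape mirroring the columns of Table 1.
type EventType = "model_call" | "judge_call";
type EventStatus = "completed" | "failed";

interface TraceEvent {
  seq: number;
  eventType: EventType;
  status: EventStatus;
  model: string;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
}

interface TraceTotals {
  calls: number;
  failed: number;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
}

// Fold the ordered events into totals. No prompt or completion text is needed.
function summarize(events: readonly TraceEvent[]): TraceTotals {
  const acc: TraceTotals = { calls: 0, failed: 0, latencyMs: 0, tokensIn: 0, tokensOut: 0 };
  for (const e of events) {
    acc.calls += 1;
    if (e.status === "failed") acc.failed += 1;
    acc.latencyMs += e.latencyMs;
    acc.tokensIn += e.tokensIn;
    acc.tokensOut += e.tokensOut;
  }
  return acc;
}

// The three illustrative rows from Table 1.
const table1: TraceEvent[] = [
  { seq: 1, eventType: "model_call", status: "completed", model: "gpt-4o", latencyMs: 842, tokensIn: 1820, tokensOut: 420 },
  { seq: 2, eventType: "judge_call", status: "completed", model: "rubric evaluator", latencyMs: 304, tokensIn: 620, tokensOut: 110 },
  { seq: 3, eventType: "model_call", status: "failed", model: "fallback model", latencyMs: 1204, tokensIn: 900, tokensOut: 0 },
];

const totals = summarize(table1);
console.log(totals);
```

For the Table 1 rows this yields 3 calls, 1 failure, 2,350 ms of total latency, and 3,340 tokens in against 530 out. Everything in that summary is derived from metadata alone.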

5 Claim boundaries

To keep claims honest, the trace surface is defined as much by what it refuses to record as by what it shows.

  1. Behavior inspection only: the trace shows what the system did, not everything it saw.
  2. No raw prompts, raw completions, or full transcripts are stored for this trace surface.
  3. Hashes support comparison and reproducibility without turning payloads into warehouse data.
  4. Dropped-event counts are visible, because an incomplete trace should say it is incomplete.
  5. This is not a tamper-proof audit log, a payload replay system, or a full reward-integrity claim.
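
Boundaries 3 and 4 above can be illustrated with a small sketch. It assumes a hypothetical summary record holding precomputed hash digests and a dropped-event count; the note names both concepts but not their field names, the hashing scheme, or what each hash covers, so `TraceSummary`, `hashes`, `droppedEvents`, and the example keys are all invented for illustration.

```typescript
// Hypothetical summary fields; the note names hashes and dropped-event
// counts but not their exact field names or what each hash covers.
interface TraceSummary {
  episodeId: string;
  hashes: Readonly<Record<string, string>>; // digests only, never payloads
  capturedEvents: number;
  droppedEvents: number;
}

// A trace is complete only if nothing was dropped during capture.
function isComplete(summary: TraceSummary): boolean {
  return summary.droppedEvents === 0;
}

// List the hash keys on which two episodes differ, comparing digests only.
function differingHashes(a: TraceSummary, b: TraceSummary): string[] {
  const keys = Object.keys(a.hashes).concat(Object.keys(b.hashes));
  return keys
    .filter((k, i) => keys.indexOf(k) === i) // de-duplicate keys
    .filter((k) => a.hashes[k] !== b.hashes[k])
    .sort();
}

// Illustrative values; the digests are placeholders, not real hashes.
const runA: TraceSummary = {
  episodeId: "ep_a",
  hashes: { skillBundle: "a1f3", challenge: "9c0e" },
  capturedEvents: 3,
  droppedEvents: 0,
};

const runB: TraceSummary = {
  episodeId: "ep_b",
  hashes: { skillBundle: "b772", challenge: "9c0e" },
  capturedEvents: 5,
  droppedEvents: 2,
};

console.log(differingHashes(runA, runB), isComplete(runA), isComplete(runB));
```

Here the two episodes share a challenge digest and differ on the skill bundle digest, so a reader can tell that the skill changed between runs without either payload being stored. The second episode reports two dropped events, so any conclusion drawn from it should be treated as partial.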

6 Shipped versus roadmap

The schema points toward a broader evaluation substrate; this note claims only the surface that can be inspected today. Table 2 separates the two so the note can be defended on its current evidence.

| Capability | Status | Research reading |
|------------|--------|------------------|
| Episode with scored steps | Shipped | Durable entity carrying challenge, skill bundle, score percentage, and dimension scores. |
| Trace projection and viewer | Dark-launched | Served as a read-only projection over an episode and gated behind two default-off flags. |
| Captured event types | Narrow slice | Model calls and judge calls with status, model, latency, and token counts. Additional event types exist in the schema but are not yet emitted. |
| Payload references and per-event reward | Scaffold | Schema direction exists. Payload-backed reward integrity is not a current claim. |
| Multi-episode trajectory | Roadmap concept | Rendered, when needed, as a session view over many episode traces rather than a new stored trace header. |
Table 2. Current product surface compared to schema scaffolding and roadmap concepts.
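
The dark-launched row can be read as a simple gating rule, sketched below. The note says only that two default-off flags gate the trace projection and viewer; the flag names `captureEnabled` and `viewerEnabled`, and the assumption that both must be on before a trace is served, are hypothetical.

```typescript
// Hypothetical flag names; the note says only that two default-off
// flags gate the trace projection and viewer.
interface TraceFlags {
  captureEnabled: boolean;
  viewerEnabled: boolean;
}

// Both flags default to off, so the surface stays dark unless enabled.
const DEFAULT_FLAGS: TraceFlags = { captureEnabled: false, viewerEnabled: false };

// Merge explicit overrides onto the default-off baseline.
function resolveFlags(overrides: Partial<TraceFlags> = {}): TraceFlags {
  return { ...DEFAULT_FLAGS, ...overrides };
}

// Assumption: a trace is served only when both flags are on.
function canServeTrace(flags: TraceFlags): boolean {
  return flags.captureEnabled && flags.viewerEnabled;
}

console.log(canServeTrace(resolveFlags()));
console.log(canServeTrace(resolveFlags({ captureEnabled: true, viewerEnabled: true })));
```

Under this reading, an episode run with defaults is still written and scored as usual; only the trace surface is withheld.
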
Cite this note

Versalist Research (2026). Inspectable Scores: Episode Traces for AI Skill Evaluation.