An Inspectable Substrate for AI Skill Workflows
An episode-grounded trace exposes the behavior behind a skill score to any layer that needs to read it, without storing raw prompts, completions, or transcripts.
1 Abstract
When a platform reports that an AI skill scored 82 percent, the honest follow-up is operational: based on what? When another layer proposes an edit to that skill, the question becomes: on what evidence does this edit improve it? Both questions are asked of the same episode. Versalist’s answer is a trace, a read-only projection of the ordered behavior events that produced the result.
A trace exposes bounded behavior metadata, not raw payloads. It gives any layer with reason to inspect a skill enough structure to walk the episode, compare runs through hashes, and identify incomplete capture, without turning every prompt and completion into stored liability. The substrate is consumer-agnostic by design.
2 The problem
Scores are useful only when they can be challenged. A rubric breakdown can explain what was judged, but it does not show how the system moved through the work. The score still arrives as a verdict from a pipeline the evaluator has to trust.
Trace inspection changes the artifact under review. The evaluator can walk through the episode step by step: what call fired, whether it completed or failed, which model handled it, how long it took, and how many tokens moved through the call.
3 Architecture
Externally, the system reduces to two nouns: episode and trace. Run is a verb and a classifier of episode type, not an object. Trajectory is a forward-looking view across episodes rather than a stored entity. The glossary below states each precisely.
- Episode (primitive): One scored execution of a skill bundle against a challenge. It carries identity, ownership, step scores, outcome, and reproducibility hashes.
- Trace (projection): A read-only view assembled over one episode. Its identity is the episode identifier; there is no separate trace table.
- Run (verb only): Useful as a verb and as a classification of episode type, but not a durable object. External language should use episode rather than invent a run entity.
- Trajectory (concept only): A future-facing way to describe movement across episodes and events. It is a view to render at the application layer, not a stored entity.
A trace says nothing about what should be done with it; that is by design. The same substrate can be read by any layer that has reason to ask what an episode did: a human reviewer diagnosing a failure, an automated optimizer proposing edits to a skill, an agent harness verifying its own execution against a recorded score, or an evaluation pipeline comparing runs across models. The substrate accommodates them by exposing structure (ordered events, status, reproducibility hashes, dropped-event counts) and treats what a particular layer does with that structure as that layer’s concern.
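The projection idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Versalist's implementation: the names `Episode`, `Trace`, `project_trace`, and `first_failure`, and the dictionary shape of an event, are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Iterable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Episode:
    """Hypothetical stand-in for the durable, scored episode record."""
    episode_id: str
    score_pct: float


@dataclass(frozen=True)
class Trace:
    """Read-only view; its identity is the episode identifier."""
    episode_id: str
    events: Tuple[Mapping[str, object], ...]
    dropped_events: int


def project_trace(episode: Episode,
                  raw_events: Iterable[Mapping[str, object]],
                  dropped_events: int = 0) -> Trace:
    """Assemble a trace on demand; nothing is written back to storage."""
    ordered = sorted(raw_events, key=lambda e: e["seq"])
    # MappingProxyType wraps a copy of each event so consumers cannot mutate it.
    frozen = tuple(MappingProxyType(dict(e)) for e in ordered)
    return Trace(episode.episode_id, frozen, dropped_events)


def first_failure(trace: Trace) -> Optional[Mapping[str, object]]:
    """One possible consumer: a reviewer looking for the first failed step."""
    return next((e for e in trace.events if e["status"] == "failed"), None)
```

The projection function knows nothing about `first_failure`; an optimizer or evaluation pipeline would read the same `Trace` value through its own function.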
4 Trace events
An episode trace is an ordered list of behavior events. Each event records its type, terminal status, the model that handled it, latency, and input and output token counts. The visible surface is intentionally narrow: enough to reconstruct the shape of scoring behavior, not enough to reconstruct the underlying conversation.
| Seq | Event type | Status | Model | Latency | Tokens in / out |
|---|---|---|---|---|---|
| 01 | model_call | completed | gpt-4o | 842 ms | 1,820 / 420 |
| 02 | judge_call | completed | rubric evaluator | 304 ms | 620 / 110 |
| 03 | model_call | failed | fallback model | 1,204 ms | 900 / 0 |
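The three rows above can be encoded as records to show what walking an episode looks like in practice. The sketch below is illustrative only: the `TraceEvent` field names and the `summarize` helper are assumptions, and summing latencies assumes the calls ran one after another.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical record mirroring the table columns; no payload fields."""
    seq: int
    event_type: str
    status: str
    model: str
    latency_ms: int
    tokens_in: int
    tokens_out: int


# The three example events from the table above.
EVENTS = (
    TraceEvent(1, "model_call", "completed", "gpt-4o", 842, 1820, 420),
    TraceEvent(2, "judge_call", "completed", "rubric evaluator", 304, 620, 110),
    TraceEvent(3, "model_call", "failed", "fallback model", 1204, 900, 0),
)


def summarize(events: Iterable[TraceEvent]) -> Dict[str, object]:
    """Walk the events in sequence order and report the episode's shape."""
    ordered = sorted(events, key=lambda e: e.seq)
    return {
        "events": len(ordered),
        "failed_seqs": [e.seq for e in ordered if e.status == "failed"],
        "total_latency_ms": sum(e.latency_ms for e in ordered),
        "tokens_in": sum(e.tokens_in for e in ordered),
        "tokens_out": sum(e.tokens_out for e in ordered),
    }


summary = summarize(EVENTS)
```

For these rows, `summary` reports three events, one failure at sequence 3, 2,350 ms of total latency, and 3,340 input and 530 output tokens. None of those figures requires access to a prompt or completion.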
5 Claim boundaries
To keep claims honest, the trace surface is defined as much by what it refuses to record as by what it shows.
- Behavior inspection only: the trace shows what the system did, not everything it saw.
- No raw prompts, raw completions, or full transcripts are stored for this trace surface.
- Hashes support comparison and reproducibility without turning payloads into warehouse data.
- Dropped-event counts are visible, because an incomplete trace should say it is incomplete.
- This is not a tamper-proof audit log, a payload replay system, or a full reward-integrity claim.
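Two of the boundaries above, hash-based comparison and visible dropped-event counts, lend themselves to a short sketch. Everything here is an assumption made for illustration: the source does not say which hash function is used or what exactly is hashed, so SHA-256 over an input string stands in, and `TraceSummary`, `digest`, and `same_inputs` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def digest(payload: str) -> str:
    """Return a SHA-256 hex digest; the payload itself is never stored."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical summary carrying a hash and capture counts, no payloads."""
    episode_id: str
    input_hash: str
    captured_events: int
    dropped_events: int

    @property
    def complete(self) -> bool:
        # A trace with dropped events reports itself as incomplete.
        return self.dropped_events == 0


def same_inputs(a: TraceSummary, b: TraceSummary) -> bool:
    """Compare two runs by hash alone, without reading either payload."""
    return a.input_hash == b.input_hash


run_a = TraceSummary("ep-1", digest("example input"), 3, 0)
run_b = TraceSummary("ep-2", digest("example input"), 2, 1)
```

Here `same_inputs(run_a, run_b)` holds because both digests derive from the same string, while `run_b.complete` is false because one event was dropped. A matching hash supports comparison; it is not evidence of tamper resistance.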
6 Shipped versus roadmap
The schema points toward a broader evaluation substrate; this note claims only the surface that can be inspected today. Table 2 separates the two so the note can be defended on its current evidence.
| Capability | Status | Research reading |
|---|---|---|
| Episode with scored steps | Shipped | Durable entity carrying challenge, skill bundle, score percentage, and dimension scores. |
| Trace projection and viewer | Dark-launched | Served as a read-only projection over an episode and gated behind two default-off flags. |
| Captured event types | Narrow slice | Model calls and judge calls with status, model, latency, and token counts. Additional event types exist in the schema but are not yet emitted. |
| Payload references and per-event reward | Scaffold | Schema direction exists. Payload-backed reward integrity is not a current claim. |
| Multi-episode trajectory | Roadmap concept | Rendered, when needed, as a session view over many episode traces rather than a new stored trace header. |
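The roadmap row for trajectories describes a view computed over existing episode traces rather than a new stored object. A minimal sketch of that idea follows; the `session_id` and `started_at` fields and the `session_view` function are hypothetical and not part of the shipped schema.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class EpisodeTrace:
    """Hypothetical per-episode trace header used only for this sketch."""
    episode_id: str
    session_id: str   # assumed grouping key
    started_at: int   # assumed ordering key, e.g. a Unix timestamp
    score_pct: float


def session_view(traces: Iterable[EpisodeTrace],
                 session_id: str) -> Tuple[EpisodeTrace, ...]:
    """Render a trajectory on demand: filter by session, order by start time."""
    selected = (t for t in traces if t.session_id == session_id)
    return tuple(sorted(selected, key=lambda t: t.started_at))


traces = [
    EpisodeTrace("ep-2", "s-1", 200, 82.0),
    EpisodeTrace("ep-9", "s-2", 150, 40.0),
    EpisodeTrace("ep-1", "s-1", 100, 64.0),
]
trajectory = session_view(traces, "s-1")
```

The result is a plain tuple of existing episode traces, recomputed whenever it is needed, so no trajectory record has to be stored or kept in sync.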
Versalist Research (2026). Inspectable Scores: Episode Traces for AI Skill Evaluation.