Read the source. Install what you trust.
Each skill bundle packages a reusable agent behavior: a prompt, supporting files, and evaluation criteria. Browse the public catalog, review the full source, then install a private copy you can edit and experiment with.
Browse bundles
108 published bundles ready to inspect and install
Eval Contamination Prevention
Ensure training data and eval data don't overlap
Adversarial Eval Generation
Create evals specifically designed to find failure modes and edge cases
Eval Saturation Detection
Identify when a model has saturated an eval and needs harder or different benchmarks
Eval Coverage Analysis
Measure whether your eval suite covers the actual distribution of production tasks
Build Fuzzy Eval
Design evals for tasks with multiple valid solutions (writing, design, open-ended code)
Build Deterministic Eval
Create evals with unambiguous, programmatically verifiable correct answers
Outcome vs. Process Reward Tradeoff
When to reward final results vs. intermediate steps, and how to blend both
Reward Calibration
Ensure reward functions produce consistent, well-scaled signals across different task types and difficulties
Human Feedback Collection
Design interfaces and protocols for collecting human preference judgments at scale
Reward Hacking Detection
Identify when agents exploit reward function loopholes to get high scores without doing the task correctly
Reward Shaping
Add intermediate reward signals that guide learning without changing the optimal policy
Process Reward Modeling
Score intermediate reasoning steps, not just final outcomes