Inspect the public source, then install a private draft when it earns it.
Public skill bundles are the reusable execution layer behind My Skills. Review the published source first, then install a private copy for edits, experiments, and self-improvement.
Browse bundles
108 published bundles ready to inspect and install
RL Failure Postmortem
Diagnose why an RL training run failed and what to change
RL vs. Prompting Decision
Determine when prompt engineering, fine-tuning, or RL is the right approach
RL Cost Estimation
Estimate total cost (compute, data, engineering time) for an RL project
RL Paper Reading
Read and critically evaluate RL research papers, extract practical implications
RL Experiment Design
Plan RL experiments: baselines, ablations, compute budgets, success criteria
Document Processing Env
Environments for extraction, classification, and transformation of business documents
Ticket Triage Env
Environments for support ticket routing, prioritization, and resolution
CRM Workflow Env
Environments mimicking CRM operations (Salesforce, HubSpot)
Data Pipeline Env
Environments for building and debugging ETL/ELT pipelines
SQL Generation RL Env
Build environments where agents write SQL, execute it, and get scored on result correctness
UI Task Specification
Formally specify UI tasks with clear start states, goal states, and evaluation criteria
Pixel vs. DOM Action Space
Trade-offs between pixel-level interaction and DOM-level interaction for UI agents
Browser Env Construction
Build instrumented browser environments with action logging and state capture
Repo Level Coding Env
Build environments where agents navigate and modify entire repositories, not just single files
Test Generation as Reward
Use test pass rates as automatic reward signals for code generation
Code Review Reward Design
Score code changes on correctness, style, security, and performance
Code Completion RL Env
Build environments for training code completion models (à la Cursor's online RL)
Experience Replay Management
Maintain and curate experience replay buffers for continual RL training
Distribution Shift Detection
Detect when the production task distribution has drifted from the training distribution
Catastrophic Forgetting Mitigation
Prevent RL training from destroying previously learned capabilities
Online RL from Production
Set up learning loops where production experience feeds back into training
Capability Regression Testing
Run broad capability evals before and after RL training to catch degradation
Overfitting Detection for RL
Detect when RL training narrows capability (great on trained tasks, worse on everything else)
Domain Transfer Measurement
Quantify how much RL training on coding transfers to (say) data analysis or writing
Transfer Eval Design
Build evals that test whether RL training on task A improved performance on related task B
Risk Tier Classification
Classify agent skills by risk level (read-only vs. write vs. financial vs. external-facing) and apply appropriate controls
Audit Trail for RL Decisions
Log every decision an RL agent makes in production with sufficient context for post-hoc review
Deployment Gating Pipeline
Build eval-gated deployment pipelines where RL-trained models must pass benchmarks before production
Data Exfiltration Prevention
Monitor and prevent agents from leaking sensitive data through tool calls
Skill Security Audit
Static and dynamic analysis of agent skill code for security vulnerabilities
Deceptive Alignment Detection
Test whether agents behave differently when they believe they're being evaluated vs. not
RL Alignment Auditing
Verify that the policy optimizes for the intended objective, not a proxy
Action Space Sandboxing
Restrict agent actions to prevent irreversible or harmful operations
Safe Exploration Constraints
Define and enforce hard constraints on what agents can do during training rollouts
Reward Hacking Red Teaming
Systematically find ways an agent could game the reward function
Rollback and Versioning
Maintain and switch between agent versions when new RL training degrades performance
Production Monitoring for RL Agents
Monitor deployed RL-trained agents for performance drift, reward hacking in the wild, and distribution shift
Continual Learning Pipeline
Set up recurring RL training loops that retrain as the workflow or data distribution shifts
RL ROI Measurement
Quantify the business impact (time saved, error reduction, cost) of RL-trained agents
A/B Test RL Policy
Design and run A/B tests comparing RL-trained agent vs. baseline in production
Environment Reset Engineering
Build reliable, fast environment reset mechanisms for episode boundaries
Environment Fidelity Validation
Verify that the sandbox environment faithfully reproduces production behavior
Client Data Onboarding
Ingest, clean, and transform client data into RL-ready formats
Mock Production System
Build a faithful replica of a client's production system (APIs, DB, auth) for safe RL training
Data Readiness Assessment
Evaluate whether the client has sufficient trajectory data, or whether collection needs to happen first
Success Metric Extraction
Work with stakeholders to convert vague "it should work better" into measurable, scorable outcomes
Baseline Agent Benchmarking
Measure current agent performance on the target workflow before RL intervention
RL Feasibility Assessment
Determine whether a workflow is actually amenable to RL improvement (clear rewards, sufficient volume, safe to explore)
Workflow Audit
Map an enterprise workflow end-to-end: inputs, decisions, tools, outputs, success criteria
Model Versioning for RL
Track and switch between reference model, current policy, and reward model versions during training
vLLM for RL
Configure vLLM or similar engines for RL workloads (batched generation, multiple completions)
High Throughput Rollout Serving
Serve models at high throughput for RL rollout collection (not just user-facing latency)
Compute Budgeting for RL
Estimate and optimize GPU hours needed for RL training runs
Checkpoint Selection
Choose the best model checkpoint based on eval performance, not just training metrics
Training Stability Debugging
Diagnose and fix common RL training failures: reward collapse, mode collapse, KL explosion
KL Divergence Management
Control how far the policy drifts from the reference model during training
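As a taste of what this bundle covers: one common control is an adaptive KL penalty coefficient that tightens when the measured policy-to-reference KL overshoots a target and relaxes when it undershoots. A minimal sketch (the threshold band and update factor are illustrative defaults, not a prescribed recipe):

```python
def adaptive_kl_coef(current_kl, target_kl, coef, factor=1.5):
    """Adaptive KL penalty controller (PPO-style): raise the coefficient
    when the policy has drifted past the target band, lower it when the
    policy is well inside it, otherwise leave it alone."""
    low = target_kl / 1.5    # under this, the penalty is relaxed
    high = target_kl * 1.5   # over this, the penalty is tightened
    if current_kl > high:
        coef *= factor
    elif current_kl < low:
        coef /= factor
    return coef

# Drifting too far (KL 0.2 vs. target 0.05) raises the coefficient.
tightened = adaptive_kl_coef(0.2, 0.05, 1.0)
# Staying well under target lowers it.
relaxed = adaptive_kl_coef(0.01, 0.05, 1.0)
```

The band (here target ± 50%) adds hysteresis so the coefficient does not oscillate on every batch.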
Reward Model Training
Train reward models from human preference data, handle label noise and distribution shift
RL Hyperparameter Tuning
Tune learning rates, KL penalties, reward scaling, batch sizes for RL stability
Distributed RL Training
Shard training across multiple GPUs/nodes with proper gradient synchronization
Manage RL Rollouts
Orchestrate parallel agent rollouts across environments at scale
Implement Constitutional AI
Self-critique and revision loops using model-generated feedback
Implement RLHF Pipeline
End-to-end: collect preferences → train reward model → optimize policy
Online vs. Offline RL Tradeoffs
When to use online rollouts vs. offline datasets, and how to blend
Implement Rejection Sampling
Best-of-N sampling with a reward model; simplest "RL" that actually works
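The core of this bundle fits in a few lines: draw N candidates and keep the highest-scoring one. In this sketch `sample_fn` and `reward_fn` are hypothetical stand-ins for your policy sampler and reward model:

```python
import random

def best_of_n(prompt, sample_fn, reward_fn, n=8, seed=0):
    """Best-of-N rejection sampling: draw n candidates from the policy,
    score each with the reward model, return the argmax candidate."""
    rng = random.Random(seed)
    candidates = [sample_fn(prompt, rng) for _ in range(n)]
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda t: t[0])
    return best, best_score

# Toy demo: stand-ins for a real sampler and reward model.
def toy_sample(prompt, rng):
    return prompt + " " + rng.choice(["draft-a", "draft-b", "draft-c"])

def toy_reward(prompt, completion):
    return len(completion)  # placeholder scoring rule for the demo

best, score = best_of_n("fix the bug:", toy_sample, toy_reward, n=4)
```

The filtered (prompt, best completion) pairs then become fine-tuning data, which is why this is the simplest "RL" that works: no gradient ever flows through a policy-optimization objective.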
Implement REINFORCE with Baseline
Classic REINFORCE with variance reduction, the foundation of policy gradient methods
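A self-contained toy of the idea, assuming nothing beyond a two-armed bandit: the softmax policy gradient is `(one_hot(a) - pi) * advantage`, and subtracting a running-mean baseline from the reward reduces variance without biasing the gradient. The hyperparameters here are illustrative only:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-mean baseline on a 2-armed bandit.
    For a softmax policy, grad log pi(a) = one_hot(a) - pi."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1
        r = rewards[a] + rng.gauss(0, 0.1)   # noisy reward
        baseline += 0.05 * (r - baseline)    # variance-reducing baseline
        advantage = r - baseline
        for i in range(2):
            grad_logp = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_logp
    return softmax(logits)

probs = reinforce_bandit()
```

After training, the policy should concentrate probability on the higher-reward arm; the baseline never changes the expected gradient, only its variance.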
Implement GRPO
Build Group Relative Policy Optimization as used in DeepSeek-R1
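The distinctive piece of GRPO is small enough to show inline: instead of a learned value function, each completion's advantage is its reward standardized against the other completions sampled for the same prompt. A minimal sketch of that group-relative step:

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: standardize each completion's
    reward against its own group's mean and std. No critic needed."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four completions of one prompt, two of which passed (reward 1.0):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

These advantages then plug into a PPO-style clipped objective; the group standardization is what removes the value network from the pipeline.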
Implement DPO
Build Direct Preference Optimization, understand when it outperforms PPO
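The loss this bundle builds up to, for one preference pair, using the identity -log sigmoid(x) = log(1 + e^-x):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(policy - ref) log-ratio gap, chosen vs. rejected])."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-beta * margin))

# At initialization (policy == reference) the margin is 0 and loss is log 2.
init_loss = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Shifting mass toward the chosen response lowers the loss.
improved_loss = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

No reward model and no rollouts are involved, which is why DPO can outperform PPO when preference data is plentiful but online sampling is expensive.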
Implement PPO
Build Proximal Policy Optimization from scratch, understand clipping and advantage estimation
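The clipping at the heart of PPO, for a single sample, as a hedged sketch (advantage estimation, batching, and the value loss are the rest of the bundle):

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), r = exp(logp_new - logp_old).
    The min makes the objective pessimistic: large ratio moves only hurt."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

# Ratio 1 (no policy change): objective is just the advantage.
no_change = ppo_clip_term(0.0, 0.0, 1.0)
# Large ratio with positive advantage: gain is capped at (1 + eps) * A.
capped = ppo_clip_term(1.0, 0.0, 1.0)
```

Note the asymmetry: with a negative advantage and a large ratio, the unclipped term is the minimum, so the penalty for a bad large step is never capped.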
Eval From Production Failures
Convert real production failures into new eval cases automatically
Multi-Model Eval Harness
Run the same eval suite across Haiku/Sonnet/Opus (or GPT-4/Claude/Gemini) and compare
Eval Versioning and Regression
Track eval suite changes over time, detect regressions when evals are updated
Domain-Specific Eval Design
Build evals for specialized verticals (legal, medical, finance, engineering)
Eval Contamination Prevention
Ensure training data and eval data don't overlap
Adversarial Eval Generation
Create evals specifically designed to find failure modes and edge cases
Eval Saturation Detection
Identify when a model has maxed out an eval and needs harder/different benchmarks
Eval Coverage Analysis
Measure whether your eval suite covers the actual distribution of production tasks
Build Fuzzy Eval
Design evals for tasks with multiple valid solutions (writing, design, open-ended code)
Build Deterministic Eval
Create evals with unambiguous, programmatically verifiable correct answers
Outcome vs. Process Reward Tradeoff
When to reward final results vs. intermediate steps, and how to blend both
Reward Calibration
Ensure reward functions produce consistent, well-scaled signals across different task types and difficulties
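One concrete calibration technique this bundle might start from: standardize rewards within each task type, so a type whose raw scores happen to run 10–20 cannot dominate one whose scores run 0.1–0.2. A minimal sketch:

```python
import math

def calibrate_by_task_type(rewards):
    """Per-task-type z-scoring: (r - mean_type) / std_type, so every task
    type contributes on the same scale regardless of its raw reward range."""
    out = {}
    for task_type, rs in rewards.items():
        n = len(rs)
        mean = sum(rs) / n
        std = math.sqrt(sum((r - mean) ** 2 for r in rs) / n) or 1.0
        out[task_type] = [(r - mean) / std for r in rs]
    return out

# Raw scales differ by 100x; calibrated signals land on the same scale.
calibrated = calibrate_by_task_type({"sql": [10.0, 20.0], "ui": [0.1, 0.2]})
```

In practice the per-type statistics would come from a held-out calibration set rather than the batch being scored, to avoid leaking the batch into its own normalization.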
Human Feedback Collection
Design interfaces and protocols for collecting human preference judgments at scale
Reward Hacking Detection
Identify when agents exploit reward function loopholes to get high scores without doing the task correctly
Reward Shaping
Add intermediate reward signals that guide learning without changing the optimal policy
Process Reward Modeling
Score intermediate reasoning steps, not just final outcomes
Composite Reward Design
Combine multiple reward signals (correctness, efficiency, style, safety) into a single scalar
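The simplest combiner is a weighted sum; the useful engineering detail is failing loudly when signals and weights drift out of sync. A sketch (the signal names and weights are illustrative):

```python
def composite_reward(signals, weights):
    """Weighted sum of named reward signals into one scalar. Requires the
    signal and weight keys to match exactly, so a renamed or missing
    signal raises instead of silently dropping a term."""
    missing = set(signals) ^ set(weights)
    if missing:
        raise KeyError(f"signal/weight mismatch: {sorted(missing)}")
    return sum(weights[k] * signals[k] for k in signals)

r = composite_reward(
    {"correctness": 1.0, "efficiency": 0.4, "style": 0.7, "safety": 1.0},
    {"correctness": 0.6, "efficiency": 0.1, "style": 0.1, "safety": 0.2},
)
```

Keeping weights summing to 1 keeps the composite on the same scale as its components, which makes later recalibration and reward-hacking audits easier.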
LLM-as-Judge Reward
Use a language model to score agent outputs against specifications or rubrics
Graded Rubric Reward
Translate qualitative rubrics into multi-dimensional scoring functions with partial credit
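One way to mechanize partial credit, sketched with a hypothetical two-criterion rubric: each criterion maps observed levels to points, and the score is earned points over the maximum attainable:

```python
def rubric_score(answers, rubric):
    """Multi-dimensional rubric scoring with partial credit. Each criterion
    maps level labels to points; an unrecognized or missing level earns 0.
    Returns earned / max-attainable, in [0, 1]."""
    earned = 0.0
    possible = 0.0
    for criterion, levels in rubric.items():
        possible += max(levels.values())
        earned += levels.get(answers.get(criterion), 0.0)
    return earned / possible if possible else 0.0

# Hypothetical rubric: correctness is worth 3x clarity.
rubric = {
    "correctness": {"full": 3.0, "partial": 1.5, "none": 0.0},
    "clarity": {"clear": 1.0, "unclear": 0.0},
}
score = rubric_score({"correctness": "partial", "clarity": "clear"}, rubric)
```

In an LLM-as-judge setup, the judge emits the level labels and this function turns them into the scalar reward, keeping the point weights auditable outside the prompt.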
Binary Outcome Reward
Design pass/fail reward signals (code compiles, test passes, form submitted correctly)
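The key design decision in a pass/fail reward is what counts as failure; treating a crash the same as a wrong answer keeps the signal honest. A minimal sketch, using "the code compiles" as the example check:

```python
def binary_reward(check_fn, *args):
    """Pass/fail reward: 1.0 if the check runs and returns truthy, else 0.0.
    Exceptions count as failure, so a crashing submission never scores."""
    try:
        return 1.0 if check_fn(*args) else 0.0
    except Exception:
        return 0.0

def compiles(source):
    """Example check: does the submitted source parse as valid Python?"""
    compile(source, "<submission>", "exec")
    return True

r_ok = binary_reward(compiles, "def add(a, b):\n    return a + b\n")
r_bad = binary_reward(compiles, "def add(a, b) return a + b")
```

The same wrapper pattern covers "test passes" or "form submitted correctly" by swapping in the appropriate check function.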
Offline Dataset Curation
Build high-quality static datasets from historical trajectories for offline RL or behavior cloning
Trajectory Format Standardization
Convert heterogeneous log formats into a unified trajectory schema (state, action, reward, metadata)
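A sketch of the target shape, assuming a hypothetical flat log format with mixed key names (`obs` vs. `state`): normalize everything into explicit step records and push unrecognized fields into metadata rather than dropping them:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    state: Any
    action: Any
    reward: float
    next_state: Any
    metadata: dict = field(default_factory=dict)

@dataclass
class Trajectory:
    episode_id: str
    steps: list

_KNOWN = {"obs", "state", "action", "reward", "next_obs", "next_state"}

def from_flat_log(episode_id, records):
    """Normalize a flat log of dicts with inconsistent key names into the
    unified schema. Unknown keys are preserved in per-step metadata.
    (Truthiness-based fallback is a sketch shortcut; real logs with falsy
    states need explicit key checks.)"""
    steps = [
        Step(
            state=r.get("obs") or r.get("state"),
            action=r["action"],
            reward=float(r.get("reward", 0.0)),
            next_state=r.get("next_obs") or r.get("next_state"),
            metadata={k: v for k, v in r.items() if k not in _KNOWN},
        )
        for r in records
    ]
    return Trajectory(episode_id=episode_id, steps=steps)

traj = from_flat_log("ep-1", [{"obs": "s0", "action": "a0", "reward": 1.0,
                               "next_obs": "s1", "tool": "browser"}])
```

Once every source emits this schema, downstream filtering, anonymization, and offline-RL tooling only ever has to handle one format.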
Trajectory Anonymization
Strip PII, credentials, and sensitive business data from trajectories while preserving RL-relevant structure
Trajectory Filtering
Score and filter trajectories by quality, remove corrupted/incomplete episodes
Capture Agent Trajectories
Log agent rollouts with full state-action-reward-next_state tuples, tool calls, and timing
Capture Human Trajectories
Instrument production tools to log human expert actions, states, and outcomes as RL-ready trajectories
Multi-Step Task Decomposition
Break complex enterprise workflows into subtask chains with intermediate checkpoints
Synthetic Data Augmentation
Generate realistic variations of workflow data (user inputs, edge cases, adversarial inputs) without real PII
SOP-to-Task Parser
Convert natural language SOPs and runbooks into structured, machine-executable task specifications
Task Difficulty Calibration
Score and bucket tasks by difficulty using baseline agent performance
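The bucketing step reduces to thresholding baseline pass rates; a sketch with illustrative tier edges (lower pass rate means harder):

```python
def bucket_by_pass_rate(task_pass_rates, edges=(0.25, 0.5, 0.75)):
    """Bucket tasks into difficulty tiers from baseline agent pass rates.
    Tier 0 is hardest (pass rate below every edge); the top tier is easiest.
    Returns {tier_index: [task_ids]}."""
    buckets = {i: [] for i in range(len(edges) + 1)}
    for task, rate in task_pass_rates.items():
        tier = sum(rate >= e for e in edges)
        buckets[tier].append(task)
    return buckets

buckets = bucket_by_pass_rate({"t1": 0.1, "t2": 0.6, "t3": 0.9})
```

These tiers feed directly into the Curriculum Design bundle's ordering, and re-running the bucketing after each training round keeps the curriculum honest as the agent improves.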
Edge Case Mining
Extract rare but high-impact failure modes from production logs to create targeted task sets
Curriculum Design
Order tasks by difficulty, introduce new complexity dimensions progressively
Generate Task Variations
Programmatically produce 10K–100K+ task instances from templates, SOPs, and historical logs
Instrument Action Space
Define, constrain, and document the valid action space an agent can take within an environment
Build Stateful Env
Handle environments with persistent state across episodes (databases, file systems, user sessions)
Build Multi-Tool Env
Compose environments spanning multiple tools (IDE + terminal + browser + DB) into a single coherent action space
Build CLI Env
Create terminal/shell environments with filesystem state, command history, and outcome verification
Build Codebase Env
Set up repo-level coding environments with test harnesses, linting, compilation feedback loops
Build API Harness
Wrap real or mock APIs into instrumented RL-ready surfaces with deterministic reset, state capture, and action logging
Build UI Sandbox
Construct browser-based sandboxed environments where agents interact with realistic UI surfaces (forms, dashboards, multi-step wizards)