For agent researchers and AI engineers who want reproducible evaluation loops, not demo-grade scripts.
Have a challenge idea?