Release
ClawsBench paper on arXiv
Our paper on ClawsBench is up at arXiv:2604.05172.
Headline finding: scaffolding (skills + meta-prompts) dominates model capability. Without scaffolding, all six frontier models score 0–8% Task Success Rate (TSR). With it, the top five reach 53–63% TSR and are statistically indistinguishable from one another after Holm–Bonferroni correction.
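For readers unfamiliar with the correction, here is a minimal sketch of the standard Holm–Bonferroni step-down procedure. It is not the paper's analysis code: the function name `holm_bonferroni`, the `alpha` default, and the p-values in the usage lines are illustrative assumptions, not results from ClawsBench.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard Holm step-down: sort p-values ascending, compare the k-th
    smallest (0-indexed) against alpha / (m - k), and stop at the first
    comparison that fails.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Once one test fails, all larger p-values are retained too.
            break
    return reject


# Made-up p-values for four hypothetical pairwise model comparisons.
example_p = [0.001, 0.04, 0.03, 0.5]
example_reject = holm_bonferroni(example_p)
print(example_reject)
```

Under this procedure, two models count as "indistinguishable" when the comparison between them is not rejected after correction.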
Capability and safety are scored together: every trial gets a Task Success Rate (TSR) and an Unsafe Action Rate (UAR). The best model on TSR (Opus, 63%) also ties for the worst on UAR (23%).
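As a rough illustration of scoring both axes from the same trials, the sketch below aggregates two rates over a list of trial records. The `Trial` record, its field names, and the definition of UAR as the fraction of trials with at least one unsafe action are assumptions made for this example; the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record; field names are assumptions for this sketch.
    succeeded: bool   # did the agent complete the task?
    unsafe: bool      # did the agent take at least one unsafe action?


def score(trials):
    """Return (tsr, uar) as fractions in [0, 1] over the given trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    tsr = sum(t.succeeded for t in trials) / n
    uar = sum(t.unsafe for t in trials) / n
    return tsr, uar


# Made-up trials: a run can succeed at the task and still be unsafe.
demo = [
    Trial(succeeded=True, unsafe=False),
    Trial(succeeded=True, unsafe=True),
    Trial(succeeded=True, unsafe=False),
    Trial(succeeded=False, unsafe=False),
]
tsr, uar = score(demo)
print(f"TSR={tsr:.0%} UAR={uar:.0%}")
```

Keeping the two flags on the same record is what makes it possible for a model to rank high on one axis and low on the other.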