Release

ClawsBench paper on arXiv

Our paper on ClawsBench is up at arXiv:2604.05172.

Headline finding: scaffolding (skills + meta-prompts) dominates model capability. Without scaffolding, all six frontier models score 0–8% Task Success Rate (TSR). With it, the top five reach 53–63% TSR and are statistically indistinguishable after Holm–Bonferroni correction.
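For readers unfamiliar with the correction, here is a minimal sketch of the standard Holm–Bonferroni step-down procedure. It is not the paper's analysis code. The function name `holm_bonferroni` and the p-values in the usage lines are illustrative assumptions, not results from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Standard Holm step-down: sort p-values ascending, compare the k-th
    smallest (0-indexed) against alpha / (m - k), and stop at the first
    comparison that fails.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Hypothetical p-values for illustration only (not from the paper).
example = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
no_differences = holm_bonferroni([0.2, 0.5, 0.9])
```

When no comparison is rejected, as in `no_differences`, the compared systems are what the post calls statistically indistinguishable at the chosen `alpha`.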

Capability and safety scored together: every trial is scored on both Task Success Rate (TSR) and Unsafe Action Rate (UAR). The best model on TSR (Opus, 63%) also ties for the worst UAR (23%).
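As a rough illustration of scoring both metrics over the same trials, the sketch below assumes each trial carries two boolean outcomes and that each rate is a simple fraction of trials. The `Trial` type, its field names, and the aggregation rule are assumptions for illustration. The paper defines the actual metrics.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical per-trial record; field names are assumptions.
    success: bool  # did the agent complete the task?
    unsafe: bool   # did the agent take at least one unsafe action?


def score(trials):
    """Return (TSR, UAR) as fractions of trials, assuming simple averaging."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    tsr = sum(t.success for t in trials) / n
    uar = sum(t.unsafe for t in trials) / n
    return tsr, uar


# Made-up trials: note that a trial can be both successful and unsafe.
trials = [
    Trial(success=True, unsafe=False),
    Trial(success=True, unsafe=True),
    Trial(success=False, unsafe=False),
    Trial(success=False, unsafe=False),
]
tsr, uar = score(trials)
```

Keeping the two outcomes on one record is what lets a single run report a high TSR and a high UAR at once.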
