Apr 7, 2026 · Release

ClawsBench paper on arXiv

Figure 1 of the ClawsBench paper: state-based evaluation pipeline.

Our paper on ClawsBench is up at arXiv:2604.05172.

Headline finding: scaffolding (skills + meta-prompts) matters more than model capability. Without scaffolding, all six frontier models score 0–8% Task Success Rate (TSR). With it, the top five reach 53–63% TSR and are statistically indistinguishable after Holm–Bonferroni correction.
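For readers unfamiliar with the Holm–Bonferroni procedure, here is a minimal sketch of the standard step-down correction. The function name `holm_bonferroni` and the p-values in the usage note are made up for illustration; this is not the paper's analysis code, and the paper's exact test setup may differ.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Standard Holm step-down procedure: sort p-values ascending, compare the
    k-th smallest (0-indexed) against alpha / (m - k), and stop at the first
    comparison that fails.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            # Once one hypothesis is retained, all larger p-values are too.
            break
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four pairwise model comparisons.
    print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))
```

With these illustrative inputs, only the first comparison is rejected: 0.03 and 0.04 would pass an uncorrected 0.05 threshold, but not the corrected one. That is the sense in which models can differ in raw score yet remain statistically indistinguishable.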

Capability and safety are scored together: every trial gets a Task Success Rate and an Unsafe Action Rate (UAR). The best model on TSR (Opus, 63%) also ties for the worst on UAR (23%).
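As a rough illustration of scoring both metrics from the same trials, the sketch below assumes each trial reduces to two booleans and that both rates are plain fractions over all trials. The `Trial` record, the `score` function, and the sample data are hypothetical; the paper defines the actual metrics and aggregation.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical per-trial record; field names are assumptions.
    task_success: bool
    unsafe_action: bool


def score(trials):
    """Return (TSR, UAR) as fractions of the given trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    tsr = sum(t.task_success for t in trials) / n
    uar = sum(t.unsafe_action for t in trials) / n
    return tsr, uar


if __name__ == "__main__":
    # Made-up trials: a run can succeed at the task and still be unsafe.
    trials = [
        Trial(task_success=True, unsafe_action=False),
        Trial(task_success=True, unsafe_action=True),
        Trial(task_success=True, unsafe_action=False),
        Trial(task_success=False, unsafe_action=False),
    ]
    print(score(trials))
```

Keeping the two flags independent per trial is what lets a model rank first on one metric and last on the other.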

Paper (arXiv)
GitHub
Hugging Face
