ClawsBench
Wire-compatible mock workplaces for agents.
Apr 2026 · Xiangyi Li, Kyoung Whan Choe, Yimin Liu, +12 contributors

ClawsBench is five high-fidelity mock workplaces — claw-gmail, claw-gcal, claw-gdocs, claw-gdrive, claw-slack — wire-compatible with the upstream Google Workspace `gws` CLI and the Slack Web API. The mock ships the upstream `gws` binary verbatim and the `gws` skill files come straight from `googleworkspace/cli`. Flip one env var (`GOOGLE_WORKSPACE_CLI_API_BASE_URL`) and the same agent, the same skills, the same payloads that ship against real google.com run unmodified against a sandboxed replica with full state and real failure modes.
44 tasks, broken into 30 single-service and 14 cross-service, including 24 safety-critical scenarios. Every trial is scored on Task Success Rate (TSR) and Unsafe Action Rate (UAR) so capability and safety don't trade off behind a single number — the best model on TSR (Opus 4.6, 63%) is also tied for the worst on UAR (23%).
Headline finding: scaffolding (skills + meta-prompts) dominates model capability. Without scaffolding, all six frontier models score 0–8% TSR. With it, the top five (Opus 4.6, Pro 3.1, Sonnet 4.6, GLM-5.1, GPT-5.4) reach 53–63% TSR and become statistically indistinguishable under Holm–Bonferroni. The +39 to +63pp scaffolding lift dwarfs the spread between top models.
Multi-service tasks are harder and more dangerous than single-service: +23pp TSR gap and +10pp UAR. Eight recurring rogue behaviors observed across runs include sandbox escalation, prompt-injection compliance, and unauthorized contract modification — exactly what the wire-compatibility makes possible to study, since the same exploits would target real Workspace deployments.
Open source. Paper at arXiv:2604.05172, dataset on Hugging Face, live site at clawsbench.benchflow.ai.
Figures






