EnvironmentWorkplaceCapability + Safety

ClawsBench

Wire-compatible mock workplaces for agents.

Apr 2026 · Xiangyi Li, Kyoung Whan Choe, Yimin Liu, +12 contributors

ClawsBench TSR and UAR forest plot: capability vs safety across six frontier models with 95% confidence intervals

ClawsBench is five high-fidelity mock workplaces — claw-gmail, claw-gcal, claw-gdocs, claw-gdrive, claw-slack — wire-compatible with the upstream Google Workspace `gws` CLI and the Slack Web API. The mock ships the upstream `gws` binary verbatim and the `gws` skill files come straight from `googleworkspace/cli`. Flip one env var (`GOOGLE_WORKSPACE_CLI_API_BASE_URL`) and the same agent, the same skills, the same payloads that ship against real google.com run unmodified against a sandboxed replica with full state and real failure modes.

44 tasks, broken into 30 single-service and 14 cross-service, including 24 safety-critical scenarios. Every trial is scored on Task Success Rate (TSR) and Unsafe Action Rate (UAR) so capability and safety don't trade off behind a single number — the best model on TSR (Opus 4.6, 63%) is also tied for the worst on UAR (23%).

Headline finding: scaffolding (skills + meta-prompts) dominates model capability. Without scaffolding, all six frontier models score 0–8% TSR. With it, the top five (Opus 4.6, Pro 3.1, Sonnet 4.6, GLM-5.1, GPT-5.4) reach 53–63% TSR and become statistically indistinguishable under Holm–Bonferroni. The +39 to +63pp scaffolding lift dwarfs the spread between top models.

Multi-service tasks are harder and more dangerous than single-service: +23pp TSR gap and +10pp UAR. Eight recurring rogue behaviors observed across runs include sandbox escalation, prompt-injection compliance, and unauthorized contract modification — exactly what the wire-compatibility makes possible to study, since the same exploits would target real Workspace deployments.

Open source. Paper at arXiv:2604.05172, dataset on Hugging Face, live site at clawsbench.benchflow.ai.

Figures

ClawsBench evaluation pipeline diagram
State-based evaluation flow: agent harness layer → mock services backed by SQLite → evaluator yielding performance and safety scores.
Example ClawsBench task walkthrough across Slack, Calendar, Docs
Walkthrough of the multi-rebalance-on-call-rotation task: the agent reads Slack, Calendar, and Docs across four steps; final state validated, score 1.0 (full completion, no violations).
claw-gmail mock interface
claw-gmail · 62 endpoints
claw-gcal mock interface
claw-gcal · 22 endpoints
claw-gdrive mock interface
claw-gdrive · 31 endpoints
claw-gdocs mock interface
claw-gdocs · 26 endpoints
claw-slack mock interface
claw-slack · 45 endpoints