BenchFlow documentation
The runtime that ships SkillsBench, ClawsBench, and verified ACP agents.
Getting started
Install BenchFlow and run your first agent against a verifiable task.
Concepts
The core BenchFlow vocabulary: tasks, harnesses, agents, environments, scenes, and verifiers.
Authoring tasks
How to write a verifiable BenchFlow task end-to-end.
Skill evals
Run skill evals through the BenchFlow runtime: what gets measured and how.
Progressive disclosure
Lifecycle for environments that reveal information across rounds.
Sandbox hardening
How BenchFlow sandboxes prevent oracle leakage and other failure modes.
Use cases
What BenchFlow is used for: evals, post-training, dataset curation.
CLI reference
BenchFlow command-line reference.
Python API
BenchFlow Python SDK reference.