BenchFlow documentation

The runtime that ships SkillsBench, ClawsBench, and verified ACP agents.

  • Getting started

    Install BenchFlow and run your first agent against a verifiable task.

  • Concepts

    Tasks, harnesses, agents, environments, scenes, verifiers — the core BenchFlow vocabulary.

  • Authoring tasks

    How to write a verifiable BenchFlow task end-to-end.

  • Skill evals

    Run skill evals through the BenchFlow runtime: what gets measured and how.

  • Progressive disclosure

    The lifecycle of environments that reveal information progressively across rounds.

  • Sandbox hardening

    How BenchFlow sandboxes prevent oracle leakage and other failure modes.

  • Use cases

    What BenchFlow is used for: evals, post-training, dataset curation.

  • CLI reference

    BenchFlow command-line reference.

  • Python API

    BenchFlow Python SDK reference.
