About
A frontier environment lab for AI agents.
BenchFlow is a data lab. We build the environments AI agents need to learn real computer work, and to be evaluated on it, rather than on static prompts.
The framing follows a simple progression. The first generation of AI data was labels. The second was post-training data: SFT, preferences, reward labels, short trajectories. The third is environments: stateful workplaces with services, files, tools, verifiers, traces, and replay. Models in 2026 don’t get better from more static prompts; they get better by running through realistic environments and being judged on the whole workflow.
We ship three connected projects: SkillsBench for procedural skills, ClawsBench for high-fidelity simulated workplaces, and BenchFlow, the runtime that runs both.
BenchFlow started in late 2024, before the current agent-eval wave. We work with frontier model labs, the agent skills community, and academic partners through Agent Skills ’26 at CAIS and the SkillsBench 1.0 launch in May.