CLI reference

BenchFlow command-line reference.


BenchFlow uses a resource-verb pattern: bench <resource> <verb>.


bench agent

bench agent list

List all registered agents with their protocol and auth requirements.

bench agent list

bench agent show

Show details for a specific agent.

bench agent show gemini

bench eval

bench eval create

Create and run an evaluation. This is the primary command for running benchmarks.

# From YAML config
bench eval create -f benchmarks/tb2-gemini-baseline.yaml

# Inline
bench eval create \
  -t .ref/terminal-bench-2 \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  -e daytona \
  -c 64 \
  --sandbox-setup-timeout 300

| Flag | Default | Description |
| --- | --- | --- |
| --config, -f | | YAML config file |
| --tasks-dir, -t | | Task dir (single task with task.toml, or parent of many tasks) |
| --agent, -a | gemini | Agent name |
| --model, -m | gemini-3.1-flash-lite-preview | Model ID |
| --env, -e | docker | Environment: docker or daytona |
| --concurrency, -c | 4 | Max concurrent tasks (batch mode only) |
| --jobs-dir, -o | jobs | Output directory |
| --sandbox-user | agent | Sandbox user (null for root) |
| --sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
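
For scripted runs, the documented flags can be assembled programmatically. The following is a minimal Python sketch and not part of BenchFlow: the helper name eval_create_argv is hypothetical, and it emits only the short flags listed for bench eval create, omitting any option left unset so that the CLI's own defaults apply.

```python
from typing import List, Optional


def eval_create_argv(
    tasks_dir: str,
    agent: Optional[str] = None,
    model: Optional[str] = None,
    env: Optional[str] = None,
    concurrency: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `bench eval create` (hypothetical helper).

    Options left as None are omitted, so the CLI falls back to its
    documented defaults (agent gemini, env docker, concurrency 4).
    """
    argv = ["bench", "eval", "create", "-t", tasks_dir]
    optional_flags = [("-a", agent), ("-m", model), ("-e", env), ("-c", concurrency)]
    for flag, value in optional_flags:
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Relies entirely on the documented defaults.
print(eval_create_argv("tasks/my-task"))
# Overrides only the environment and the concurrency.
print(eval_create_argv("tasks/my-task", env="daytona", concurrency=64))
```

On a machine where bench is installed, the resulting list could be passed to subprocess.run(argv, check=True).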

bench eval list

List completed evaluations from a jobs directory.

bench eval list jobs/

bench skills

bench skills eval

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  --env daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/

bench tasks check

Validate a task directory (Dockerfile, instruction.md, tests/).

bench tasks check tasks/my-task
bench tasks check tasks/my-task --rubric rubrics/quality.md

bench train

bench train create

Run a reward-based training sweep.

bench train create \
  -t tasks/ \
  -a gemini \
  --sweeps 5 \
  --export ./training-data

bench environment

bench environment create

Create an environment from a task directory. This spins up a sandbox.

bench environment create tasks/my-task --backend daytona

bench environment list

List active Daytona sandboxes.

bench environment list

YAML Config Format

Single scene

task_dir: .ref/terminal-bench-2
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300

scenes:
  - name: solve
    roles:
      - name: agent
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: agent

Legacy flat (auto-converted)

task_dir: .ref/terminal-bench-2
agent: gemini
model: gemini-3.1-flash-lite-preview
environment: daytona
concurrency: 64
max_retries: 2
sandbox_setup_timeout: 300
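
To illustrate what "auto-converted" could mean, here is a minimal Python sketch that rewrites a legacy flat config as a scene-based one. It is an assumption, not BenchFlow's actual conversion code: the function name legacy_to_scenes is hypothetical, the scene name "solve" and role name "agent" are copied from the scene-based example at the top of this section, and passing other keys such as max_retries through unchanged is a guess.

```python
def legacy_to_scenes(flat: dict) -> dict:
    """Rewrite a legacy flat config as a single-scene config (illustrative only)."""
    # Keep every top-level key except the two that move into the role definition.
    config = {key: value for key, value in flat.items() if key not in ("agent", "model")}
    config["scenes"] = [
        {
            "name": "solve",
            "roles": [{"name": "agent", "agent": flat["agent"], "model": flat["model"]}],
            "turns": [{"role": "agent"}],
        }
    ]
    return config


legacy = {
    "task_dir": ".ref/terminal-bench-2",
    "agent": "gemini",
    "model": "gemini-3.1-flash-lite-preview",
    "environment": "daytona",
    "concurrency": 64,
    "max_retries": 2,
    "sandbox_setup_timeout": 300,
}
converted = legacy_to_scenes(legacy)
print(converted["scenes"][0]["roles"])
```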

Multi-scene (BYOS skill generation)

task_dir: tasks/
environment: daytona
concurrency: 10
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver
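
In the multi-scene example, every turn names a role declared under the same scene's roles list. The sketch below checks that property on a config that has already been loaded into a Python dict. Whether BenchFlow itself enforces this rule is an assumption, and undeclared_turn_roles is a hypothetical helper, not a BenchFlow API.

```python
def undeclared_turn_roles(config: dict) -> list:
    """Return (scene name, role) pairs where a turn uses a role its scene does not declare."""
    problems = []
    for scene in config.get("scenes", []):
        declared = {role["name"] for role in scene.get("roles", [])}
        for turn in scene.get("turns", []):
            if turn["role"] not in declared:
                problems.append((scene["name"], turn["role"]))
    return problems


# Dict form of the multi-scene config (agent and model fields omitted for brevity).
config = {
    "scenes": [
        {
            "name": "skill-gen",
            "roles": [{"name": "creator"}],
            "turns": [{"role": "creator"}],
        },
        {
            "name": "solve",
            "roles": [{"name": "solver"}],
            "turns": [{"role": "solver"}],
        },
    ]
}
print(undeclared_turn_roles(config))
```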

Deprecated Commands

These still work but are hidden from --help:

| Old command | Replacement |
| --- | --- |
| benchflow run | bench eval create -t <task> |
| benchflow job | bench eval create -f <yaml> |
| benchflow agents | bench agent list |
| benchflow eval | bench skills eval |
| benchflow metrics | bench eval list --detail |
| benchflow view | (planned: bench trajectory show) |
| benchflow cleanup | bench environment list + delete |
| benchflow skills install | Skills are folders, not packages |
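
When migrating existing scripts, a lookup like the following can flag deprecated invocations. This is a hedged Python sketch: the suggest_replacement helper is hypothetical, and it covers only the deprecated commands that have a direct replacement listed in this section.

```python
from typing import Optional

# Deprecated commands with a direct replacement, as listed in this section.
REPLACEMENTS = {
    "benchflow run": "bench eval create -t <task>",
    "benchflow job": "bench eval create -f <yaml>",
    "benchflow agents": "bench agent list",
    "benchflow eval": "bench skills eval",
    "benchflow metrics": "bench eval list --detail",
}


def suggest_replacement(command_line: str) -> Optional[str]:
    """Return the documented replacement for a deprecated command, or None."""
    # The deprecated commands above are identified by their first two words.
    key = " ".join(command_line.split()[:2])
    return REPLACEMENTS.get(key)


print(suggest_replacement("benchflow agents"))
print(suggest_replacement("bench agent list"))
```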