CLI reference

BenchFlow command-line reference.


BenchFlow commands follow a resource-verb pattern: bench <resource> <verb>. The exception is bench run, which takes a task directory directly.


bench agent

bench agent list

List all registered agents with their protocol and auth requirements.

bench agent list

bench agent show

Show details for a specific agent.

bench agent show gemini

bench run

bench run

Run one task directory with one agent. This is the most direct way to check a single task on the local Docker backend, Daytona, or Modal.

# Single task with Gemini on Daytona
bench run tasks/edit-pdf \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --backend daytona

# Single task with mounted skills and the recommended skill nudge
bench run tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --backend daytona \
  --skills-dir tasks/pdf-fix/environment/skills \
  --ae BENCHFLOW_SKILL_NUDGE=name

| Flag | Default | Description |
| --- | --- | --- |
| TASK_DIR | | Task directory containing task.toml |
| --agent, -a | claude-agent-acp | Agent name from the registry |
| --model, -m | Agent default | Model ID |
| --backend, -b | docker | Backend: docker, daytona, or modal |
| --prompt, -p | instruction.md | Prompt text; repeat for multi-turn |
| --jobs-dir, -o | jobs | Output directory |
| --agent-env, --ae | | Agent environment variable as KEY=VALUE; repeatable |
| --skills-dir, -s | | Skills directory to deploy into the sandbox |
| --sandbox-user | agent | Non-root sandbox user; pass none for root |
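
Because --agent-env is repeatable, several variables can be passed by repeating the flag. The sketch below is one way to generate those flags from a list of KEY=VALUE lines. The function name ae_flags_from_stdin and the agent.env file are hypothetical, not part of BenchFlow, and values containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of BenchFlow): turn KEY=VALUE lines on stdin
# into repeated --ae flags. Blank lines and lines starting with '#' are skipped.
ae_flags_from_stdin() {
  flags=""
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;                      # skip blanks and comments
      *=*) flags="$flags --ae $line" ;;         # one --ae per KEY=VALUE pair
      *) echo "skipping malformed line: $line" >&2 ;;
    esac
  done
  # Strip the leading space before printing.
  printf '%s\n' "${flags# }"
}
```

A possible use, relying on word splitting of the unquoted substitution: bench run tasks/pdf-fix --agent gemini $(ae_flags_from_stdin < agent.env)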

When mounting skills, these docs recommend passing --ae BENCHFLOW_SKILL_NUDGE=name. It prepends a short hint telling the agent which skills are available and where to read them. The description and full modes give more verbose hints. Omit the variable to keep BenchFlow's runtime default, which is off.
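
A wrapper script could guard against typos in the nudge mode before a run starts. This is a minimal sketch that assumes the three modes named above are the only valid ones; nudge_flag is a hypothetical helper, not a BenchFlow command.

```shell
#!/bin/sh
# Hypothetical helper (not part of BenchFlow): validate a skill-nudge mode
# and print the matching --ae flag.
nudge_flag() {
  case "$1" in
    name|description|full)
      printf '%s\n' "--ae BENCHFLOW_SKILL_NUDGE=$1" ;;
    "")
      # No mode given: print nothing so the runtime default (off) applies.
      ;;
    *)
      echo "unknown nudge mode: $1" >&2
      return 1 ;;
  esac
}
```

For example: bench run tasks/pdf-fix --agent gemini --skills-dir tasks/pdf-fix/environment/skills $(nudge_flag name)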


bench eval

bench eval create

Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.

# From YAML config
bench eval create -f benchmarks/skillsbench-claude-glm51.yaml

# From remote repo
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  -e daytona \
  -c 64 \
  --sandbox-setup-timeout 300

# From local directory
bench eval create -t ./tasks -a gemini -m gemini-3.1-flash-lite-preview

| Flag | Default | Description |
| --- | --- | --- |
| --config, -f | | YAML config file |
| --tasks-dir, -t | | Local task dir (single task with task.toml, or parent of many) |
| --source-repo | | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
| --source-path | | Subpath within the repo (e.g. tasks) |
| --source-ref | | Branch or tag to clone (e.g. main) |
| --agent, -a | claude-agent-acp | Agent name |
| --model, -m | Agent default | Model ID |
| --env, -e | docker | Environment: docker, daytona, or modal |
| --concurrency, -c | 4 | Max concurrent tasks (batch mode only) |
| --jobs-dir, -o | jobs | Output directory |
| --sandbox-user | agent | Sandbox user (null for root) |
| --sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
| --skills-dir, -s | | Skills directory to deploy into each task sandbox |
| --agent-env, --ae | | Agent environment variable as KEY=VALUE; repeatable |
| --exclude | | Task name to exclude from batch; repeatable |
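
Since --exclude is repeatable, skipping several tasks means repeating the flag once per task name. The sketch below builds that flag list from its arguments. exclude_flags is a hypothetical helper, and it assumes task names contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of BenchFlow): print one --exclude flag
# per task name given as an argument.
exclude_flags() {
  out=""
  for name in "$@"; do
    out="$out --exclude $name"
  done
  # Strip the leading space before printing.
  printf '%s\n' "${out# }"
}
```

For example, with placeholder task names: bench eval create -t ./tasks -a gemini $(exclude_flags slow-task flaky-task)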

bench eval list

List completed evaluations from a jobs directory.

bench eval list jobs/

bench skills

bench skills eval

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  --env daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/

bench tasks check

Validate a task directory (Dockerfile, instruction.md, tests/).

bench tasks check tasks/my-task
bench tasks check tasks/my-task --rubric rubrics/quality.md
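
A quick local pre-check can catch a missing file before bench tasks check is invoked. This is only a sketch of the file-presence part: it assumes task.toml and instruction.md sit at the top of the task directory, and because the Dockerfile's location is not stated here, it searches the whole directory for one. precheck_task is a hypothetical name, and the real command may validate more than this.

```shell
#!/bin/sh
# Hypothetical pre-check (not part of BenchFlow): confirm the files named
# in this section exist before running `bench tasks check`.
precheck_task() {
  dir=$1
  status=0
  for f in task.toml instruction.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f" >&2
      status=1
    fi
  done
  if [ ! -d "$dir/tests" ]; then
    echo "missing directory: tests/" >&2
    status=1
  fi
  # Assumption: the Dockerfile may live anywhere under the task directory.
  if [ -z "$(find "$dir" -name Dockerfile -type f 2>/dev/null | head -n 1)" ]; then
    echo "missing Dockerfile" >&2
    status=1
  fi
  return "$status"
}
```

For example: precheck_task tasks/my-task && bench tasks check tasks/my-task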

bench train

bench train create

Run a reward-based training sweep.

bench train create \
  -t tasks/ \
  -a gemini \
  --sweeps 5 \
  --export ./training-data

bench environment

bench environment create

Create an environment from a task directory. This spins up a sandbox.

bench environment create tasks/my-task --backend daytona

bench environment list

List active Daytona sandboxes.

bench environment list

YAML Config Format

Batch config with skills and skill nudge

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2
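
When the same batch config is reused across agents or models, a script can generate it instead of editing the file by hand. The sketch below writes the config shown above with the agent and model filled in from arguments. write_batch_config and the batch.yaml file name are hypothetical; every key and value comes from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of BenchFlow): write a batch config file.
# Arguments: output path, agent name, model ID.
write_batch_config() {
  cat > "$1" <<EOF
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: $2
model: $3
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2
EOF
}
```

For example: write_batch_config batch.yaml gemini gemini-3.1-flash-lite-preview && bench eval create -f batch.yaml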

Multi-scene (BYOS skill generation)

Use the Python API for multi-scene experiments. bench eval create -f is for batch job configs; scene configs are loaded with benchflow.trial_yaml or built directly in Python.

task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

Deprecated Commands

These still work but are hidden from --help:

| Old command | Replacement |
| --- | --- |
| benchflow run | bench run <task> |
| benchflow job | bench eval create -f <yaml> |
| benchflow agents | bench agent list |
| benchflow eval | bench skills eval |
| benchflow metrics | bench eval list --detail |
| benchflow view | (planned: bench trajectory show) |
| benchflow cleanup | bench environment list + delete |
| benchflow skills install | Skills are folders, not packages |
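
When migrating older scripts, a small lookup can print the replacement for each deprecated command. The sketch below encodes only the rows of the table that have a direct replacement; benchflow view and benchflow skills install fall through to the error branch. replacement_for is a hypothetical helper, not a BenchFlow command.

```shell
#!/bin/sh
# Hypothetical helper (not part of BenchFlow): print the replacement for a
# deprecated benchflow command, based on the table above.
replacement_for() {
  case "$1" in
    "benchflow run")     echo "bench run <task>" ;;
    "benchflow job")     echo "bench eval create -f <yaml>" ;;
    "benchflow agents")  echo "bench agent list" ;;
    "benchflow eval")    echo "bench skills eval" ;;
    "benchflow metrics") echo "bench eval list --detail" ;;
    "benchflow cleanup") echo "bench environment list + delete" ;;
    *)
      echo "no direct replacement listed for: $1" >&2
      return 1 ;;
  esac
}
```

For example: replacement_for "benchflow agents"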