Getting started

Install BenchFlow and run your first agent against a verifiable task.

A 5-minute path from install to first eval.

Prerequisites

  • Python 3.12+
  • uv
  • Docker for local sandboxes, DAYTONA_API_KEY for Daytona cloud runs, or Modal auth for Modal-backed runs
  • An API key or subscription/OAuth auth for at least one agent (see below)

Install

uv tool install benchflow

This gives you the benchflow (alias bench) CLI plus the Python SDK. To install for editable development:

git clone https://github.com/benchflow-ai/benchflow
cd benchflow
uv sync --extra dev --locked

Auth: OAuth, long-lived token, or API key

You don't need an API key if you're a Claude / Codex / Gemini subscriber. Three options, pick one per agent:

Option 1 — Subscription OAuth from host CLI login

If you've logged into the agent's CLI on your host (claude login, codex --login, gemini interactive flow), benchflow picks up the credential file and copies it into the sandbox. No API key billing.

| Agent | How to log in on the host | What benchflow detects | Replaces env var |
| --- | --- | --- | --- |
| claude-agent-acp | claude login (Claude Code CLI) | ~/.claude/.credentials.json | ANTHROPIC_API_KEY |
| codex-acp | codex --login (Codex CLI) | ~/.codex/auth.json | OPENAI_API_KEY |
| gemini | gemini (interactive login) | ~/.gemini/oauth_creds.json | GEMINI_API_KEY |

When benchflow finds the detect file, you'll see:

Using host subscription auth (no ANTHROPIC_API_KEY set)

Option 2 — Long-lived OAuth token (CI / headless)

For CI pipelines, scripts, or anywhere the host can't run an interactive browser login, generate a 1-year OAuth token with claude setup-token and export it:

claude setup-token            # walks you through browser auth, prints a token
export CLAUDE_CODE_OAUTH_TOKEN=<paste-token>

benchflow auto-inherits CLAUDE_CODE_OAUTH_TOKEN from your shell into the sandbox; the Claude CLI inside reads it directly. Auth precedence is the same as for plain claude (see the Anthropic docs): API keys override OAuth tokens, so unset ANTHROPIC_API_KEY if you want the token to win.

claude setup-token only authenticates Claude. Codex and Gemini do not have an equivalent today — use Option 1 (host login) or Option 3 (API key).

Option 3 — API key

Set the API-key env var directly. Works with every agent:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GEMINI_API_KEY=...
export LLM_API_KEY=...           # OpenHands / LiteLLM-compatible providers

benchflow auto-inherits well-known API key env vars from your shell into the sandbox.

Precedence

If multiple credentials are set, benchflow / the agent CLI uses (high to low): cloud provider creds → ANTHROPIC_AUTH_TOKEN → ANTHROPIC_API_KEY → apiKeyHelper → CLAUDE_CODE_OAUTH_TOKEN → host subscription OAuth. To force a lower-priority option, unset the higher one in your shell before running.

Run your first eval

# Single task from a remote repo
GEMINI_API_KEY=... bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks/edit-pdf \
  -a gemini \
  -m gemini-3.1-pro-preview \
  -e docker

# Single task from local path
GEMINI_API_KEY=... bench eval create \
  -t tasks/edit-pdf \
  -a gemini \
  -m gemini-3.1-pro-preview \
  -e daytona \
  --skills-dir tasks/edit-pdf/environment/skills \
  --ae BENCHFLOW_SKILL_NUDGE=name

# A whole batch from YAML config
bench eval create -f benchmarks/skillsbench-claude-glm51.yaml

# Batch from remote repo with concurrency
GEMINI_API_KEY=... bench eval create \
    --source-repo benchflow-ai/skillsbench --source-path tasks \
    -a gemini -m gemini-3.1-pro-preview -e daytona -c 32

# List the registered agents
bench agent list

bench eval create is the primary command for running evaluations: it works for single tasks, batch runs, and remote repos. Use --source-repo <org/repo> --source-path <subpath> to fetch from a remote repo, -t <tasks-dir> for a local directory, or -f <config.yaml> for a YAML config. Results land under jobs/<job-name>/<trial-name>/, where result.json holds the verifier output and trajectory/acp_trajectory.jsonl holds the full agent trace.

When you mount skills, use BENCHFLOW_SKILL_NUDGE=name as the default choice. It tells the agent which skills are available and where to read them. For more context in the prompt, use description or full. BenchFlow's runtime default is off, so omit the env var to leave the nudge disabled.
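
When scripting local runs, the skill flags can be assembled and validated before the CLI is called. This is a minimal sketch that uses only the flags shown in the examples above (-t, -a, -m, -e, --skills-dir, --ae); eval_command is a hypothetical helper, not part of the benchflow SDK, and it builds the command without running it.

```python
import shlex

# Nudge modes named in the text above.
NUDGE_MODES = ("name", "description", "full")


def eval_command(task_dir, agent, model, env, skills_dir=None, nudge="name"):
    """Build the argv for a local-path `bench eval create` run.

    Hypothetical helper; pass nudge=None to omit BENCHFLOW_SKILL_NUDGE.
    """
    argv = ["bench", "eval", "create", "-t", task_dir, "-a", agent, "-m", model, "-e", env]
    if skills_dir is not None:
        argv += ["--skills-dir", skills_dir]
        if nudge is not None:
            if nudge not in NUDGE_MODES:
                raise ValueError(f"unknown nudge mode: {nudge!r}")
            argv += ["--ae", f"BENCHFLOW_SKILL_NUDGE={nudge}"]
    return argv


if __name__ == "__main__":
    argv = eval_command(
        "tasks/edit-pdf", "gemini", "gemini-3.1-pro-preview", "daytona",
        skills_dir="tasks/edit-pdf/environment/skills",
    )
    print(shlex.join(argv))  # pass argv to subprocess.run to execute it
```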

Run from Python

The CLI is a thin shim over the Python API. For programmatic use:

import asyncio

import benchflow as bf
from benchflow.trial import TrialConfig, Scene
from benchflow.task_download import resolve_source


async def main():
    config = TrialConfig(
        task_path=resolve_source("benchflow-ai/skillsbench", path="tasks/edit-pdf"),
        scenes=[Scene.single(agent="gemini", model="gemini-3.1-pro-preview")],
        environment="docker",
    )
    # bf.run is awaited, so in a plain script it has to run inside an event loop
    result = await bf.run(config)
    print(result.rewards)         # e.g. {'reward': 1.0}
    print(result.n_tool_calls)


asyncio.run(main())

Trial is decomposable — invoke each lifecycle phase individually for custom flows. See Concepts: trial lifecycle.

| If you want to… | Read |
| --- | --- |
| Understand the model: Trial, Scene, Role, Verifier | Concepts |
| Author a task | Task authoring |
| Run multi-agent patterns (coder/reviewer, simulated user, BYOS) | Use cases |
| Run multi-round single-agent (progressive disclosure) | Progressive disclosure |
| Evaluate skills, not tasks | Skill eval |
| Understand the security model | Sandbox hardening |
| CLI flags + commands | CLI reference |
| Python API surface | Python API reference |