A BenchFlow task packages an instruction, a sandboxed environment, and a verifier into a directory that BenchFlow runs and scores automatically.
Directory layout
my-task/
├── task.toml          # timeouts, resources, metadata
├── instruction.md     # what the agent must do
├── environment/
│   └── Dockerfile     # sandbox image
├── tests/
│   └── test.sh        # verifier entry point
└── solution/          # optional: reference/oracle solution
    └── solve.sh
tests/ may also include test_outputs.py (pytest module called by test.sh).
task.toml
version = "1.0"
[metadata] # optional, freeform
author_name = "alice"
difficulty = "easy" # easy / medium / hard
category = "programming"
tags = ["bash", "files"]
[agent]
timeout_sec = 300 # REQUIRED — seconds before agent is killed
# user = "agent" # optional — run agent as this user/UID
[verifier]
timeout_sec = 120 # optional (default 600)
[environment]
cpus = 1 # default 1
memory_mb = 2048 # default 2048
storage_mb = 10240 # default 10240
allow_internet = false # default true
env = { OPENAI_API_KEY = "${OPENAI_API_KEY}" } # host vars to inject
Built-in mock services — if the Dockerfile references a service binary (claw-gmail, claw-slack, claw-gcal, claw-gdoc, claw-gdrive), BenchFlow starts it automatically. No [services] section needed.
Install tooling to shared prefixes, not /root — when a task image ships Node.js, Python tools, or agent binaries that the sandbox user must execute, install them to /usr/local/bin, /usr/local/lib, or /opt, not /root/.nvm or /root/.local/bin. setup_sandbox_user() creates the non-root user, prepares small config/auth dirs, and chowns the workspace — it does not clone /root into the sandbox home. Legacy images that already install tools under /root still work via a narrow symlink fallback, but shared prefixes are the supported path. Pre-creating the sandbox user in the Dockerfile is an optional speedup, not a requirement.
instruction.md
The first prompt sent to the agent. Write it as you would for a skilled developer:
- State the precise goal in the first sentence.
- Name exact files or paths the agent must create or modify.
- Specify constraints (no external libraries, must pass existing tests, etc.).
- Don't mention the verifier or reward.txt; those are internal.
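The last guideline is easy to break by accident. As a rough sketch, a small pre-commit style check could flag instruction text that leaks verifier internals. The helper name `find_internal_mentions` and the term list are assumptions for illustration, not part of BenchFlow.

```python
# Hypothetical list of terms that reveal verifier internals; extend as needed.
INTERNAL_TERMS = ("reward.txt", "verifier")


def find_internal_mentions(instruction: str) -> list[str]:
    """Return the internal terms that appear in the instruction text."""
    lowered = instruction.lower()
    return [term for term in INTERNAL_TERMS if term in lowered]


print(find_internal_mentions("Create /app/hello.txt containing 'Hello, world!'."))  # []
print(find_internal_mentions("Write 1 to reward.txt when done."))  # ['reward.txt']
```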
Multi-turn prompts — use a Scene with multiple Turns. A None prompt means "use instruction.md":
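from benchflow.trial import TrialConfig, Scene, Role, Turn

config = TrialConfig(
    task_path="tasks/my-task",
    scenes=[Scene(
        roles=[Role("agent", "gemini", "gemini-3.1-flash-lite-preview")],
        turns=[
            Turn("agent"),  # no prompt given, so instruction.md is used
            Turn("agent", "Review your solution and fix any test failures."),
        ],
    )],
    backend="daytona",
)
# `bf` is not defined in this snippet; it is presumably the benchflow package
# (for example `import benchflow as bf`). `await` needs an async context.
result = await bf.run(config)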
Verifier contract (tests/test.sh)
After the agent finishes, Harbor copies tests/ to /tests/ and runs /tests/test.sh. The working directory is the Dockerfile's WORKDIR (typically /app/ in the example Dockerfile below).
Your script must write a single float (0.0–1.0) to /logs/verifier/reward.txt.
| Path | Contents |
|---|---|
| /app/ | Agent's working directory |
| /tests/ | Your tests/ directory |
| /solution/ | solution/ (oracle runs only) |
| /logs/verifier/ | Write reward.txt (and optionally ctrf.json) here |
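If part of the verifier is written in Python, the contract above (a single float between 0.0 and 1.0 in /logs/verifier/reward.txt) can be wrapped in a small helper. This is a sketch, not a BenchFlow API: `write_reward` is a hypothetical name, and clamping out-of-range scores is a design choice made here rather than documented behavior.

```python
import math
from pathlib import Path


def write_reward(score: float, reward_path: str = "/logs/verifier/reward.txt") -> float:
    """Clamp score to [0.0, 1.0] and write it as the only content of reward.txt."""
    score = float(score)
    if math.isnan(score):
        raise ValueError("reward must be a real number")
    clamped = max(0.0, min(1.0, score))

    path = Path(reward_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create /logs/verifier if absent
    path.write_text(f"{clamped}\n")
    return clamped
```

Keeping the path as a parameter lets the helper be exercised against a temporary directory outside the sandbox.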
Pure bash verifier
#!/bin/bash
REWARD=0
if [ -f /app/hello.txt ] && [ "$(cat /app/hello.txt | tr -d '\n')" = "Hello, world!" ]; then
    REWARD=1
fi
echo "$REWARD" > /logs/verifier/reward.txt
pytest verifier
#!/bin/bash
curl -LsSf https://astral.sh/uv/0.9.7/install.sh | sh
source "$HOME/.local/bin/env"
uvx \
    --with pytest==8.4.1 \
    --with pytest-json-ctrf==0.3.5 \
    pytest --ctrf /logs/verifier/ctrf.json /tests/test_outputs.py -rA
if [ $? -eq 0 ]; then echo 1; else echo 0; fi > /logs/verifier/reward.txt
Partial credit
# PASSED and TOTAL are integer counts computed earlier in test.sh
python3 -c "print($PASSED / $TOTAL if $TOTAL else 0.0)" > /logs/verifier/reward.txt
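When the pytest verifier has already produced ctrf.json, the pass ratio can be derived from that report instead of counting by hand. The sketch below assumes the report follows the usual CTRF layout, with `results.summary.tests` and `results.summary.passed`; verify this against the file your plugin version writes. `reward_from_ctrf` is a hypothetical helper name.

```python
import json


def reward_from_ctrf(ctrf_text: str) -> float:
    """Return passed / total from a CTRF report, or 0.0 if no tests ran."""
    # Assumed layout: {"results": {"summary": {"tests": N, "passed": M, ...}}}
    summary = json.loads(ctrf_text)["results"]["summary"]
    total = summary.get("tests", 0)
    passed = summary.get("passed", 0)
    return passed / total if total else 0.0


report = '{"results": {"summary": {"tests": 4, "passed": 3, "failed": 1}}}'
print(reward_from_ctrf(report))  # 0.75
```

The zero-test guard matters: a run that collects no tests should score 0.0 rather than crash before reward.txt is written.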
Security: don't let the agent write to /logs/verifier/reward.txt or modify /tests/test.sh. For tasks running arbitrary code, use allow_internet = false and verify output files only.
solution/ (optional)
Include when you want to verify the task is solvable or provide a reference implementation. When BenchFlow runs with -a oracle, it copies solution/ to /solution/ and runs solution/solve.sh instead of an ACP agent.
solve.sh has the same filesystem access as the agent — write only to /app/, not to /logs/verifier/.
#!/bin/bash
echo "Hello, world!" > /app/hello.txt
CLI
# Scaffold a new task
bench tasks init my-task
bench tasks init my-task --no-pytest --no-solution
# Validate structure
bench tasks check tasks/my-task/
# Confirm oracle gets reward = 1.0
bench eval create -t tasks/my-task/ -a oracle -e docker
# Run a real agent
bench eval create -t tasks/my-task/ -a gemini -e daytona
bench tasks check validates that task.toml, instruction.md (non-empty), environment/Dockerfile, and tests/ (non-empty) all exist, and that [agent].timeout_sec is set. Exits with code 1 on failure (CI-friendly).
Worked example — write-fizzbuzz
# task.toml
version = "1.0"
[metadata]
difficulty = "easy"
tags = ["python"]
[agent]
timeout_sec = 180
[verifier]
timeout_sec = 60
# instruction.md
Write a file `fizzbuzz.py` defining:
def fizzbuzz(n: int) -> str
Return "FizzBuzz" / "Fizz" / "Buzz" / str(n) for divisibility by 15 / 3 / 5 / none.
No __main__ block, no print statements.
# environment/Dockerfile
FROM ubuntu:24.04
RUN apt-get update -qq && apt-get install -y -qq python3 curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN mkdir -p /logs/verifier /logs/agent /logs/artifacts
# tests/test_outputs.py
import importlib.util
from pathlib import Path


def _load():
    # Import the agent's module straight from its file path.
    path = Path("/app/fizzbuzz.py")
    assert path.exists()
    spec = importlib.util.spec_from_file_location("fizzbuzz", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod.fizzbuzz


def test_fizz(): assert _load()(3) == "Fizz"
def test_buzz(): assert _load()(5) == "Buzz"
def test_fizzbuzz(): assert _load()(15) == "FizzBuzz"
def test_number(): assert _load()(7) == "7"
# solution/solve.sh
cat > /app/fizzbuzz.py << 'EOF'
def fizzbuzz(n: int) -> str:
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)
EOF