Authoring tasks

How to write a verifiable BenchFlow task end-to-end.

A BenchFlow task packages an instruction, a sandboxed environment, and a verifier into a directory that BenchFlow runs and scores automatically.


Directory layout

my-task/
├── task.toml              # timeouts, resources, metadata
├── instruction.md         # what the agent must do
├── environment/
│   └── Dockerfile         # sandbox image
├── tests/
│   └── test.sh            # verifier entry point
└── solution/              # optional — reference/oracle solution
    └── solve.sh

tests/ may also include test_outputs.py (pytest module called by test.sh).


task.toml

version = "1.0"

[metadata]                   # optional, freeform
author_name = "alice"
difficulty  = "easy"         # easy / medium / hard
category    = "programming"
tags        = ["bash", "files"]

[agent]
timeout_sec = 300            # REQUIRED — seconds before agent is killed
# user = "agent"             # optional — run agent as this user/UID

[verifier]
timeout_sec = 120            # optional (default 600)

[environment]
cpus            = 1          # default 1
memory_mb       = 2048       # default 2048
storage_mb      = 10240      # default 10240
allow_internet  = false      # default true
env             = { OPENAI_API_KEY = "${OPENAI_API_KEY}" }  # host vars to inject
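
Values written as ${VAR} are read from the environment of the machine that launches the trial, so export them before invoking the CLI. A minimal sketch (the key value is a placeholder):

export OPENAI_API_KEY="sk-placeholder"   # read from the host and injected into the sandbox
bench eval create -t tasks/my-task/ -a gemini -e daytona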

Built-in mock services — if the Dockerfile references a service binary (claw-gmail, claw-slack, claw-gcal, claw-gdoc, claw-gdrive), BenchFlow starts it automatically. No [services] section needed.

Install tooling to shared prefixes, not /root — when a task image ships Node.js, Python tools, or agent binaries that the sandbox user must execute, install them to /usr/local/bin, /usr/local/lib, or /opt, not /root/.nvm or /root/.local/bin. setup_sandbox_user() creates the non-root user, prepares small config/auth dirs, and chowns the workspace — it does not clone /root into the sandbox home. Legacy images that already install tools under /root still work via a narrow symlink fallback, but shared prefixes are the supported path. Pre-creating the sandbox user in the Dockerfile is an optional speedup, not a requirement.
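
A sketch of Dockerfile RUN steps that follow this guidance; the Node version, download URL, and tool name are illustrative, and curl/xz-utils are assumed to be installed earlier in the image:

# Node.js unpacked into /usr/local, so node and npm land in /usr/local/bin
RUN curl -fsSL https://nodejs.org/dist/v22.11.0/node-v22.11.0-linux-x64.tar.xz \
    | tar -xJ -C /usr/local --strip-components=1
# npm -g follows the prefix (/usr/local), not /root; some-agent-cli is hypothetical
RUN npm install -g some-agent-cli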


instruction.md

The first prompt sent to the agent. Write it as you would for a skilled developer:

  • State the precise goal in the first sentence.
  • Name exact files or paths the agent must create or modify.
  • Specify constraints (no external libraries, must pass existing tests, etc.).
  • Don't mention the verifier or reward.txt — those are internal.

Multi-turn prompts — use a Scene with multiple Turns. A None prompt means "use instruction.md":

from benchflow.trial import TrialConfig, Scene, Role, Turn

config = TrialConfig(
    task_path="tasks/my-task",
    scenes=[Scene(
        roles=[Role("agent", "gemini", "gemini-3.1-flash-lite-preview")],
        turns=[
            Turn("agent"),                                        # instruction.md
            Turn("agent", "Review your solution and fix any test failures."),
        ],
    )],
    backend="daytona",
)
result = await bf.run(config)

Verifier contract (tests/test.sh)

After the agent finishes, Harbor copies tests/ to /tests/ and runs /tests/test.sh. The working directory is the Dockerfile's WORKDIR (typically /app/ in the example Dockerfile below).

Your script must write a single float (0.0–1.0) to /logs/verifier/reward.txt.

Path               Contents
/app/              Agent's working directory
/tests/            Your tests/ directory
/solution/         solution/ (oracle runs only)
/logs/verifier/    Write reward.txt (and optionally ctrf.json) here

Pure bash verifier

#!/bin/bash
REWARD=0
if [ -f /app/hello.txt ] && [ "$(tr -d '\n' < /app/hello.txt)" = "Hello, world!" ]; then
    REWARD=1
fi
echo "$REWARD" > /logs/verifier/reward.txt

pytest verifier

#!/bin/bash
curl -LsSf https://astral.sh/uv/0.9.7/install.sh | sh
source $HOME/.local/bin/env

uvx \
  --with pytest==8.4.1 \
  --with pytest-json-ctrf==0.3.5 \
  pytest --ctrf /logs/verifier/ctrf.json /tests/test_outputs.py -rA

if [ $? -eq 0 ]; then echo 1; else echo 0; fi > /logs/verifier/reward.txt

Partial credit

python3 -c "print($PASSED / $TOTAL)" > /logs/verifier/reward.txt
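
If you use the pytest + CTRF verifier above, PASSED and TOTAL can be derived from ctrf.json. A sketch, assuming the standard CTRF summary layout (results.summary.passed and results.summary.tests):

# Pull pass/total counts out of the CTRF report written by pytest
read PASSED TOTAL < <(python3 -c '
import json
summary = json.load(open("/logs/verifier/ctrf.json"))["results"]["summary"]
print(summary["passed"], summary["tests"])
')
# max(..., 1) guards against a run that collected zero tests
python3 -c "print($PASSED / max($TOTAL, 1))" > /logs/verifier/reward.txt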

Security: don't let the agent write to /logs/verifier/reward.txt or modify /tests/test.sh. For tasks running arbitrary code, use allow_internet = false and verify output files only.
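
One small defensive habit is to clear any pre-existing reward file at the top of test.sh, so a value planted by the agent cannot survive an early verifier crash (a sketch):

rm -f /logs/verifier/reward.txt   # start clean; only the verifier's own write should count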


solution/ (optional)

Include when you want to verify the task is solvable or provide a reference implementation. When BenchFlow runs with -a oracle, it copies solution/ to /solution/ and runs solution/solve.sh instead of an ACP agent.

solve.sh has the same filesystem access as the agent — write only to /app/, not to /logs/verifier/.

#!/bin/bash
echo "Hello, world!" > /app/hello.txt

CLI

# Scaffold a new task
bench tasks init my-task
bench tasks init my-task --no-pytest --no-solution

# Validate structure
bench tasks check tasks/my-task/

# Confirm oracle gets reward = 1.0
bench eval create -t tasks/my-task/ -a oracle -e docker

# Run a real agent
bench eval create -t tasks/my-task/ -a gemini -e daytona

bench tasks check validates that task.toml, instruction.md (non-empty), environment/Dockerfile, and tests/ (non-empty) all exist, and that [agent].timeout_sec is set. Exits with code 1 on failure (CI-friendly).
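
A possible CI step that gates merges on task validity (the tasks/ glob is illustrative):

#!/bin/bash
set -e
for task in tasks/*/; do
    bench tasks check "$task"   # non-zero exit fails the job at the first malformed task
done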


Worked example — write-fizzbuzz

# task.toml
version = "1.0"
[metadata]
difficulty = "easy"
tags = ["python"]
[agent]
timeout_sec = 180
[verifier]
timeout_sec = 60
# instruction.md
Write a file `fizzbuzz.py` defining:

    def fizzbuzz(n: int) -> str

Return "FizzBuzz" / "Fizz" / "Buzz" / str(n) for divisibility by 15 / 3 / 5 / none.
No __main__ block, no print statements.
# environment/Dockerfile
FROM ubuntu:24.04
RUN apt-get update -qq && apt-get install -y -qq python3 curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN mkdir -p /logs/verifier /logs/agent /logs/artifacts
# tests/test_outputs.py
import importlib.util
from pathlib import Path

def _load():
    path = Path("/app/fizzbuzz.py")
    assert path.exists()
    spec = importlib.util.spec_from_file_location("fizzbuzz", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod.fizzbuzz

def test_fizz():    assert _load()(3) == "Fizz"
def test_buzz():    assert _load()(5) == "Buzz"
def test_fizzbuzz():assert _load()(15) == "FizzBuzz"
def test_number():  assert _load()(7) == "7"
# solution/solve.sh
#!/bin/bash
cat > /app/fizzbuzz.py << 'EOF'
def fizzbuzz(n: int) -> str:
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0:  return "Fizz"
    if n % 5 == 0:  return "Buzz"
    return str(n)
EOF