Update

How we improved SkillsBench v1.1 scores by 69.4% using env0

Bingran YouBingran You

What is env0

LLM agents are increasingly deployed to automate productivity tasks — email triage, meeting scheduling, document management — but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. So we made env0 to address this. env0 inherits from the mock environments in ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. The current v0.1 version has five high-fidelity mock services that replicate real Google Workspace and Slack APIs with full state management and deterministic snapshot/restore.

We run post-train experiments to demonstrate how env0 can be used to improve model performance.

Configuration

env0 → SFT distillation: sample 300 tasks from the env0-mobile dataset, then Qwen3.5-397B-A17B teacher rollouts produce 300 SFT trajectories to fine-tune Qwen3.5-9B.

The entire experiment was completed on a single H100 80GB GPU.

The experiment uses the env0-mobile dataset — 2,003 verified productivity tasks, each paired with an executable oracle verifier. The dataset is available on request; contact us to access it.

As proof-of-concept experiment, we use Qwen3.5-397B-A17B as teacher model and Qwen3.5-9B as student model. We randomly picked 300 tasks from the env0-mobile dataset and use Qwen3.5-397B-A17B + OpenHands as teacher model to generate teacher model trajectories. In these 300 tasks, Qwen3.5-397B-A17B gets reward == 1 in 63 tasks, pass rate = 21.0%. We keep the full 300 trajectories set for SFT training. We use BenchFlow as eval harness, which can by default generate results.jsonl formatted trajectories that is compatible with prime-rl. BenchFlow also natively supports using prime-rl in BenchFlow CLI and SDK as post-train framework.

SettingValue
StudentQwen3.5-9B (non-quantized, BF16)
Teacher (SFT-data source)Qwen3.5-397B-A17B
TrainerBring-your-own SFT-data path, 1× H100 80GB
Renderer / chat templateQwen3.5 chat template (chat_template.jinja)
LossSFT cross-entropy (next-token on assistant turns)
Dataset size300 env0-mobile teacher trajectories (prime-sft.jsonl)
Sequence length8192
Epochs / steps1 epoch / 300 steps
Learning rate1e-4
Batch size1 (× gradient accumulation 8)
LoRA rank32
LoRA alpha / dropout64 / 0.05
Target modulesq_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trained scopetext / language parameters only (no multimodal)

Results

Training curves

Train loss
01234567050100150200250300StepLoss
Eval loss
012345678050100150200250300StepLoss
GPU utilization
020406080100050100150200250300StepPercent
GPU memory used
0102030405060050100150200250300StepGiB
SFT training metrics — Qwen3.5-9B · BF16 LoRA · 8k context · H100 · 300 steps.

Improvement on ClawsBench

After only 300 training steps, the SFT version of Qwen3.5-9B (open source on HF: huggingface.co/benchflow/benchflow-qwen35-9b) shows 31.58% performance improvement in 60 held-out eval tasks set (trial = 3, pass rate goes from 10.56% to 13.89%). For model inference we use Fireworks API to host Qwen3.5-9B both before and after SFT.

015304560Pass rate (%)10.56%13.89%43.33%51.67%60 held-out eval tasksstrict pass: reward ≥ 1.0+31.58%Qwen3.5-9BQwen3.5-9B SFT (300 steps)Qwen3.5-397B-A17B (teacher)GPT-5.4-minienv0 Standard60 strict pass rateAll models use OpenHands as the agent harness
Pass rate on the 60-task held-out eval, with ±1 standard-error bars (binomial). Qwen3.5-9B bars use n = 180 = 60 tasks × 3 trials; the Qwen3.5-397B-A17B teacher and GPT-5.4-mini references use n = 60 each from one standard60 run.

Improvement on SkillsBench

We also evaluate on SkillsBench (87 tasks, 8 domains) using the SFT version Qwen3.5-9B + SGLang + OpenHands + with-skill setup, where the same custom SFT snapshot lifts the pass rate from 7.01% to 11.88% — a +69.4% relative gain (1.69× baseline). These SkillsBench numbers are still provisional as the final trial completes. The full SkillsBench leaderboard is at skillsbench.ai/leaderboard.

051015Pass rate (%)7.01%11.88%87 SkillsBench taskswith-skill setting+69.4%Qwen3.5-9BQwen3.5-9B SFTSkillsBench pass rateQwen3.5-9B via SGLang + OpenHands, with skill
SkillsBench pass rate for Qwen3.5-9B before and after custom SFT, with ±1 standard-error bars (binomial). Baseline n = 870 = 87 tasks × 10 trials; SFT snapshot n = 261 ≈ 87 × 3 trials (provisional — final trial still completing).

The current results are still preliminary and we are running more ablation to explore other experiment configurations.

Cost

SFT training is cheap: a single Lambda H100 80GB ran 3.24 h at $3.29/h, $10.67 total. Excludes sandbox/runtime infra and retries.

References

  1. AfterQuery. On-Policy Distillation on GDPval. AfterQuery blog, Jun. 2026. afterquery.com/blog/on-policy-distillation-gdpval
  2. AfterQuery. How We Improved Terminal-Bench 2.0 with Tinker and Harbor. AfterQuery blog, Mar. 2026. afterquery.com/blog/how-we-improved-terminal-bench-2-with-tinker-and-harbor
  3. Prime Intellect. General Agent. Prime Intellect blog. primeintellect.ai/blog/general-agent
  4. BenchFlow. BenchFlow. GitHub repository; the eval harness used for all rollouts in this experiment. github.com/benchflow-ai/benchflow
  5. BenchFlow. env0. GitHub repository; the simulated-workspace environments used as the post-training environment. github.com/benchflow-ai/env0
  6. Li, X., et al. ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces. arXiv:2604.05172, Apr. 2026. env0 inherits its mock environments from ClawsBench. arxiv.org/abs/2604.05172
  7. Prime Intellect. prime-rl. GitHub repository; post-training framework — BenchFlow emits results.jsonl trajectories compatible with it. github.com/PrimeIntellect-ai/prime-rl
  8. Raoof, N., et al. OpenThoughts-Agent: Data Recipes for Agentic Models. arXiv:2606.24855, Jun. 2026. arxiv.org/abs/2606.24855
  9. Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F. & Grover, A. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv:2601.18734, Jan. 2026. arxiv.org/abs/2601.18734
  10. Su, H., Sun, R., Yoon, J., Yin, P., Yu, T. & Arık, S. Ö. Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments. arXiv:2501.10893, Jan. 2025. arxiv.org/abs/2501.10893

Acknowledgments

We gratefully thank Lambda and Prime Intellect for sponsoring the H100 GPU compute that made this experiment possible.

← All news