Jun 28, 2026 · Update

How we improved SkillsBench v1.1 scores by 69.4% using env0

What is env0

LLM agents are increasingly deployed to automate productivity tasks such as email triage, meeting scheduling, and document management, but evaluating them on live services is risky because their actions can cause irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We built env0 to close this gap. env0 inherits its mock environments from ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. The current version, v0.1, provides five high-fidelity mock services that replicate the real Google Workspace and Slack APIs, with full state management and deterministic snapshot/restore.

We run post-training experiments to demonstrate how env0 can be used to improve model performance.

Configuration

env0 → SFT distillation: we sample 300 tasks from the env0-mobile dataset, roll out the Qwen3.5-397B-A17B teacher on them to produce 300 SFT trajectories, and fine-tune Qwen3.5-9B on those trajectories.

SFT training ran on a single H100 80GB GPU.

The experiment uses the env0-mobile dataset — 2,003 verified productivity tasks, each paired with an executable oracle verifier. The dataset is available on request; contact us to access it.

As a proof-of-concept experiment, we use Qwen3.5-397B-A17B as the teacher model and Qwen3.5-9B as the student model. We randomly sampled 300 tasks from the env0-mobile dataset and ran Qwen3.5-397B-A17B with OpenHands to generate the teacher trajectories. The teacher reaches reward == 1 on 63 of these 300 tasks, a pass rate of 21.0%. We keep all 300 trajectories for SFT training, including those that did not reach reward == 1. We use BenchFlow as the eval harness; by default it writes trajectories in a results.jsonl format that is compatible with prime-rl. BenchFlow also natively supports prime-rl as a post-training framework in its CLI and SDK.
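The sampling and pass-rate bookkeeping described above can be sketched in a few lines. This is only an illustration: the task ids, the `task_id` and `reward` field names, and the helper functions are hypothetical stand-ins, not the actual env0-mobile or results.jsonl schema, and the rollout records are synthetic ones built to match the 63-of-300 count quoted in the text.

```python
import json
import random


def sample_tasks(task_ids, k, seed=0):
    """Draw k distinct task ids uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(task_ids), k)


def pass_rate(jsonl_lines, reward_key="reward"):
    """Count trajectories whose reward equals 1 and return (passed, total, rate)."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    passed = sum(1 for r in records if r.get(reward_key) == 1)
    return passed, len(records), passed / len(records)


# Hypothetical ids standing in for the 2,003 env0-mobile tasks.
all_tasks = [f"task-{i:04d}" for i in range(2003)]
picked = sample_tasks(all_tasks, k=300, seed=0)

# Synthetic rollout records: 63 successes out of 300, as reported in the text.
lines = [
    json.dumps({"task_id": t, "reward": 1 if i < 63 else 0})
    for i, t in enumerate(picked)
]

passed, total, rate = pass_rate(lines)
print(f"{passed}/{total} = {rate:.1%}")  # 63/300 = 21.0%

# All trajectories are kept for SFT, regardless of reward.
sft_records = [json.loads(line) for line in lines]
```

Keeping every trajectory, rather than filtering on reward == 1, mirrors the choice stated in the text; a reward filter would be a one-line change to the final list.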

Setting	Value
Student	`Qwen3.5-9B` (non-quantized, BF16)
Teacher (SFT-data source)	`Qwen3.5-397B-A17B`
Trainer	Bring-your-own SFT-data path, 1× H100 80GB
Renderer / chat template	Qwen3.5 chat template (`chat_template.jinja`)
Loss	SFT cross-entropy (next-token on assistant turns)
Dataset size	300 env0-mobile teacher trajectories (`prime-sft.jsonl`)
Sequence length	`8192`
Epochs / steps	1 epoch / 300 steps
Learning rate	`1e-4`
Batch size	`1` (× gradient accumulation 8)
LoRA rank	`32`
LoRA alpha / dropout	`64` / `0.05`
Target modules	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
Trained scope	text / language parameters only (no multimodal)
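For readers who want the settings in machine-readable form, the table can be mirrored as a small config object. This is a sketch, not the training script used here: the class name `SftRunConfig` and its field names are hypothetical, and the two derived values assume the usual conventions (effective batch size is micro-batch size times gradient-accumulation steps, and the LoRA update is scaled by alpha divided by rank).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftRunConfig:
    """Hypothetical container mirroring the settings table above."""
    student: str = "Qwen3.5-9B"
    teacher: str = "Qwen3.5-397B-A17B"
    seq_len: int = 8192
    learning_rate: float = 1e-4
    micro_batch_size: int = 1
    grad_accum_steps: int = 8
    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def effective_batch_size(self) -> int:
        # Sequences contributing to each optimizer update.
        return self.micro_batch_size * self.grad_accum_steps

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor applied to the low-rank update.
        return self.lora_alpha / self.lora_rank


cfg = SftRunConfig()
print(cfg.effective_batch_size)  # 8
print(cfg.lora_scaling)          # 2.0
```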

Results

Training curves

[Figure panels: train loss, eval loss, GPU utilization, GPU memory used]

SFT training metrics — Qwen3.5-9B · BF16 LoRA · 8k context · H100 · 300 steps.

Improvement on ClawsBench

After only 300 training steps, the SFT version of Qwen3.5-9B (open-sourced on Hugging Face: huggingface.co/benchflow/benchflow-qwen35-9b) improves the pass rate on a held-out set of 60 eval tasks from 10.56% to 13.89% over 3 trials per task, a 31.58% relative improvement. For inference, we use the Fireworks API to host Qwen3.5-9B both before and after SFT.

Pass rate on the 60-task held-out eval, with ±1 standard-error bars (binomial). Qwen3.5-9B bars use n = 180 = 60 tasks × 3 trials; the Qwen3.5-397B-A17B teacher and GPT-5.4-mini references use n = 60 each from one standard60 run.
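The relative improvement and the ±1 standard-error bars in the caption above can be reproduced with a short calculation. The pass counts used here (19 and 25 out of 180) are not stated in the post; they are inferred as the only integer counts that round to 10.56% and 13.89% at n = 180, so treat them as an assumption.

```python
import math


def binomial_se(passes: int, n: int) -> float:
    """Standard error of a binomial proportion: sqrt(p * (1 - p) / n)."""
    p = passes / n
    return math.sqrt(p * (1 - p) / n)


def relative_gain(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before


n = 60 * 3                        # 60 held-out tasks x 3 trials
base_passes, sft_passes = 19, 25  # inferred from 10.56% and 13.89% (assumption)

base_rate = base_passes / n
sft_rate = sft_passes / n
gain = relative_gain(base_rate, sft_rate)

print(f"baseline: {base_rate:.2%} +/- {binomial_se(base_passes, n):.2%}")
print(f"SFT:      {sft_rate:.2%} +/- {binomial_se(sft_passes, n):.2%}")
print(f"relative gain: {gain:.2%}")
```

Under these inferred counts the relative gain comes out at 31.58%, matching the figure quoted in the text.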

Improvement on SkillsBench

We also evaluate on SkillsBench (87 tasks, 8 domains), running Qwen3.5-9B with SGLang and OpenHands in the with-skill setup. There the same custom SFT snapshot lifts the pass rate from 7.01% to 11.88%, a +69.4% relative gain (1.69× baseline). These SkillsBench numbers are provisional while the final trial completes. The full SkillsBench leaderboard is at skillsbench.ai/leaderboard.

SkillsBench pass rate for Qwen3.5-9B before and after custom SFT, with ±1 standard-error bars (binomial). Baseline n = 870 = 87 tasks × 10 trials; SFT snapshot n = 261 ≈ 87 × 3 trials (provisional — final trial still completing).
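As a worked check of the SkillsBench gain, the quoted percentages are consistent with 61 passes out of 870 baseline runs and 31 passes out of 261 SFT runs. These counts are inferred from the reported rates and sample sizes, not stated in the post.

```latex
\[
p_{\text{base}} = \frac{61}{870} \approx 7.01\%, \qquad
p_{\text{SFT}} = \frac{31}{261} \approx 11.88\%
\]
\[
\frac{p_{\text{SFT}}}{p_{\text{base}}} = \frac{31/261}{61/870} \approx 1.694, \qquad
\frac{p_{\text{SFT}} - p_{\text{base}}}{p_{\text{base}}} \approx +69.4\%
\]
```

Dividing the rounded percentages instead, (11.88 - 7.01) / 7.01, gives roughly 69.5%, so the 69.4% figure presumably comes from the unrounded rates.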

The current results are preliminary, and we are running more ablations to explore other experiment configurations.

Cost

SFT training is cheap: a single Lambda H100 80GB ran for 3.24 h at $3.29/h, for a total of $10.67. This figure excludes sandbox/runtime infrastructure and retries.

References

AfterQuery. On-Policy Distillation on GDPval. AfterQuery blog, Jun. 2026. afterquery.com/blog/on-policy-distillation-gdpval
AfterQuery. How We Improved Terminal-Bench 2.0 with Tinker and Harbor. AfterQuery blog, Mar. 2026. afterquery.com/blog/how-we-improved-terminal-bench-2-with-tinker-and-harbor
Prime Intellect. General Agent. Prime Intellect blog. primeintellect.ai/blog/general-agent
BenchFlow. BenchFlow. GitHub repository; the eval harness used for all rollouts in this experiment. github.com/benchflow-ai/benchflow
BenchFlow. env0. GitHub repository; the simulated-workspace environments used as the post-training environment. github.com/benchflow-ai/env0
Li, X., et al. ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces. arXiv:2604.05172, Apr. 2026. env0 inherits its mock environments from ClawsBench. arxiv.org/abs/2604.05172
Prime Intellect. prime-rl. GitHub repository; post-training framework — BenchFlow emits results.jsonl trajectories compatible with it. github.com/PrimeIntellect-ai/prime-rl
Raoof, N., et al. OpenThoughts-Agent: Data Recipes for Agentic Models. arXiv:2606.24855, Jun. 2026. arxiv.org/abs/2606.24855
Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F. & Grover, A. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv:2601.18734, Jan. 2026. arxiv.org/abs/2601.18734
Su, H., Sun, R., Yoon, J., Yin, P., Yu, T. & Arık, S. Ö. Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments. arXiv:2501.10893, Jan. 2025. arxiv.org/abs/2501.10893

Acknowledgments

We gratefully thank Lambda and Prime Intellect for sponsoring the H100 GPU compute that made this experiment possible.
