How we improved SkillsBench v1.1 scores by 69.4% using env0
What is env0
LLM agents are increasingly deployed to automate productivity tasks such as email triage, meeting scheduling, and document management, but evaluating them on live services is risky because their actions can cause irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We built env0 to address this gap. env0 inherits its mock environments from ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. The current version, v0.1, provides five high-fidelity mock services that replicate the real Google Workspace and Slack APIs with full state management and deterministic snapshot/restore.
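To make the snapshot/restore idea concrete, here is a minimal sketch of a stateful mock service whose state can be captured before an episode and rolled back afterwards. It is a toy illustration only: `MockMailService`, `archive`, and `run_episode` are hypothetical names invented for this example and are not env0's actual API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MockMailService:
    """Toy stand-in for a stateful mock service (hypothetical, not env0's API)."""

    inbox: list = field(default_factory=list)
    archived: list = field(default_factory=list)

    def snapshot(self) -> dict:
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy({"inbox": self.inbox, "archived": self.archived})

    def restore(self, snap: dict) -> None:
        # Deep copy again so the snapshot itself stays reusable.
        self.inbox = copy.deepcopy(snap["inbox"])
        self.archived = copy.deepcopy(snap["archived"])

    def archive(self, message_id: str) -> None:
        # A state-changing action an agent might take.
        msg = next(m for m in self.inbox if m["id"] == message_id)
        self.inbox.remove(msg)
        self.archived.append(msg)


def run_episode(service: MockMailService, message_ids: list) -> dict:
    """Apply actions, return the end state, then roll the service back."""
    start = service.snapshot()
    try:
        for message_id in message_ids:
            service.archive(message_id)
        return service.snapshot()
    finally:
        service.restore(start)


service = MockMailService(inbox=[{"id": "m1"}, {"id": "m2"}])
end_state = run_episode(service, ["m1"])
```

After `run_episode` returns, `end_state` holds what a verifier would inspect, while `service` is back in its starting state, so every task can begin from the same data.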
We run post-training experiments to demonstrate how env0 can be used to improve model performance.
Configuration
The entire experiment was completed on a single H100 80GB GPU.
The experiment uses the env0-mobile dataset — 2,003 verified productivity tasks, each paired with an executable oracle verifier. The dataset is available on request; contact us to access it.
As a proof-of-concept experiment, we use Qwen3.5-397B-A17B as the teacher model and Qwen3.5-9B as the student model. We randomly sampled 300 tasks from the env0-mobile dataset and ran Qwen3.5-397B-A17B with OpenHands to generate teacher trajectories. On these 300 tasks, the teacher reaches reward == 1 on 63 tasks, a pass rate of 21.0%. We keep the full set of 300 trajectories for SFT training, including those that did not reach reward == 1. We use BenchFlow as the eval harness; by default it writes trajectories in a results.jsonl format that is compatible with prime-rl. BenchFlow also natively supports prime-rl as a post-training framework in its CLI and SDK.
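The teacher pass rate can be recomputed from a trajectory file along the following lines. This is a sketch under assumptions: the `task_id` and `reward` field names are hypothetical placeholders for whatever results.jsonl actually contains, and the records below are synthetic, built only to match the 63-of-300 count reported above.

```python
import json


def pass_rate(jsonl_text: str) -> tuple:
    """Count records with reward == 1 in JSONL text (one JSON object per line)."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not records:
        raise ValueError("no trajectories found")
    # "reward" is an assumed field name, not a documented schema.
    passed = sum(1 for record in records if record["reward"] == 1)
    return passed, len(records), passed / len(records)


# Synthetic stand-in for results.jsonl: 63 passing records out of 300.
synthetic = "\n".join(
    json.dumps({"task_id": f"task-{i}", "reward": 1 if i < 63 else 0})
    for i in range(300)
)

passed, total, rate = pass_rate(synthetic)
print(f"{passed}/{total} = {rate:.1%}")  # prints "63/300 = 21.0%"
```

Because the full set is kept for SFT, this count is only a diagnostic here; it is not used to filter the training data.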
| Setting | Value |
|---|---|
| Student | Qwen3.5-9B (non-quantized, BF16) |
| Teacher (SFT-data source) | Qwen3.5-397B-A17B |
| Trainer | Bring-your-own SFT-data path, 1× H100 80GB |
| Renderer / chat template | Qwen3.5 chat template (chat_template.jinja) |
| Loss | SFT cross-entropy (next-token on assistant turns) |
| Dataset size | 300 env0-mobile teacher trajectories (prime-sft.jsonl) |
| Sequence length | 8192 |
| Epochs / steps | 1 epoch / 300 steps |
| Learning rate | 1e-4 |
| Batch size | 1 (× gradient accumulation 8) |
| LoRA rank | 32 |
| LoRA alpha / dropout | 64 / 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trained scope | text / language parameters only (no multimodal) |
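The "next-token on assistant turns" loss in the table means that only assistant tokens contribute to the cross-entropy. One common way to get this behavior is to set the labels of all other tokens to an ignore value. The sketch below assumes pre-tokenized turns and the -100 ignore index that PyTorch's `CrossEntropyLoss` skips by default; `build_labels` is a hypothetical helper and is not taken from the training code used in this experiment.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns):
    """Build (input_ids, labels) from a list of (role, token_ids) turns.

    Tokens from non-assistant turns stay in the input as context, but their
    labels are masked so they add nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token ids; a real pipeline would get them from the chat template.
turns = [
    ("user", [1, 2]),
    ("assistant", [3, 4, 5]),
    ("tool", [6]),
]
input_ids, labels = build_labels(turns)
```

Here `labels` is `[-100, -100, 3, 4, 5, -100]`: the user and tool tokens are visible to the model but excluded from the loss. The one-position shift for next-token prediction is assumed to happen inside the trainer, as it does in many causal-LM training stacks.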
Results
Training curves
Improvement on ClawsBench
After only 300 training steps, the SFT version of Qwen3.5-9B (open-sourced on Hugging Face: huggingface.co/benchflow/benchflow-qwen35-9b) shows a 31.58% relative improvement on a held-out set of 60 eval tasks (3 trials per task; the pass rate rises from 10.56% to 13.89%). For inference, we host Qwen3.5-9B on the Fireworks API both before and after SFT.
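The arithmetic behind these figures can be checked with a short script. With 60 tasks and 3 trials each there are 180 trials in total. The pass counts of 19 and 25 used below are not stated in this post; they are an assumption, chosen because they are the integers that reproduce the reported rates.

```python
TASKS = 60
TRIALS_PER_TASK = 3
total_trials = TASKS * TRIALS_PER_TASK  # 180

# Assumed pass counts (not given in the post) that match the reported rates.
base_passes = 19
sft_passes = 25

base_rate = base_passes / total_trials
sft_rate = sft_passes / total_trials

# Relative improvement compares the gain to the baseline rate.
relative_improvement = (sft_rate - base_rate) / base_rate

print(f"{base_rate:.2%} -> {sft_rate:.2%} ({relative_improvement:+.2%} relative)")
# prints "10.56% -> 13.89% (+31.58% relative)"
```

Under this assumption the absolute gain is six additional passing trials out of 180, which is worth keeping in mind when reading the relative figure.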
Improvement on SkillsBench
We also evaluate on SkillsBench (87 tasks, 8 domains) using the SFT version of Qwen3.5-9B with SGLang, OpenHands, and the with-skill setup. The same custom SFT snapshot lifts the pass rate from 7.01% to 11.88%, a +69.4% relative gain (1.69× baseline). These SkillsBench numbers are still provisional while the final trial completes. The full SkillsBench leaderboard is at skillsbench.ai/leaderboard.
The current results are still preliminary, and we are running more ablations to explore other experiment configurations.
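The relative gain and the "× baseline" multiplier are two views of the same ratio, as the sketch below shows. It uses the rounded pass rates quoted above as inputs, so the last digit of the result may differ slightly from the reported +69.4%, which was presumably computed from unrounded rates.

```python
def relative_gain(before: float, after: float) -> float:
    """Gain relative to the baseline, e.g. 0.5 means +50%."""
    return (after - before) / before


before_pct = 7.01   # baseline SkillsBench pass rate, in percent (rounded)
after_pct = 11.88   # SFT SkillsBench pass rate, in percent (rounded)

gain = relative_gain(before_pct, after_pct)
multiplier = after_pct / before_pct  # always equal to 1 + gain

print(f"relative gain: {gain:+.1%}, multiplier: {multiplier:.2f}x")
```

The multiplier rounds to 1.69, matching the figure in the text.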
Cost
SFT training is cheap: a single Lambda H100 80GB ran for 3.24 h at $3.29/h, for a total of $10.67. This figure excludes sandbox/runtime infrastructure and retries.
References
- AfterQuery. On-Policy Distillation on GDPval. AfterQuery blog, Jun. 2026. afterquery.com/blog/on-policy-distillation-gdpval
- AfterQuery. How We Improved Terminal-Bench 2.0 with Tinker and Harbor. AfterQuery blog, Mar. 2026. afterquery.com/blog/how-we-improved-terminal-bench-2-with-tinker-and-harbor
- Prime Intellect. General Agent. Prime Intellect blog. primeintellect.ai/blog/general-agent
- BenchFlow. BenchFlow. GitHub repository; the eval harness used for all rollouts in this experiment. github.com/benchflow-ai/benchflow
- BenchFlow. env0. GitHub repository; the simulated-workspace environments used as the post-training environment. github.com/benchflow-ai/env0
- Li, X., et al. ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces. arXiv:2604.05172, Apr. 2026. env0 inherits its mock environments from ClawsBench. arxiv.org/abs/2604.05172
- Prime Intellect. prime-rl. GitHub repository; post-training framework. BenchFlow emits results.jsonl trajectories compatible with it. github.com/PrimeIntellect-ai/prime-rl
- Raoof, N., et al. OpenThoughts-Agent: Data Recipes for Agentic Models. arXiv:2606.24855, Jun. 2026. arxiv.org/abs/2606.24855
- Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F. & Grover, A. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv:2601.18734, Jan. 2026. arxiv.org/abs/2601.18734
- Su, H., Sun, R., Yoon, J., Yin, P., Yu, T. & Arık, S. Ö. Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments. arXiv:2501.10893, Jan. 2025. arxiv.org/abs/2501.10893
Acknowledgments
We gratefully thank Lambda and Prime Intellect for sponsoring the H100 GPU compute that made this experiment possible.