Blog
Updates & case studies
Deep dives into our projects, benchmarks, and experiments in frontier agent evaluation.

SkillsBench
The first evaluation framework measuring how skills (custom instructions) work for AI agents, and the first dataset measuring how well models use them: 84 expert-curated tasks across diverse, high-GDP-value domains.

PokemonGym
The first open-source harness that lets any LLM or agent play Pokemon Red and Blue. Tests vision, reasoning, planning, memory, and sequential decision-making. Featured in the Gemini model launch.

BenchFlow Hub & Runtime
The first protocol for unifying agents and benchmarks. A HuggingFace for benchmarks and RL environments, with one-line setup for 60+ benchmarks spanning NLP, web agents, code, medical AI, and more.
Experiments
Vibe Code Arena
Active
Arena-style evaluation for judging which vibe coding platform produces the most favorable output.
April 2025
Harbor DataGen
Active
Synthetic data generation for terminal-based agent training. Powered by TerminalGym.
2025
MiniClaw
Beta
Claude Code in your iMessage. An AI coding assistant in your pocket.
2025