Gaming · Vision · Planning

PokemonGym

Vision & Reasoning Benchmark

March 2025

Stars: 88

Models: 5+

Featured: Gemini

PokemonGym is the first open-source evaluation harness that enables any LLM or AI agent to play Pokemon Red and Blue. It tests a comprehensive range of cognitive capabilities: vision understanding, strategic reasoning, long-horizon planning, memory management, and sequential decision-making.

The project was featured in Google's Gemini model launch and has 88+ GitHub stars. It supports 5+ frontier models and provides a standardized environment for comparing agents in a complex, interactive game.

Unlike traditional benchmarks that test capabilities in isolation, PokemonGym requires agents to combine several skills at once: reading the game screen, remembering past events, planning multi-step strategies, and executing precise actions over extended play sessions.
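To make that combination of skills concrete, here is a minimal sketch of an observe, remember, decide, act loop. It is not PokemonGym's actual API: `ToyEnv`, `ScriptedAgent`, `run_episode`, and the button names are hypothetical stand-ins, and a text description takes the place of a real game screenshot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical button set; the real harness may name its actions differently.
VALID_BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")


@dataclass
class ToyEnv:
    """Stand-in for an emulator: the player walks along a one-row corridor."""
    goal: int = 3
    position: int = 0

    def observe(self) -> str:
        # A real harness would return a screenshot; text stands in for it here.
        return f"player at tile {self.position}, exit at tile {self.goal}"

    def step(self, button: str) -> bool:
        """Apply one button press and report whether the exit was reached."""
        if button not in VALID_BUTTONS:
            raise ValueError(f"unknown button: {button}")
        if button == "right":
            self.position += 1
        elif button == "left":
            self.position = max(0, self.position - 1)
        return self.position >= self.goal


@dataclass
class ScriptedAgent:
    """Placeholder for an LLM agent: stores observations, always walks right."""
    memory: List[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        self.memory.append(observation)  # memory of past events
        return "right"                   # an LLM call would decide here


def run_episode(env: ToyEnv, agent: ScriptedAgent, max_steps: int = 10) -> Tuple[bool, int]:
    """Run the loop until the goal is reached or the step budget runs out."""
    for steps_taken in range(1, max_steps + 1):
        observation = env.observe()
        button = agent.act(observation)
        if env.step(button):
            return True, steps_taken
    return False, max_steps


if __name__ == "__main__":
    reached, steps = run_episode(ToyEnv(goal=3), ScriptedAgent())
    print(reached, steps)
```

In a real evaluation, the scripted `act` method would be replaced by a model call that receives the screen and the accumulated memory, and the step budget would be far larger to exercise long-horizon planning.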
