Benchmark Hub

Featured Benchmarks

  • VibeCode Arena
    BenchFlow
    code
  • Pokemon Gym
    BenchFlow
    reasoning
  • JFK Arena
    BenchFlow
    retrieval
  • PaperBench
    OpenAI
    agent
  • WebArena
    Carnegie Mellon University
    agent
  • SWE-Bench
    Princeton NLP
    code
  • RareBench
    chenxz1111
    knowledge
  • Bird-SQL
    AlibabaResearch
    code
  • MedQA-CS
    Bio-NLP
    knowledge
  • WebCanvas
    iMeanAI
    agent
  • MMLU-Pro
    TIGER-AI-Lab
    knowledge

All Benchmarks

  • agent
  • code
  • commonsense
  • embedding
  • general
  • knowledge
  • language
  • long-context
  • multimodal
  • performance
  • reasoning
  • retrieval
  • safety
  • tool-calling
  • vision

63 benchmarks
  • alhridoy/test
    Updated 19 days ago
    0
  • BenchFlow/simple-qa
    Updated 20 days ago
    1
  • BenchFlow/TaskBench
    agent
    tool-calling
    multimodal
    Updated a month ago
    0
  • BenchFlow/EQBench
    reasoning
    language
    commonsense
    ...
    Updated a month ago
    0
  • BenchFlow/TauBench
    agent
    tool-calling
    reasoning
    ...
    Updated a month ago
    0
  • BenchFlow/AIME2024
    knowledge
    performance
    reasoning
    Updated a month ago
    0
  • BenchFlow/OSWorld
    agent
    tool-calling
    multimodal
    Updated a month ago
    0
  • BenchFlow/BIGBenchHard
    commonsense
    reasoning
    knowledge
    Updated a month ago
    0
  • BenchFlow/MGSM
    reasoning
    language
    commonsense
    ...
    Updated a month ago
    0
  • BenchFlow/GSM8K
    knowledge
    commonsense
    language
    Updated a month ago
    0
  • BenchFlow/WMDP
    safety
    performance
    knowledge
    ...
    Updated a month ago
    0
  • BenchFlow/SecQA
    knowledge
    performance
    reasoning
    ...
    Updated a month ago
    1
  • BenchFlow/Mind2Web
    agent
    tool-calling
    Updated a month ago
    0
  • BenchFlow/AssistantBench
    agent
    tool-calling
    reasoning
    Updated a month ago
    0
  • BenchFlow/MBPP
    code
    Updated a month ago
    0
  • BenchFlow/DS-1000
    code
    Updated a month ago
    0
  • BenchFlow/APPS
    code
    Updated a month ago
    0
  • BenchFlow/HELMET
    long-context
    Updated a month ago
    0
  • BenchFlow/Loft
    long-context
    Updated a month ago
    0
  • BenchFlow/BabiLong
    long-context
    Updated a month ago
    0
  • BenchFlow/InfiniteBench
    long-context
    Updated a month ago
    0
  • BenchFlow/MMGenBench
    vision
    multimodal
    reasoning
    Updated a month ago
    0
  • BenchFlow/StableToolBench
    tool-calling
    agent
    Updated a month ago
    0
  • BenchFlow/Router-Bench
    agent
    Updated a month ago
    0
  • BenchFlow/Nexus-Bench
    agent
    tool-calling
    Updated a month ago
    0
  • BenchFlow/Hotpotqa
    reasoning
    language
    Updated a month ago
    0
  • BenchFlow/MMOCR
    vision
    Updated a month ago
    0
  • BenchFlow/Beir
    retrieval
    Updated a month ago
    0
  • BenchFlow/CodeXGLUE
    code
    performance
    Updated a month ago
    0
  • BenchFlow/BigBench
    general
    Updated a month ago
    0
  • BenchFlow/Alexarena
    agent
    multimodal
    Updated a month ago
    0
  • BenchFlow/MEGABench
    multimodal
    performance
    Updated a month ago
    0
  • BenchFlow/MobileAIBench
    performance
    code
    Updated a month ago
    0
  • BenchFlow/Spec-Bench
    tool-calling
    performance
    language
    Updated a month ago
    0
  • BenchFlow/TruthfulQA
    safety
    Updated a month ago
    0
  • BenchFlow/SuperGLUE
    language
    reasoning
    Updated a month ago
    0
  • BenchFlow/MMLU
    reasoning
    knowledge
    Updated a month ago
    0
  • BenchFlow/HumanEval
    code
    Updated a month ago
    0
  • BenchFlow/HellaSwag
    reasoning
    commonsense
    Updated a month ago
    0
  • BenchFlow/HELM
    performance
    reasoning
    safety
    Updated a month ago
    0
  • BenchFlow/LegalBench
    reasoning
    knowledge
    Updated a month ago
    0
  • BenchFlow/Agentbench
    agent
    reasoning
    Updated a month ago
    0
  • BenchFlow/SWE-bench-Multimodal
    agent
    code
    tool-calling
    Updated a month ago
    0
  • BenchFlow/MLE-bench
    agent
    code
    knowledge
    Updated a month ago
    0
  • Lilaoba/test
    Updated a month ago
    0
  • Allen/test
    Updated a month ago
    0
  • BenchFlow/PokemonGym
    agent
    tool-calling
    vision
    ...
    Updated 2 months ago
    0
  • Davide221/test
    Updated 2 months ago
    0
  • abderrahmane-br/humaneval
    Updated 2 months ago
    1
  • xiangyi-li/BIRD-critiq
    Updated 2 months ago
    0
  • xiangyi-li/OS-World
    Updated 2 months ago
    0
  • xiangyi-li/rare
    Updated 2 months ago
    0
  • holmansneyderc/automation
    Updated 2 months ago
    0
  • BenchFlow/rarebench
    knowledge
    general
    Updated 2 months ago
    0
  • BenchFlow/rare
    Updated 2 months ago
    0
  • xiangyi-li/rarebench
    Updated 2 months ago
    0
  • BenchFlow/medqa-cs
    knowledge
    general
    reasoning
    Updated 2 months ago
    0
  • BenchFlow/Swebench
    agent
    code
    Updated 2 months ago
    0
  • BenchFlow/MMLU-PRO
    general
    knowledge
    language
    Updated 2 months ago
    0
  • BenchFlow/Bird
    tool-calling
    code
    agent
    Updated 2 months ago
    0
  • BenchFlow/webcanvas
    agent
    tool-calling
    vision
    Updated 2 months ago
    0
  • BenchFlow/webarena
    agent
    tool-calling
    vision
    Updated 2 months ago
    0
  • xiangyi-li/webarena
    Updated 2 months ago
    0