
Benchmark Runtime

Run out-of-the-box evals and benchmarks in the cloud. Save weeks of setup and development by running evals on our platform.

from benchflow import load_benchmark, BaseAgent

# Load a hosted benchmark by name (here, CMU's WebArena).
bench = load_benchmark(benchmark_name="cmu/webarena")

class YourAgent(BaseAgent):
    # Implement your agent's decision logic here.
    pass

your_agent = YourAgent()

# Run the selected tasks against your agent in the cloud.
run_id = bench.run(
    task_id=[1, 2, 3],
    agents=your_agent
)

# Retrieve the results for this run.
result = bench.get_result(run_id)

Backed By

Jeff Dean
Chief Scientist, Google
Arash Ferdowsi
Founder/CTO of Dropbox
+ more
$1M+ raised

Use Cases

Largest library of benchmarks

Utilize the largest library of benchmarks for comprehensive evaluations.

Featuring benchmarks from teams like OpenAI, GitHub, Anthropic, X, and Google.
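As a small illustration of how that library is reached from code, the sketch below loads a few benchmarks by name, reusing load_benchmark from the example above. Only "cmu/webarena" appears on this page; the other identifiers are assumed placeholders, so check the catalog for the exact names.

from benchflow import load_benchmark

# "cmu/webarena" is taken from the example above; the other two
# identifiers are hypothetical placeholders for catalog entries.
webarena = load_benchmark(benchmark_name="cmu/webarena")
swe_bench = load_benchmark(benchmark_name="princeton-nlp/swe-bench")  # assumed name
mle_bench = load_benchmark(benchmark_name="openai/mle-bench")         # assumed name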

Extend existing benchmarks

Easily extend and customize existing benchmarks to fit your specific needs.

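Here is a minimal sketch of one way such customization can look, reusing only the API from the example above (load_benchmark, BaseAgent, bench.run, bench.get_result); the solve() hook and the retry logic are illustrative assumptions, not documented parts of the SDK.

from benchflow import load_benchmark, BaseAgent

bench = load_benchmark(benchmark_name="cmu/webarena")

def call_model(task_input):
    # Placeholder for your own LLM or tool call; returns the agent's action as text.
    return "noop"

class RetryingAgent(BaseAgent):
    # Hypothetical customization: retry the model call a few times before giving up.
    def solve(self, task_input):  # hook name is an assumption, not a documented method
        for _ in range(3):
            answer = call_model(task_input)
            if answer:
                return answer
        return ""

# Run a hand-picked subset of tasks with the customized agent.
run_id = bench.run(task_id=[1, 5, 8], agents=RetryingAgent())
result = bench.get_result(run_id)

The point is the shape: subclass the provided agent base class, change only the behavior you care about, and run the same tasks against it.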

Create your own evals

Design and implement your own system evaluations with flexibility and ease.

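For evals you design from scratch, the shape is simple even before it is wired into a runtime: a set of cases and a scoring function. The sketch below is deliberately framework-agnostic; it does not use BenchFlow's authoring API, which is not shown on this page.

# A minimal, framework-agnostic eval: cases, a scorer, and an aggregate metric.
CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def score(prediction: str, expected: str) -> float:
    # Exact-match scoring; swap in your own metric as needed.
    return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0

def run_eval(agent_fn) -> float:
    # agent_fn maps an input string to a prediction string.
    scores = [score(agent_fn(case["input"]), case["expected"]) for case in CASES]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Trivial baseline agent for demonstration.
    print(run_eval(lambda prompt: "4" if "2 + 2" in prompt else "Paris"))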

What Our Users Say


Tianpei Gu

Research Scientist at TikTok

"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."


Yitao Liu

NLP Researcher at Princeton

"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."


Gus Ye

Founder of Memobase.io

"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."

Popular Benchmarks

Categories: Agent, Code, General, Embedding, Performance, Vision, Long Context
WebArena (Carnegie Mellon University)
A realistic web environment for developing autonomous agents. A GPT-4 agent achieves a 14.41% success rate vs. 78.24% human performance.

MLE-bench (OpenAI)
A benchmark for measuring how well AI agents perform at machine learning engineering.

SWE-bench (Princeton NLP)
A benchmark for evaluating models on real-world software engineering tasks drawn from GitHub issues.

SWE-bench Multimodal (Princeton NLP)
A benchmark for evaluating AI systems on visual software engineering tasks in JavaScript.

AgentBench (Tsinghua University)
A comprehensive benchmark for evaluating LLMs as agents (ICLR 2024).

τ-bench (Sierra AI)
A benchmark for evaluating AI agents' performance in real-world settings with dynamic user interaction.

BIRD-SQL
A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

LegalBench (Hazy Research at Stanford)
A collaboratively built benchmark for measuring legal reasoning in large language models.

STS (Semantic Textual Similarity)
A benchmark for evaluating semantic equivalence between text snippets.