AGI Benchmarks & Evals

Standard benchmarks (MMLU, GSM8K) are saturated. Progress toward AGI is now measured by post-Turing tests that evaluate fluid reasoning, real-world problem solving, and capabilities no narrow system can fake.

Post-Turing AGI Benchmarks

| Benchmark | Creator | What It Tests | Best Score (2026) | Human Baseline | Links |
|---|---|---|---|---|---|
| ARC-AGI | François Chollet / ARC Prize Foundation | Fluid reasoning, novel pattern recognition, abstract grid tasks trivial for humans (task-format sketch below the table) | ~55-65% | ~85% | arcprize.org, GitHub |
| SWE-bench Verified | Princeton NLP | Real GitHub issue resolution, end-to-end software engineering | ~50% (Verified) | Human SWE: varies | swebench.com, Paper |
| GAIA | Meta / HuggingFace | Multi-step reasoning with tool use (web, code, files) | ~75% Level 1, ~50% Level 3 | 92% | Paper, HuggingFace |
| GPQA | NYU | PhD-level science (physics, chemistry, biology) not answerable via search | ~75% | Expert PhD: ~65%; non-expert: ~34% | Paper |
| HLE (Humanity's Last Exam) | Scale AI / CAIS | 3,000+ expert questions across dozens of domains; designed as the hardest exam-style AI test | ~18-58% (model dependent) | Human expert: varies by domain | Paper, GitHub |
| FrontierMath | Epoch AI | Unpublished competition-level math problems requiring hours to solve | <5% | Fields Medalists: varies | epochai.org/frontiermath, Paper |
| MathArena AIME | MathArena | American math competition (AIME) problems; creative problem solving | o3/Gemini solve 85-90% | Top human competitors: 100% | matharena.ai |
| ARC-AGI-2 | ARC Prize Foundation (2025) | Harder ARC-AGI with reduced memorization and more abstraction | ~30% | ~85% | arcprize.org |
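
The ARC-AGI and ARC-AGI-2 rows describe the least exam-like entries in the table: each task is a handful of colored-grid demonstration pairs, and the system must produce the exact output grid for held-out test inputs. Below is a minimal sketch of the public task format and its exact-match scoring; the file path is an illustrative placeholder, and the trivial `solve` baseline stands in for a real solver.

```python
import json

# ARC-AGI tasks are small JSON files with "train" demonstration pairs and
# "test" pairs; each grid is a list of rows of integers 0-9 (colors).
# The path below is an illustrative placeholder, not a guaranteed filename.
with open("data/training/0a1d4ef5.json") as f:
    task = json.load(f)

train_pairs = task["train"]   # [{"input": grid, "output": grid}, ...]
test_pairs = task["test"]     # same shape; "output" is the answer to predict

def solve(input_grid, examples):
    # Trivial baseline for illustration: return the input unchanged.
    # A real solver must infer the transformation rule from `examples`.
    return input_grid

predictions = [solve(pair["input"], train_pairs) for pair in test_pairs]

# Scoring is exact match on the entire output grid; there is no partial credit.
score = sum(pred == pair["output"]
            for pred, pair in zip(predictions, test_pairs)) / len(test_pairs)
print(f"Task accuracy: {score:.0%}")
```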

Saturated Benchmarks (Historical / Foundational)

These benchmarks remain useful for tracking progress and establishing baselines, but they no longer differentiate frontier models. A benchmark is "saturated" when most top models score 90%+ and further gains are marginal. The field has moved to the harder evaluations listed above.

| Benchmark | What It Tests | Status | Links |
|---|---|---|---|
| MMLU | 57-subject knowledge test (STEM, humanities, social sciences) | Saturated (90%+) | Paper |
| MMLU-Pro | Harder MMLU: 10 answer choices, more reasoning-heavy questions | ~80% for top models | Paper |
| GSM8K | Grade-school math word problems, multi-step arithmetic | Saturated (95%+) | Paper |
| HumanEval | Python code generation from docstrings | Saturated (95%+) | Paper |
| HellaSwag | Commonsense NLI; predicting plausible continuations | Saturated (95%+) | Paper |
| BIG-Bench Hard | Challenging BIG-Bench subset that prior LMs failed at | 85%+ for frontier models | Paper |

Agent & Embodied Benchmarks

| Benchmark | What It Tests | Best Score (2026) | Links |
|---|---|---|---|
| WebArena | Real web tasks across functional sites (shopping, forums, GitLab) | ~35% (human: 78%) | Paper |
| OSWorld | Real OS GUI tasks: file management, apps, multi-app workflows | ~22% (human: 72%) | Paper |
| SWE-bench Multimodal | GUI-based coding tasks requiring visual understanding | Emerging | swebench.com |
| MINT | Multi-turn tool use with sustained sequential reasoning | Emerging | Paper |
| AgentBench | Diverse agent environments: OS, database, web, games | Varies by environment | Paper, GitHub |
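
What these agent benchmarks share is an episode structure: the agent observes the environment (a browser accessibility tree, a desktop screenshot), issues actions, and the harness checks a functional success condition at the end (database state, file contents, page state) rather than comparing output strings. The sketch below shows that shape using hypothetical `env` and `agent` interfaces; WebArena, OSWorld, and AgentBench each define their own real APIs.

```python
# Hedged sketch of the episode loop behind agent benchmarks.
# `env` and `agent` are hypothetical interfaces, not any benchmark's actual API.

def run_episode(env, agent, max_steps=30):
    """Run one task episode; return True if the task's success check passes."""
    obs = env.reset()                  # e.g. accessibility tree or screenshot
    for _ in range(max_steps):
        action = agent.act(obs)        # e.g. click, type, scroll, or "stop"
        if action == "stop":
            break
        obs = env.step(action)
    return env.task_succeeded()        # functional check, not string comparison

def success_rate(make_env, agent, tasks):
    """The headline number in the table above: fraction of tasks completed."""
    results = [run_episode(make_env(task), agent) for task in tasks]
    return sum(results) / len(results)
```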

How to Read These Benchmarks

Benchmark scores alone don't define AGI. A model that tops one benchmark while failing another is still narrow AI. True AGI requires generality (performing well across all benchmarks, not just one), robustness (performing well on novel variations and out-of-distribution inputs, not just memorized patterns), and efficiency (not requiring task-specific training data or fine-tuning for each new domain). The best way to gauge progress toward AGI is to track performance across the full suite of post-Turing benchmarks above, not any single score in isolation.
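
One concrete way to apply this: normalize each benchmark score against its human baseline, then look at the minimum across benchmarks as well as the mean. The sketch below illustrates that idea with made-up numbers in the spirit of the tables above; it is not an official or published metric.

```python
def generality_profile(scores, human_baselines):
    """Both arguments: dicts mapping benchmark name -> fraction correct."""
    normalized = {
        name: scores[name] / human_baselines[name]
        for name in scores
        if name in human_baselines
    }
    return {
        "mean_vs_human": sum(normalized.values()) / len(normalized),
        "floor_vs_human": min(normalized.values()),  # the number AGI claims hinge on
        "per_benchmark": normalized,
    }

# Illustrative, made-up numbers only.
profile = generality_profile(
    scores={"ARC-AGI": 0.60, "GAIA": 0.70, "WebArena": 0.35},
    human_baselines={"ARC-AGI": 0.85, "GAIA": 0.92, "WebArena": 0.78},
)
print(profile["mean_vs_human"], profile["floor_vs_human"])
```

A model that tops one benchmark while scoring near zero on another shows a respectable mean but a low floor, which is exactly the narrow-AI failure mode described above.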

See also: ARC Prize (the $1M+ challenge for AGI) | DeepMind Levels of AGI Framework