AGI Benchmarks & Evals
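Standard benchmarks such as MMLU and GSM8K are saturated: top models score above 90% and the results no longer separate one frontier model from another. Progress toward AGI is now tracked with "post-Turing" benchmarks that test fluid reasoning, real-world problem solving, and other capabilities a narrow system cannot fake.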
Post-Turing AGI Benchmarks
| Benchmark | Creator | What It Tests | Best Score (2026) | Human Baseline | Links |
|---|---|---|---|---|---|
| ARC-AGI | François Chollet / ARC Prize Foundation | Fluid reasoning, novel pattern recognition, abstract tasks trivial for humans | Best AI: ~55-65% | ~85% | arcprize.org, GitHub |
| SWE-bench Verified | Princeton NLP | Real GitHub issue resolution, end-to-end software engineering | Best: ~50% (Verified) | Human SWE: varies | swebench.com, Paper |
| GAIA | Meta / HuggingFace | Multi-step reasoning with tool use (web, code, files) | Best: ~75% L1, ~50% L3 | 92% | Paper, HuggingFace |
| GPQA | NYU | PhD-level science (physics, chemistry, biology) not answerable via search | Best: ~75% | Expert PhD: ~65%, Non-expert: ~34% | Paper |
| HLE (Humanity's Last Exam) | Scale AI / CAIS | 3,000+ expert questions across dozens of domains; hardest AI test | Best: ~18-58% (model dependent) | Human expert: varies by domain | Paper, GitHub |
| FrontierMath | Epoch AI | Unpublished competition-level math requiring hours to solve | Best: <5% | Fields Medalists: varies | epochai.org/frontiermath, Paper |
| MathArena | AIME / MathArena | American math competition problems; creative problem-solving | Best: o3/Gemini solve 85-90% | Human top competitors: 100% | matharena.ai |
| ARC-AGI-2 | ARC Prize Foundation (2025) | Harder ARC-AGI with reduced memorization, more abstraction | Best AI: ~30% | ~85% | arcprize.org |
Saturated Benchmarks (Historical / Foundational)
These benchmarks still matter for tracking progress and establishing baselines, but they no longer differentiate frontier models. A benchmark is "saturated" when most top models score 90%+ and further improvements are marginal. Most entries below meet that bar; MMLU-Pro and BIG-Bench Hard are close behind rather than fully saturated. The field has moved to the harder evaluations listed above.
| Benchmark | What It Tests | Status | Links |
|---|---|---|---|
| MMLU | 57-subject knowledge test (STEM, humanities, social sciences) | 90%+ saturated | Paper |
| MMLU-Pro | Harder MMLU: 10 choices, more reasoning-heavy questions | ~80% by top models | Paper |
| GSM8K | Grade-school math word problems, multi-step arithmetic | 95%+ saturated | Paper |
| HumanEval | Python code generation from docstrings | 95%+ saturated | Paper |
| HellaSwag | Commonsense NLI; predicting plausible continuations | 95%+ saturated | Paper |
| BIG-Bench Hard | Challenging BIG-Bench subset that prior LMs failed at | 85%+ by frontier models | Paper |
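The saturation rule described in this section ("most top models score 90%+ and improvements are marginal") can be sketched as a small check. This is a minimal illustration, not an established metric: the function name `saturation_report`, the strict-majority rule, the `min_models` cutoff, and the sample scores are all assumptions made for the example.

```python
def saturation_report(top_scores, threshold=90.0, min_models=3):
    """Summarize whether a benchmark looks saturated.

    top_scores: accuracy percentages (0-100) of the leading models.
    threshold and min_models are illustrative defaults, not a standard.
    """
    if len(top_scores) < min_models:
        # Too few models to say "most top models" with any confidence.
        return {"saturated": False, "headroom": None}
    at_or_above = sum(1 for score in top_scores if score >= threshold)
    return {
        "saturated": at_or_above > len(top_scores) / 2,  # strict majority
        "headroom": round(100.0 - max(top_scores), 1),   # points left to gain
    }


# Made-up scores for illustration only; not taken from any leaderboard.
print(saturation_report([96.1, 95.4, 94.8, 88.0]))
print(saturation_report([35.0, 31.2, 28.7, 22.4]))
```

The `headroom` field is one rough way to express "improvements are marginal": when the best model is within a few points of 100%, there is little room left for the benchmark to separate models.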
Agent & Embodied Benchmarks
| Benchmark | What It Tests | Best Score (2026) | Links |
|---|---|---|---|
| WebArena | Real web tasks across functional sites (shopping, forums, GitLab) | ~35% (human: 78%) | Paper |
| OSWorld | Real OS GUI tasks: file management, apps, multi-app workflows | ~22% (human: 72%) | Paper |
| SWE-bench Multimodal | GUI-based coding tasks requiring visual understanding | Emerging | swebench.com |
| MINT | Multi-turn tool use with sustained sequential reasoning | Emerging | Paper |
| AgentBench | Diverse agent environments: OS, database, web, games | Varies by environment | Paper, GitHub |
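Where the table gives both a model score and a human score, one way to compare rows is to divide the first by the second. The sketch below does that for the WebArena and OSWorld figures from the table; the helper name `human_normalized` and the choice of a simple ratio are assumptions for illustration, and the approximate table values are treated as exact.

```python
def human_normalized(model_score, human_score):
    """Return the model score as a fraction of the human score."""
    if human_score <= 0:
        raise ValueError("human_score must be positive")
    return model_score / human_score


# (best model %, human %) as listed in the table above.
agent_results = {
    "WebArena": (35.0, 78.0),
    "OSWorld": (22.0, 72.0),
}

normalized = {
    name: round(human_normalized(model, human), 2)
    for name, (model, human) in agent_results.items()
}
print(normalized)  # {'WebArena': 0.45, 'OSWorld': 0.31}
```

On these numbers the best agents reach under half of human performance on both benchmarks, which is why this category is far from saturated.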
How to Read These Benchmarks
Benchmark scores alone don't define AGI. A model that tops one benchmark while failing another is still narrow AI. AGI requires three things: generality (strong performance across all benchmarks, not just one), robustness (strong performance on novel variations and out-of-distribution inputs, not just memorized patterns), and efficiency (no task-specific training data or fine-tuning for each new domain). Gauge progress toward AGI by tracking performance across the full suite of post-Turing benchmarks above, not by any single score in isolation.
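The point about generality can be made concrete by summarizing a model's whole score profile instead of its best result. The sketch below is one possible summary under stated assumptions: the function name `generality_profile`, the benchmark labels, and both score profiles are hypothetical, and scores are assumed to be already scaled to the 0-1 range.

```python
from statistics import mean


def generality_profile(scores):
    """Summarize a dict of benchmark name -> score in [0, 1]."""
    worst_benchmark = min(scores, key=scores.get)
    return {
        "best": max(scores.values()),
        "mean": round(mean(scores.values()), 2),
        "worst": scores[worst_benchmark],
        "worst_benchmark": worst_benchmark,
    }


# Hypothetical profiles for illustration only.
specialist = {"bench_a": 0.95, "bench_b": 0.20, "bench_c": 0.25}
generalist = {"bench_a": 0.60, "bench_b": 0.55, "bench_c": 0.58}

for name, profile in (("specialist", specialist), ("generalist", generalist)):
    print(name, generality_profile(profile))
```

The specialist has the higher best score, but its worst score exposes the gap. Reporting the worst case next to the mean is a simple guard against reading a single headline number as evidence of general capability.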
See also: ARC Prize -- the $1M+ challenge for AGI | DeepMind Levels of AGI Framework