Abstract
The
field of Large Language Models (LLMs) is in a state of perpetual and
rapid evolution, making robust, discriminating evaluation methodologies
more critical than ever. This research article provides an exhaustive
analysis of the current landscape of LLM benchmarks as of mid-2025,
meticulously covering every test detailed in recent surveys and reports.
The primary contribution of this work is a comprehensive hierarchy of
these benchmarks, ordered by their difficulty for modern AI systems. As
foundational benchmarks like MMLU become saturated, a new and formidable
generation of evaluations has emerged to test the true frontiers of AI
capabilities in areas such as research-level mathematics, expert
multidisciplinary knowledge, real-world agentic coding, and nuanced
ethical alignment. This article synthesizes the latest data to rank
these diverse tests, from the nearly unsolvable to those that are now
largely mastered. Our analysis reveals that while models like Gemini 2.5
Pro, OpenAI's o3/o4 series, and Anthropic's Claude 4 demonstrate
exceptional, often superhuman performance in specific domains, their
abilities are profoundly challenged by tests requiring deep,
generalizable, and adaptive reasoning across broad and novel contexts.
This definitive ordering highlights the immense progress in AI while
simultaneously mapping the significant gaps that remain on the path
toward artificial general intelligence (AGI).
Introduction: The Imperative for New Measuring
Benchmarks
are the crucibles in which the capabilities of Large Language Models
are forged and quantitatively tested. They provide a structured,
objective, and standardized mechanism for assessing a model's
proficiency in core areas like natural language understanding, complex
reasoning, and code generation. However, the blistering pace of AI
advancement has created a paradox: the very tools used to measure
progress are constantly at risk of obsolescence.
Traditional
benchmarks such as MMLU (Massive Multitask Language Understanding),
once considered a gold standard, have reached a point of saturation
where leading models consistently achieve accuracies over 90%. This
"benchmark saturation" renders these tests less effective at
differentiating the capabilities of state-of-the-art models, creating a
pressing need for more rigorous, dynamic, and challenging evaluations.
In response, the AI research community has developed a new and diverse
suite of benchmarks designed to push the limits of AI at the frontiers
of human expertise. This article provides a comprehensive overview of
these modern benchmarks, ordered by difficulty, to paint the clearest
possible picture of where the true challenges lie for today's most
advanced AI systems.
A Hierarchy of Modern LLM Benchmarks:
From Frontier Challenges to Foundational Tasks
The
following analysis presents a comprehensive hierarchy of LLM
benchmarks, ranked from the most difficult to the least, based on the
performance of top-tier models as of July 2025. Low scores indicate a
higher degree of difficulty and a greater challenge for current AI.
Tier 1: The Unsolved Frontier (Extreme Difficulty)
These
benchmarks represent the absolute pinnacle of difficulty, where even
the most advanced AI systems perform poorly, demonstrating the profound
gap that still exists between current capabilities and true expert-level
intelligence.
1. FrontierMath
Difficulty Level: Extreme
Description:
A brand-new (2024) advanced mathematics benchmark from Epoch AI,
FrontierMath is designed to gauge LLMs on research-level mathematics. It
contains 300 completely new, unpublished problems across modern fields
like number theory, algebraic geometry, and analysis. Crafted by
approximately 60 mathematicians, including Fields Medalists, each
problem requires hours of human effort and has a single numerical
answer.
Performance and Difficulty Analysis:
This benchmark tests true research proficiency, a significant leap
beyond the high-school or contest-level math found in other tests. As a
result, performance is exceptionally low. Currently, even top models
like GPT-4 solve less than 2% of these problems, with OpenAI's o3
achieving approximately 32% only in a specialized, high-compute setting.
The extremely low scores make it a powerful, if coarse, differentiator
at the absolute edge of AI reasoning.
Relevance and Implication:
FrontierMath's novelty and extreme difficulty mean it is free from data
contamination and effectively tests genuine mathematical discovery. It
clearly distinguishes the limits of current AI from human-level creative
reasoning in a highly rigorous domain.
2. Humanity's Last Exam (HLE)
Difficulty Level: Extreme
Description:
Developed by the Center for AI Safety and Scale AI, HLE is positioned
as the "final closed-ended academic benchmark." It comprises 2,500
multi-modal (multiple-choice and short-answer) questions across dozens
of subjects, including mathematics, humanities, and the natural
sciences. Questions are crowdsourced from global subject-matter experts
and filtered to be unsuitable for quick internet retrieval, ensuring a
test of deep, integrated reasoning.
Performance and Difficulty Analysis:
HLE has proven to be extraordinarily challenging. As of June 2025, top
models like the Gemini 2.5 Pro Preview achieve only 21.64% accuracy,
with OpenAI o3 following at 20.32%. Recent online discussions in July
2025 mention Kimi-Researcher achieving 26.9%, but even this higher score
underscores the benchmark's immense difficulty.
Relevance and Implication:
Cited in Stanford HAI's AI Index 2025 Annual Report as one of the "more
challenging benchmarks," HLE demonstrates that despite high scores on
narrower tests, current AI systems are far from human expert-level
understanding across a broad spectrum of knowledge. It serves as a stark
reminder of the remaining hurdles toward AGI.
Tier 2: The Proving Grounds (Very High Difficulty)
This
tier includes benchmarks designed as successors to saturated tests or
as evaluations of complex, multifaceted skills. Models perform better
than on the unsolved frontier but still struggle significantly.
3. MMLU-Pro
Difficulty Level: Very High
Description:
MMLU-Pro is a 2024 enhancement to the original MMLU, designed by
researchers at the University of Waterloo to break the performance
ceiling of its predecessor. It deepens and hardens the questions,
retaining 14 knowledge domains but featuring ~12,000 college-level
questions. Crucially, it expands the number of multiple-choice options
from four to ten, making it more robust against guessing and demanding
deeper reasoning.
Performance and Difficulty Analysis:
The increased difficulty is reflected in performance, with the source
material noting that current top models achieve less than 50% accuracy.
This successfully re-establishes the benchmark as a discriminative test
for frontier models.
Relevance and Implication:
MMLU-Pro is a direct and successful response to benchmark saturation.
It shows that evolving existing benchmark structures by increasing
complexity and the number of distractors can maintain their relevance
for evaluating state-of-the-art models.
4. BIG-Bench Extra Hard (BBEH)
Difficulty Level: Very High
Description:
Released in 2024 by Google DeepMind, BBEH takes 23 difficult reasoning
tasks from the Big-Bench Hard (BBH) subset and replaces them with
significantly harder versions. The new tasks rephrase or extend the
original problems with longer contexts, multi-step hints, and
adversarial twists intended to make current models score near zero.
Performance and Difficulty Analysis:
The design is highly effective, with the best-performing models
achieving only between 10% and 50% accuracy. The extreme difficulty and
low number of examples per task can make the evaluation less
statistically smooth, but it is a powerful probe of advanced reasoning.
Relevance and Implication:
BBEH is crucial for evaluating advanced reasoning and a model's
resilience to complexity. The poor performance of top models suggests
that their reasoning abilities can be fragile and may break down when
problems are structured in novel or more complex ways.
5. SWE-bench (Software Engineering Bench)
Difficulty Level: High
Description:
Introduced in 2023, SWE-bench evaluates agentic coding by tasking
models with resolving ~2,294 real-world GitHub issues from 12 popular
Python repositories. This requires understanding multi-file codebases,
diagnosing bugs from issue descriptions, and generating a correct patch,
simulating an actual developer workflow.
Performance and Difficulty Analysis:
The task is extremely hard; early models like Claude 2 solved only ~2%
of tasks. However, progress has been rapid. The "SWE-Bench Verified"
leaderboard now shows Claude 4 Sonnet at 72.7%, Claude 4 Opus at 72.5%,
and OpenAI o3 at 69.1%. While improving, these scores are far from
perfect, keeping it in the high-difficulty category.
Relevance and Implication:
SWE-bench is a premier benchmark for measuring the practical utility of
LLMs in realistic software maintenance. Its focus on the complex,
iterative, and contextual behaviors required for fixing bugs makes it a
vital test for agentic capabilities.
Tier 3: The Agentic & Adaptive Challenge (High to Moderate Difficulty)
This
tier includes modern benchmarks focused on dynamic reasoning, tool use,
and contamination-free coding, where top models perform well but have
significant room for improvement.
6. GRIND
Difficulty Level: Moderately High
Description:
GRIND is a benchmark designed to specifically measure adaptive
reasoning—testing how well a model can adjust to new information and
solve novel problems rather than relying on pre-learned patterns from
its training data.
Performance and Difficulty Analysis:
Gemini 2.5 Pro leads this benchmark with a score of 82.1%, followed by
Claude 4 Sonnet at 75% and Claude 4 Opus at 67.9%. These scores, while
strong, are not near saturation and show clear differentiation among top
models.
Relevance and Implication:
GRIND's focus on adaptive reasoning is a direct response to the
limitations of static benchmarks. This capability is crucial for
developing robust and reliable AI that can generalize to unpredictable
real-world situations, distinguishing true intelligence from
sophisticated pattern matching.
7. BFCL (Berkeley Function-Calling Leaderboard)
Difficulty Level: Moderately High
Description:
BFCL is a comprehensive benchmark for evaluating an LLM's ability to
use tools by calling functions and APIs. It includes 2,000
question-function-answer pairs across multiple languages (Python, Java,
SQL, etc.) and domains, testing simple, parallel, and multi-step
function calls in stateful, agentic scenarios.
Performance and Difficulty Analysis:
The open-source model Llama 3.1 405b currently leads with a score of
81.1%, followed by Llama 3.3 70b at 77.3% and GPT-4o at 72.08%. The
strong performance of open-source models is a notable trend.
Relevance and Implication:
Tool use is critical for enabling LLMs to interact with external
systems. BFCL is pivotal for assessing this practical skill and reveals
that long-horizon reasoning, memory, and dynamic decision-making in
agentic settings remain open challenges.
8. MMMU (Massive Multi-discipline Multimodal Understanding)
Difficulty Level: Moderate
Description:
MMMU evaluates college-level knowledge across diverse subjects that
require understanding both images and text simultaneously. It tests the
integration of visual perception with domain-specific knowledge.
Performance and Difficulty Analysis:
OpenAI o3 leads with a score of 82.9%, closely followed by Gemini 2.5
Pro at 81.7%. The relatively small gap between top models suggests this
capability, while challenging, is becoming more standardized across the
frontier.
Relevance and Implication:
As AI moves beyond text-only models, the ability to reason across
different modalities is essential. MMMU is a key benchmark for this next
generation of AI.
9. LiveCode Bench
Difficulty Level: Moderate
Description:
LiveCode Bench provides a contamination-free evaluation for coding by
continuously updating its problems from real programming contests held
after model training cutoffs. This prevents data leakage and tests
genuine problem-solving.
Performance and Difficulty Analysis:
The benchmark shows OpenAI's o3-mini leading at 74.1%, with Gemini 2.5
Pro achieving 70.4%. The fact that a smaller model leads suggests
interesting dynamics in specialized training.
Relevance and Implication:
This benchmark is highly valuable for its contamination-free design. It
provides a more accurate assessment of a model's true coding abilities,
as opposed to its capacity for memorization.
10. Newer Coding Benchmarks (ClassEval, BigCodeBench, APPS, MBPP)
Difficulty Level: Moderate
Description: This group represents a spectrum of coding challenges beyond the classic HumanEval.
ClassEval (2023):
Focuses on generating entire Python classes, testing higher-level code
structure. It's a small set (100 tasks) but shows GPT-4 significantly
outperforming other models.
BigCodeBench (2024):
A large (1,140 problem) benchmark for "complex, library-rich" Python
tasks, created to be contamination-free with high test rigor.
APPS (2021): A very large (10,000 problems) benchmark with a wide difficulty range, from simple tasks to complex algorithmic problems.
MBPP (2022): A larger (~1,000 problems) set of entry-level Python problems.
Relevance and Implication:
Together, these benchmarks provide a much richer picture of coding
ability than a single test. They evaluate everything from basic syntax
(MBPP) to complex algorithms (APPS), structured code (ClassEval),
real-world library use (BigCodeBench), and bug fixing (SWE-bench).
Tier 4: The Zone of Saturation (Low Difficulty)
These
benchmarks were once challenging but are now being consistently solved
by top models, rendering them less useful for differentiating frontier
capabilities. High scores here are now considered table stakes.
11. GPQA Diamond
Difficulty Level: Low (for AI)
Description:
The GPQA (General-Purpose Question Answering) Diamond benchmark tests
graduate-level reasoning in biology, physics, and chemistry with
"Google-proof" questions developed by PhD-level experts.
Performance and Difficulty Analysis:
Top models now exceed the performance of the human experts who created
the test. While experts score around 65-74%, Gemini 2.5 Pro scores 86.4%
and Grok 3 [Beta] scores 84.6%. With scores this high, the benchmark is
nearing saturation.
Relevance and Implication:
GPQA Diamond demonstrates that AI is capable of superhuman performance
in specialized scientific reasoning. Its approaching saturation is a
testament to the rapid progress in this domain.
12. AIME (American Invitational Mathematics Examination)
Difficulty Level: Low (Saturated)
Description:
The AIME benchmark uses problems from a highly competitive high school
mathematics competition to test olympiad-level reasoning.
Performance and Difficulty Analysis:
AIME is a prominent example of saturation. On AIME 2025, OpenAI o4-mini
scores 93.4%, with Grok 3 [Beta] at 93.3%. When given access to a
Python interpreter, o4-mini reaches a near-perfect 99.5%, effectively
solving the benchmark.
Relevance and Implication:
The public availability of AIME questions raises significant concerns
about data contamination. The stellar performance of top models
indicates this benchmark no longer poses a difficult challenge,
highlighting the need for harder tests like FrontierMath and MATH.
13. MGSM (Multilingual GSM8K)
Difficulty Level: Low (Saturated)
Description:
MGSM evaluates multilingual capabilities by translating 250
grade-school math word problems from the GSM8K dataset into 10 different
languages. It tests if reasoning skills transfer across languages.
Performance and Difficulty Analysis:
Top models have effectively solved this benchmark. Claude 3.5 Sonnet
and Meta Llama 3.1 405b have tied for first place with 91.60% accuracy.
Relevance and Implication:
The high scores on MGSM demonstrate significant progress in creating
LLMs that are language-agnostic in their core problem-solving abilities,
a vital capability for global deployment.
14. MMLU (Massive Multitask Language Understanding)
Difficulty Level: Very Low (Saturated and Outdated)
Description: MMLU was a foundational benchmark designed to test broad knowledge and reasoning across 57 subjects.
Performance and Difficulty Analysis:
Top models now routinely achieve over 90% accuracy (e.g.,
DeepSeek-R1-0528 at 90.8%). Many modern leaderboards, such as Vellum
AI's, explicitly exclude MMLU, deeming it outdated.
Relevance and Implication:
MMLU is a victim of its own success and the rapid progress of AI. Its
saturation was a primary catalyst for the development of HLE and
MMLU-Pro. Furthermore, its reliability has been questioned, with a June
2024 paper noting that 57% of its virology questions harbor ground-truth
errors.
Tier 5: Qualitative, Ethical, and Foundational Benchmarks
This
final tier includes benchmarks that are either foundational (older and
simpler), or that evaluate dimensions other than pure performance, such
as human preference, truthfulness, and fairness. They are not ranked by
score but are essential for a holistic understanding of LLM
capabilities.
15. Human Preference and Interaction: Chatbot Arena
Description:
Chatbot Arena provides a scalable, crowdsourced ranking of LLMs based
on human preferences. Users engage in side-by-side, anonymous
conversations with two models and vote for the superior response. The
platform uses the Elo rating system to create a unique, ordered ranking.
Performance and Analysis: As of July 2025, Gemini 2.5 Pro leads with an Elo rating of 1473, followed by ChatGPT-4o-latest at 1428.
Relevance and Implication:
Chatbot Arena is critical for validating a model's practical utility
and alignment with human expectations, especially for open-ended
generative tasks where no single "correct" answer exists. It provides an
essential, real-world complement to objective, task-specific metrics.
16. Truthfulness, Honesty, and Safety (TruthfulQA, MASK, BeaverTails, SafetyBench)
Description: This group of benchmarks assesses a model's adherence to facts and safety principles.
TruthfulQA (2022):
Contains 817 questions designed to elicit common misconceptions to
measure a model's ability to provide truthful answers over plausible
falsehoods. While foundational, it is now less challenging for larger
models.
MASK (2025):
A brand-new benchmark from the Center for AI Safety and Scale AI that
tests for knowing deception. It first elicits a model's belief, then
applies pressure to lie, checking for contradictions. This disentangles
honesty from simple factual accuracy.
BeaverTails (2023):
A massive safety alignment dataset with over 333,000 human-annotated
question-answer pairs rated for helpfulness and harmlessness, used for
fine-tuning and evaluating safe outputs.
SafetyBench: A benchmark mentioned for evaluating safety, with specialized datasets and leaderboards.
Relevance and Implication:
These benchmarks are crucial for developing trustworthy and safe AI.
They highlight the difference between factual recall, hallucination, and
deliberate deception, pushing the field to evaluate not just what a model knows, but how it communicates that knowledge under different conditions.
17. Fairness and Factual Recall (WorldBench, FACT-BENCH)
Description: These benchmarks test the equity and comprehensiveness of a model's knowledge.
WorldBench (2024):
Evaluates geographic fairness in factual recall by querying models for
country-specific World Bank statistics. It found that all models tested
perform significantly worse on African and low-income countries.
FACT-BENCH (2024):
A large-scale benchmark for factual recall across 20 domains, testing
memorized knowledge from pre-training data. It found a large gap to
perfect recall even in GPT-4.
Relevance and Implication:
These tests provide a critical audit of an LLM's knowledge base,
revealing biases and gaps that can have significant real-world
consequences. They push developers to create more equitable and
comprehensive models.
18. Foundational Language Understanding (HellaSwag, LAMBADA, GLUE, SuperGLUE)
Description: This group includes older but still relevant benchmarks for common sense and language understanding.
HellaSwag & LAMBADA:
Test common sense and discourse understanding by asking models to
predict plausible sentence endings or fill in missing words that require
broad context.
GLUE & SuperGLUE:
Collections of diverse NLP tasks (e.g., sentiment analysis, textual
entailment) that formed a "progress ladder" for the field. SuperGLUE was
created to be a more challenging successor to GLUE as models began to
master it.
Relevance and Implication:
These benchmarks were foundational in driving progress in language
understanding. While many are now considered solved or less challenging,
they remain important for historical context and for evaluating
smaller, less capable models.
Conclusion and the Path Forward
The
comprehensive hierarchy of LLM benchmarks presented in this article
paints a vivid picture of a field defined by staggering progress and
equally significant challenges. The stark performance difference between
saturated benchmarks like AIME and MMLU, where top models achieve over
90% accuracy, and frontier tests like Humanity's Last Exam and
FrontierMath, where the same models struggle to score above 25%,
illustrates the uneven landscape of current AI capabilities. Models have
achieved superhuman proficiency in narrow, well-defined domains but are
still far from possessing the robust, generalizable, and adaptive
intelligence characteristic of human experts.
The
future of LLM evaluation is clearly moving towards more holistic,
challenging, and real-world-relevant paradigms that address the
limitations of current methods. Key emerging trends that will define the
next generation of benchmarks include:
Adversarial and Safety Testing: A greater focus on robustness, with benchmarks like HarmBench and JailbreakBench designed to intentionally uncover vulnerabilities and ensure models are safe and aligned.
Long-Context and Multimodal Evaluation: As models ingest entire books (NovelQA) and process video and audio (EnigmaEval, MMBench), evaluation must scale to assess comprehension and reasoning across vast and varied data streams.
Agentic and Real-World Evaluation:
The most significant trend is the shift toward evaluating autonomous
agents. This requires assessing multi-step reasoning, tool use,
planning, and self-correction in dynamic environments, moving far beyond
static Q&A.
Domain-Specific and Ethical Audits:
The rise of specialized models (e.g., for medicine or finance) and a
greater awareness of ethical risks necessitate tailored benchmarks that
evaluate performance, safety, and fairness within specific contexts (MedSafetyBench, WorldValuesBench).
For
researchers and practitioners, this ordered landscape provides crucial
guidance. The success of reasoning-focused architectures on difficult
benchmarks suggests that future advances will come from improving
deliberative thinking processes, not just scaling parameters. The
importance of contamination-free and agentic evaluations indicates that
these should be key areas of focus. Ultimately, as AI models become more
powerful, the yardsticks by which we measure them must become
proportionally more sophisticated to guide meaningful and responsible
progress toward the future of artificial intelligence.
Please note: This article was constructed with AI, it may be inaccurate or missing information.