
Benchmarking AI Across Disciplines

By Juan Manuel Ortiz de Zarate

The advancement of Large Language Models has significantly transformed human interaction with artificial intelligence, particularly in academic and professional domains. While many benchmarks [1] evaluate LLMs' proficiency in widely studied disciplines such as mathematics, physics, and computer science, a significant gap exists in assessing their capabilities across less common fields, including light industry, agriculture, and service-oriented disciplines. Addressing this limitation, the SuperGPQA [2] benchmark presents a novel and expansive evaluation framework, measuring graduate-level knowledge and reasoning abilities across 285 disciplines. This article delves into the methodologies, findings, and implications of SuperGPQA, highlighting its role in advancing LLM evaluation.

Methodology: A Human-LLM Collaborative Framework

SuperGPQA employs a structured, multi-stage methodology designed to ensure the highest standards of reliability, validity, and complexity in evaluating LLMs. The process integrates human expertise with machine intelligence to iteratively refine the dataset and maintain high-quality benchmarks. The methodology consists of three key phases: Source Screening, Transcription, and Quality Inspection.


Data Collection Process of SuperGPQA

1. Source Screening: Ensuring Question Credibility


The first phase involves the careful selection of credible question sources. Unlike other benchmarks that rely on crowd-sourced question generation, SuperGPQA limits this task to expert annotators who hold or are pursuing PhDs in relevant disciplines. These experts source questions from:

  • Textbooks and academic literature

  • Peer-reviewed journals

  • University exam banks

  • Problem sets from reputable institutions

This ensures that questions reflect real-world academic rigor and are free from the inaccuracies commonly found in online exercise platforms.


A key challenge in this stage is identifying materials that provide sufficiently complex and domain-specific knowledge without relying on widely available or memorized datasets. Expert annotators verify sources by providing screenshots or references, reducing risks of misinformation and ensuring diversity in question origins.


2. Transcription: Structuring and Refining Questions


Once the raw questions are collected, they undergo a meticulous transcription process to align them with SuperGPQA's standardized format. This phase involves:

  • Translation and Language Standardization: Questions sourced in non-English languages are translated into English using academic and technical language to ensure clarity and precision.

  • Conversion to Multiple-Choice Format: Problems that are not already multiple-choice questions (MCQs), such as open-ended or problem-solving items, are reformulated into a structured multiple-choice format; a sketch of the resulting item structure appears at the end of this subsection.

  • Distractor Generation: To ensure model discrimination, plausible but incorrect answer choices (distractors) are generated. Annotators follow strict guidelines to prevent obvious or misleading distractors, maintaining logical consistency.

  • Difficulty and Reliability Estimation: Questions are categorized by estimated difficulty level (easy, medium, hard) based on expert judgment and LLM performance during preliminary testing.


Rewriting Samples of Questions Requiring the Selection of Correct or Incorrect Options.

One of the major challenges in this phase is ensuring that distractors are sufficiently complex. Poorly designed distractors can lead to pattern recognition rather than genuine reasoning evaluation. To counter this, human annotators analyze how LLMs respond to different choices and iteratively refine answer sets.
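Putting these steps together, a transcribed item could be represented roughly as in the following sketch. The field names and the example question are illustrative assumptions, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass

@dataclass
class TranscribedQuestion:
    """Illustrative container for a transcribed multiple-choice item.

    The field names are hypothetical; SuperGPQA's real schema may differ.
    """
    question: str            # English, standardized academic phrasing
    options: list[str]       # the correct answer plus plausible distractors
    answer_index: int        # position of the correct option in `options`
    discipline: str          # one of the 13 broad disciplines
    field: str               # one of the 72 fields
    subfield: str            # one of the 285 subfields
    difficulty: str          # "easy" | "medium" | "hard"
    source_reference: str    # citation or screenshot provided by the annotator

# Invented example, for illustration only.
item = TranscribedQuestion(
    question="Which fiber property most directly determines yarn tenacity?",
    options=["Fiber length", "Fiber fineness", "Fiber strength", "Fiber crimp",
             "Moisture regain", "Maturity ratio", "Trash content",
             "Color grade", "Micronaire value", "Elongation at break"],
    answer_index=2,
    discipline="Engineering",
    field="Textile Science and Engineering",
    subfield="Textile Materials",
    difficulty="medium",
    source_reference="University exam bank (screenshot on file)",
)
```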


3. Quality Inspection: Multi-Layered Validation Process


The final and most critical phase is Quality Inspection, where questions undergo a rigorous three-stage review process to filter out ambiguity, redundancy, and errors. The three layers of validation include:

  • Rule-Based Pre-Check: Automated tools detect basic formatting inconsistencies, duplicate questions, and common structural issues (a minimal sketch of such checks follows this list).

  • LLM-Based Quality Inspection: Multiple state-of-the-art LLMs (e.g., GPT-4o, Claude-3.5, DeepSeek-R1 [5]) are run against the questions. Items that draw consistently incorrect responses or fail to challenge high-performing models are flagged for revision.

  • Expert-Based Final Review: Human experts with unrestricted web access verify flagged questions, ensuring accuracy, relevance, and appropriate difficulty. This stage is crucial for identifying and correcting questions influenced by faulty external data sources.
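As a rough illustration of the first layer, a rule-based pre-check can be as simple as the sketch below. The specific checks and the assumed record format are illustrative; the benchmark's actual tooling is not described at this level of detail.

```python
import re
from collections import Counter

def rule_based_precheck(questions: list[dict]) -> list[str]:
    """Flag basic formatting and duplication issues before LLM and expert review.

    Each question is assumed to be a dict with "question", "options", and
    "answer_index" keys. Returns human-readable issue descriptions.
    """
    issues = []
    counts = Counter(q["question"].strip().lower() for q in questions)

    for i, q in enumerate(questions):
        text, options = q["question"].strip(), q["options"]
        if counts[text.lower()] > 1:
            issues.append(f"Q{i}: duplicate question text")
        if len(options) != len({o.strip().lower() for o in options}):
            issues.append(f"Q{i}: repeated answer options")
        if not 0 <= q["answer_index"] < len(options):
            issues.append(f"Q{i}: answer index out of range")
        if re.search(r"\b(all|none) of the above\b", " ".join(options), re.I):
            issues.append(f"Q{i}: 'all/none of the above' style option")
        if len(text) < 15:
            issues.append(f"Q{i}: suspiciously short question text")
    return issues
```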

Human-LLM Collaboration: Strengths and Challenges


SuperGPQA’s methodology demonstrates the strengths of combining human expertise with AI-driven evaluation. The involvement of experts ensures that the dataset remains rigorous and academically sound, while LLM evaluation provides an additional layer of quality control by identifying ambiguous or redundant items. This hybrid approach allows for a dynamic question refinement process based on real-world AI performance. By leveraging machine intelligence, the annotation process becomes more scalable and efficient, ensuring that large volumes of data can be processed without compromising quality. Additionally, SuperGPQA enhances discrimination power, ensuring that questions sufficiently differentiate between high and low-performing models, leading to a more accurate assessment of LLM capabilities.


However, challenges remain, particularly in balancing human judgment with LLM-based refinements. AI systems can efficiently flag potential errors, detect inconsistencies, and suggest alternative formulations, yet they often lack the nuanced understanding of subject matter that human experts provide. There are cases where AI-generated modifications might align with statistical patterns but deviate from domain-specific logic, necessitating expert intervention. Furthermore, while LLMs can help optimize the annotation workflow, excessive reliance on them without human oversight can introduce biases or oversimplifications in question design.


By structuring the methodology around a collaborative framework, SuperGPQA maximizes the strengths of both human and machine intelligence. This symbiotic relationship ensures that the benchmark remains both comprehensive and adaptable, continuously evolving with advancements in AI capabilities. The integration of iterative human-LLM feedback loops fosters a system where AI not only supports human experts but also benefits from human intuition and deep contextual understanding, setting a new standard in LLM evaluation.


Data Collection and Composition


SuperGPQA is built upon an extensive and meticulously curated dataset comprising 26,529 questions. These questions span 13 broad disciplines, including science, engineering, humanities, medicine, and law, and are further categorized into 72 fields and 285 subfields. This level of granularity allows for a more precise evaluation of LLM performance across highly specialized domains.


Visualization of the text embeddings of question-answer pairs across disciplines. The authors use the gte-large-en-v1.5 [3,4] encoding model and set the t-SNE parameters to perplexity 100, learning rate 500, and 1,000 iterations. For clearer visualization, they randomly select at most 1,000 samples from each discipline.
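The figure above follows a standard embedding-projection recipe. Under the assumption that the encoder is loaded through sentence-transformers and the projection uses scikit-learn, a reproduction could be sketched as follows (data loading is left out):

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_discipline_embeddings(samples: list[tuple[str, str]]) -> None:
    """Project (discipline, question-plus-answer text) pairs into 2D with t-SNE.

    Loading SuperGPQA itself is out of scope here; pass your own pairs.
    Note that t-SNE needs more samples than the perplexity value (100 below).
    """
    model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5",
                                trust_remote_code=True)
    disciplines = [d for d, _ in samples]
    embeddings = model.encode([text for _, text in samples], batch_size=32)

    # Parameters taken from the figure caption: perplexity 100,
    # learning rate 500, 1,000 iterations (n_iter is max_iter in newer sklearn).
    coords = TSNE(n_components=2, perplexity=100, learning_rate=500,
                  n_iter=1000, random_state=0).fit_transform(embeddings)

    for d in sorted(set(disciplines)):
        idx = [i for i, disc in enumerate(disciplines) if disc == d]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=2, label=d)
    plt.legend(markerscale=4, fontsize=6)
    plt.show()
```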

1. Discipline and Subfield Representation


One of the key features of SuperGPQA is its comprehensive disciplinary coverage. Unlike traditional benchmarks that focus on a limited set of mainstream subjects, SuperGPQA extends into niche fields such as:

  • Textile Science

  • Forestry Engineering

  • Veterinary Medicine

  • Military Science

  • Library and Information Sciences

  • Naval Architecture and Ocean Engineering

By including a broad array of disciplines, SuperGPQA ensures that LLMs are tested on knowledge areas that are often overlooked but are essential to human expertise.


2. Question Complexity and Structure


SuperGPQA prioritizes diversity in question structure and difficulty levels. Questions are categorized into three levels:

  • Easy: Straightforward factual recall or basic concept application.

  • Medium: Involves moderate reasoning or synthesis of information.

  • Hard: Requires deep understanding, multi-step reasoning, or mathematical calculations.

Approximately 42.33% of all questions involve complex reasoning or mathematical calculations, making SuperGPQA one of the most challenging benchmarks available. Moreover, each question is presented in a multiple-choice format, with an average of 9.67 answer options. This significantly increases the difficulty compared to conventional four-option multiple-choice questions, ensuring that models must engage in genuine reasoning rather than relying on elimination strategies.
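As a quick sanity check on why the larger option count matters, the expected score of a purely random guesser falls roughly in proportion to the number of options:

```python
# Expected accuracy of uniform random guessing is 1 / (number of options).
conventional_baseline = 1 / 4       # standard four-option MCQ
supergpqa_baseline = 1 / 9.67       # average option count reported above

print(f"4-option guess baseline:     {conventional_baseline:.1%}")   # 25.0%
print(f"9.67-option guess baseline:  {supergpqa_baseline:.1%}")      # ~10.3%
```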


3. Dataset Expansion and Refinement


The dataset is continuously updated through an iterative feedback loop that incorporates:

  • Expert Contributions: Subject-matter experts regularly review and contribute new questions to maintain the dataset’s academic rigor.

  • LLM Performance Analysis: Questions that are consistently answered correctly by top-performing LLMs are refined or replaced to maintain benchmark difficulty.

  • Cross-Disciplinary Verification: Questions undergo a secondary review process where experts from related disciplines validate accuracy and contextual appropriateness.

This dynamic approach ensures that SuperGPQA remains a relevant and evolving benchmark, adapting alongside advancements in AI capabilities.


4. Balancing Question Representation Across Fields


Although STEM disciplines dominate the dataset due to the nature of computational and analytical reasoning, SuperGPQA actively includes a significant portion of questions from humanities and social sciences. For instance:

  • Science and Engineering account for ~60% of the dataset

  • Humanities and Social Sciences represent ~25%

  • Medical, Law, and Economics fields collectively make up the remaining ~15%

This distribution ensures that while technical disciplines are well represented, there is also adequate evaluation of LLMs' ability to process, reason, and generate insights in non-technical domains.


Experimental Results: Evaluating State-of-the-Art LLMs


SuperGPQA assesses multiple categories of LLMs, including reasoning models, chat models, and base models. The evaluation framework ensures a comprehensive understanding of LLM performance across multiple levels of complexity and disciplinary breadth. Models are tested under different conditions, such as zero-shot, few-shot, and instruction-tuned settings, to assess their adaptability and reasoning capabilities.
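To make the zero-shot and few-shot settings concrete, a minimal prompt builder for multiple-choice evaluation might look like the sketch below. The instruction wording and answer format are assumptions, not SuperGPQA's official prompt template.

```python
from typing import Sequence

LETTERS = "ABCDEFGHIJKLMNOPQRST"

def format_item(question: str, options: Sequence[str]) -> str:
    """Render one multiple-choice item with lettered options."""
    lines = [question] + [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def build_prompt(question: str, options: Sequence[str],
                 examples: Sequence[tuple[str, Sequence[str], str]] = ()) -> str:
    """Zero-shot when `examples` is empty, few-shot otherwise.

    Each example is (question, options, correct_letter).
    """
    parts = ["Answer the multiple-choice question with a single letter."]
    for ex_question, ex_options, letter in examples:
        parts.append(format_item(ex_question, ex_options) + f"\nAnswer: {letter}")
    parts.append(format_item(question, options) + "\nAnswer:")
    return "\n\n".join(parts)

# Zero-shot usage:
# prompt = build_prompt("Which law relates voltage and current?", ["Ohm's law", ...])
```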


Left: Radar chart. Discrimination: the degree of distinction between different models. Climbing Space: the remaining improvement space for SOTA models. Corr. with Arena: correlation with Chatbot Arena Elo scores. Right: performance comparison of SOTA models across different benchmarks.

1. Model Categories and Performance Overview


LLMs evaluated under SuperGPQA fall into three primary categories:

  • Reasoning Models: These models are optimized for logical deduction, multi-step problem-solving, and complex analytical tasks. Examples include DeepSeek-R1 and o1-2024-12-17 [6].

  • Chat Models: Designed for general interactions, these models demonstrate conversational ability but may struggle with structured reasoning. Examples include GPT-4o and Claude-3.5.

  • Base Models: Foundational LLMs that lack extensive fine-tuning, providing insights into the raw capabilities of neural architectures before optimization.

The best-performing reasoning models, such as DeepSeek-R1, achieve an accuracy of 61.82%, indicating significant room for improvement in AI's ability to process advanced academic and professional knowledge.

2. Performance by Discipline


Model performance varies widely across disciplines, revealing strengths and weaknesses in current LLMs. Key observations include:

  • STEM Dominance: Models perform best in mathematics, engineering, and computer science, where structured knowledge allows pattern recognition and systematic problem-solving.

  • Challenges in Humanities: Fields such as history, literature, and philosophy present difficulties due to the abstract, context-dependent nature of knowledge in these areas.

  • Medical and Legal Reasoning: Performance is mixed, with some models excelling in procedural knowledge (e.g., medical diagnostics) but struggling with case-based reasoning in law.

The disparity in performance highlights the need for specialized training datasets tailored to non-STEM fields.


3. The Impact of Instruction Tuning


Instruction-tuned models, which undergo fine-tuning with human feedback, consistently outperform their base counterparts. For example:

  • DeepSeek-V3-Instruct scores 47.40%, compared to 32.14% for its base model, an improvement of more than 15 percentage points.

  • Qwen2.5-72B-Instruct outperforms its base version by over 10 percentage points, confirming that tailored instruction tuning improves contextual reasoning.

These findings emphasize the effectiveness of human-aligned fine-tuning in making LLMs more adept at complex queries.
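The gains quoted above are simple percentage-point differences; as a quick check on the first figure:

```python
# Percentage-point improvement of the instruction-tuned model over its base,
# using the scores reported above.
gain = 47.40 - 32.14
print(f"DeepSeek-V3 instruct vs. base: +{gain:.2f} percentage points")  # +15.26
```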


4. Difficulty-Based Performance Variance


SuperGPQA categorizes questions into three difficulty tiers:

  • Easy: Fact-based, requiring basic recall.

  • Medium: Involves moderate reasoning and synthesis.

  • Hard: Demands multi-step inference and abstract thought.

Even top models exhibit a significant drop in accuracy when facing hard questions, underscoring the limitations of current LLMs in advanced reasoning tasks. For example:

  • DeepSeek-R1 scores 63.59% on easy questions but drops to 56.87% on hard questions.

  • GPT-4o sees an even sharper decline, struggling to maintain accuracy on complex multi-step problems.

This difficulty-based breakdown suggests that while LLMs can efficiently handle structured information, they still require substantial improvements in deep reasoning.
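A per-difficulty breakdown like the one above can be computed from raw evaluation records as in the following sketch; the record format is an assumption for illustration.

```python
from collections import defaultdict

def accuracy_by_difficulty(records: list[dict]) -> dict[str, float]:
    """Compute accuracy per difficulty tier.

    Each record is assumed to look like
    {"difficulty": "easy" | "medium" | "hard", "correct": bool}.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["difficulty"]] += 1
        hits[record["difficulty"]] += int(record["correct"])
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Example with made-up records:
demo = [{"difficulty": "easy", "correct": True},
        {"difficulty": "easy", "correct": False},
        {"difficulty": "hard", "correct": False}]
print(accuracy_by_difficulty(demo))  # {'easy': 0.5, 'hard': 0.0}
```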


5. Benchmark Comparison and Model Evolution


SuperGPQA was designed to measure model improvement over time. Comparing past and current models shows steady but incremental progress:

  • GPT-4o-2024-11-20 outperforms GPT-4o-2024-05-13 by nearly 5 percentage points, showing measurable gains between successive releases of the same model.

  • Claude-3.5 outperforms Claude-3 across nearly all benchmarks, reinforcing the impact of iterative updates in language model training.

These results indicate that model evolution trends toward more sophisticated reasoning, but the gap between artificial and human-level intelligence remains significant.


Implications and Future Directions


SuperGPQA underscores the need for broader and more rigorous evaluation methodologies in LLM research. The findings highlight key areas for future improvements and the evolution of artificial intelligence systems.


One of the major challenges identified is the inconsistent performance of LLMs across disciplines, with significant discrepancies between STEM and non-STEM fields. Future development should focus on expanding training datasets to include more specialized knowledge in law, philosophy, and humanities. Integrating expert-verified corpora will provide nuanced, domain-specific knowledge beyond general internet sources, while developing fine-tuned models that specialize in specific disciplines will improve accuracy in complex fields such as medicine and law.


Many complex real-world reasoning tasks require more than just textual knowledge. Future LLM benchmarks and training processes should explore the integration of visual and spatial data to enhance understanding in fields such as engineering, medicine, and design. The ability to interpret graphs, charts, or images alongside textual data will improve performance in multimodal reasoning tasks. Additionally, better handling of multi-step problem-solving by combining logical deduction with external tools like computational solvers will be crucial for tackling higher-order reasoning challenges.

Static benchmarks often become obsolete as LLMs improve. To ensure continued progress, future benchmarks should be dynamic and self-updating, incorporating new question types and fields as AI models advance. More interactive frameworks with real-time human feedback loops will help refine model evaluation, ensuring that AI systems remain adaptable. Moreover, aligning benchmarks with real-world applications will allow AI models to be tested in practical decision-making scenarios rather than on purely synthetic datasets.


The findings from SuperGPQA reveal that while generalist LLMs perform well on broad knowledge tasks, they struggle in areas requiring deep expertise. Future research should focus on hybrid AI architectures that combine large-scale general models with smaller, domain-specific fine-tuned submodels. Adaptive learning approaches, where models improve performance in specific areas through iterative exposure and expert-guided reinforcement, will enhance model accuracy. Additionally, more robust evaluation frameworks should be developed to assess not only knowledge recall but also true comprehension and application in professional contexts.


Conclusion


SuperGPQA represents a significant step forward in evaluating LLMs across diverse academic disciplines. By setting a new standard for assessment, it provides valuable insights for AI researchers and developers, paving the way for more robust and versatile AI systems. The benchmark’s findings highlight both the strengths and limitations of current models, offering a roadmap for future AI advancements.


References



[2] Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., ... & P Team. (2025). SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739.


[3] Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., ... & Zhang, M. (2024). mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. arXiv preprint arXiv:2407.19669.


[4] Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., & Zhang, M. (2023). Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.


