StatEval, developed by the team of Professor Fan Zhou at Shanghai University of Finance and Economics, is the first benchmark systematically organized along both difficulty and disciplinary axes to evaluate large language models’ statistical reasoning.
It includes a Foundational Knowledge Dataset of over 13,000 problems from 50+ textbooks and a Statistical Research Dataset of over 2,000 proof-based questions sourced from 18 top-tier journals in statistics, probability, econometrics, and machine learning.
The test sets of both datasets are publicly available and can be accessed on Hugging Face:
📘 Foundational Knowledge Dataset:
View on Hugging Face
📗 Statistical Research Dataset:
View on Hugging Face
Overall structure of the StatEval benchmark, showing the two datasets — Foundational Knowledge and Statistical Research — along with their hierarchical organization and representative task examples.
- 6,336 undergraduate-level questions (e.g., elementary probability, linear regression, basic ML).
- 7,481 graduate-level questions (e.g., advanced probability, empirical processes, causal inference).
- Sourced from 45+ textbooks, preliminary exams, and curated course materials.
Composition and distribution of the Foundational Knowledge Dataset, showing proportions across subfields, educational levels, and problem formats.
- 2,374 proof-based tasks (extracted from 2,719 journal papers across 18 venues).
- 8 research subdomains (e.g., classical inference, high-dimensional modeling, Bayesian nonparametrics).
- 8 theoretical property categories (e.g., asymptotic properties, optimality results, identifiability).
Composition and distribution of the Statistical Research Dataset, illustrating coverage across research subdomains and theoretical property categories.
Performance comparison of various models on the two datasets across different statistical topics and difficulty levels.
| Rank | Model | Probability | Statistics | Machine Learning | Undergraduate Mean |
|---|---|---|---|---|---|
A comprehensive breakdown of the first dedicated benchmark for statistical reasoning in LLMs
The StatEval pipeline extracts and standardizes problems from diverse academic sources using LLMs (GPT-5, Gemini series) combined with human-in-the-loop verification. It consists of five core stages that convert raw documents into high-quality, structured evaluation data.
Converts PDFs, scanned files, and LaTeX sources into structured text using multi-modal LLMs like MinerU.
Extracts theorems and relevant context using LLM-driven patterns (Gemini series), ensuring each fragment is self-contained.
Transforms theorems and context into QA pairs using GPT-5, following rules for difficulty, self-containment, single-answer constraints, and quantitative verifiability.
Automated validation with GPT-5 checks rubric compliance and consistency before human review.
Experts verify semantic correctness, difficulty, and dataset classification. Feedback is used to improve agents iteratively.
This pipeline converts scholarly materials into standardized, high-quality evaluation datasets for statistical reasoning tasks. Extraction, question generation, and first-pass validation are automated, while expert review provides the final verification.
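The hand-off from automated quality control to human review can be sketched as follows. This is only an illustration of the rubric rules named above (self-containment, single-answer constraint, quantitative verifiability), not the StatEval implementation. The `QAItem` record, its field names, and both functions are hypothetical. In the described pipeline these judgments come from GPT-5; here they are assumed to arrive as plain booleans.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one generated question-answer pair."""
    question: str
    answer: str
    source: str                      # identifier of the textbook or paper
    self_contained: bool             # solvable without the source document
    single_answer: bool              # exactly one correct final answer
    quantitatively_verifiable: bool  # answer can be checked objectively


def rubric_violations(item: QAItem) -> List[str]:
    """Return the names of the rubric rules that the item fails."""
    checks: Dict[str, bool] = {
        "self_contained": item.self_contained,
        "single_answer": item.single_answer,
        "quantitatively_verifiable": item.quantitatively_verifiable,
    }
    return [name for name, ok in checks.items() if not ok]


def quality_control(
    items: Iterable[QAItem],
) -> Tuple[List[QAItem], List[Tuple[QAItem, List[str]]]]:
    """Split items into those forwarded to human review and those rejected."""
    to_review: List[QAItem] = []
    rejected: List[Tuple[QAItem, List[str]]] = []
    for item in items:
        violations = rubric_violations(item)
        if violations:
            rejected.append((item, violations))
        else:
            to_review.append(item)
    return to_review, rejected
```

Recording which rule failed, rather than only a pass/fail flag, is one way the feedback mentioned in the human review stage could be traced back to a specific generation rule.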
Illustration of the StatEval data processing pipeline.
Open-ended questions, including those from both the Foundational Knowledge and Statistical Research datasets, are evaluated through a process-based scoring pipeline designed to assess both final correctness and the quality of intermediate reasoning.
The model’s response is parsed to identify key reasoning steps, including assumptions, logical transitions, and intermediate derivations. This reconstructs the complete reasoning chain to capture how the final result is obtained.
Each reasoning step is further analyzed to extract quantitative or symbolic outcomes (e.g., computed values, derived expressions, identified distributions), ensuring that both logic and intermediate results are available for verification.
A dedicated LLM evaluator compares the extracted reasoning steps and outcomes with the reference solution, verifying correctness, necessity, and logical consistency of each step.
Each step is assigned binary scores along three dimensions: Reasoning Accuracy, Step Completeness, and Final Answer Correctness. Aggregated scores for one evaluation pass are computed as:
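$$S_{\text{final}}^{(i)} = \alpha\, S_r^{(i)} + \beta\, S_s^{(i)} + (1 - \alpha - \beta)\, S_a^{(i)}$$
With $\alpha = 0.4$, $\beta = 0.3$, and binary scores $S_r^{(i)}, S_s^{(i)}, S_a^{(i)} \in \{0, 1\}$. Scoring is repeated three times, and the final score is:
$$S_{\text{final}} = \min\{S_{\text{final}}^{(1)}, S_{\text{final}}^{(2)}, S_{\text{final}}^{(3)}\}$$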
This four-step design separates reasoning reconstruction from correctness judgment, enabling fine-grained and interpretable evaluation. The framework outputs two complementary indicators: (1) a final score reflecting overall correctness, and (2) a process score reflecting reasoning quality and stepwise consistency.
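The aggregation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation script. The function names are hypothetical, and the sketch assumes that each evaluation pass has already been reduced to one binary score per dimension, read here as reasoning accuracy (`s_r`), step completeness (`s_s`), and final answer correctness (`s_a`).

```python
from typing import Iterable, Tuple

ALPHA = 0.4  # weight on reasoning accuracy (value given in the text)
BETA = 0.3   # weight on step completeness (value given in the text)


def pass_score(s_r: int, s_s: int, s_a: int,
               alpha: float = ALPHA, beta: float = BETA) -> float:
    """Weighted score for one evaluation pass; each input must be 0 or 1."""
    for name, value in (("s_r", s_r), ("s_s", s_s), ("s_a", s_a)):
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
    return alpha * s_r + beta * s_s + (1.0 - alpha - beta) * s_a


def final_score(passes: Iterable[Tuple[int, int, int]]) -> float:
    """Minimum pass score over the repeated evaluation passes."""
    scores = [pass_score(*p) for p in passes]
    if not scores:
        raise ValueError("at least one evaluation pass is required")
    return min(scores)


if __name__ == "__main__":
    # Three passes; the second judges the steps incomplete.
    example = [(1, 1, 1), (1, 0, 1), (1, 1, 1)]
    print(final_score(example))  # roughly 0.7, set by the weakest pass
```

Because the final score is a minimum rather than a mean, a single pass that finds a flaw is enough to lower the result, which makes the metric conservative with respect to evaluator disagreement.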
Illustration of the evaluation pipeline for open-ended questions in StatEval.
All benchmark resources of StatEval are publicly available to ensure transparency and reproducibility. The two test datasets can be accessed via Hugging Face, while implementation details, evaluation scripts, and prompt templates are available on GitHub.
Foundational Knowledge Dataset: Undergraduate and graduate-level statistical questions from textbooks, exams, and curated teaching materials.
→ View on Hugging Face
Statistical Research Dataset: Proof-based and theoretical problems extracted from peer-reviewed research papers, covering advanced statistical subfields.
→ View on Hugging Face
The full implementation, evaluation pipeline, and reproducibility instructions can be found on GitHub:
→ Visit GitHub Repository
Code coming soon: Our paper is temporarily hosted on our GitHub repository, and the code is currently being prepared. For questions or early access requests, please contact us at zhoufan@mail.shufe.edu.cn