StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

StatEval, developed by the team of Professor Fan Zhou at Shanghai University of Finance and Economics, is the first benchmark systematically organized along both difficulty and disciplinary axes to evaluate large language models’ statistical reasoning.

It includes a Foundational Knowledge Dataset of over 13,000 problems from 50+ textbooks and a Statistical Research Dataset of over 2,000 proof-based questions sourced from 18 top-tier journals in statistics, probability, econometrics, and machine learning.

The test sets of both datasets are publicly available and can be accessed on Hugging Face:

📘 Foundational Knowledge Dataset: View on Hugging Face
📗 Statistical Research Dataset: View on Hugging Face

The Big Picture of the StatEval Benchmark

StatEval Overview Diagram

Overall structure of the StatEval benchmark, showing the two datasets — Foundational Knowledge and Statistical Research — along with their hierarchical organization and representative task examples.

Foundational Knowledge Dataset

- 6,336 undergraduate-level questions (e.g., elementary probability, linear regression, basic ML).
- 7,481 graduate-level questions (e.g., advanced probability, empirical processes, causal inference).
- Sourced from 45+ textbooks, preliminary exams, and curated course materials.

Foundational Knowledge Dataset Composition and Distribution

Composition and distribution of the Foundational Knowledge Dataset, showing proportions across subfields, educational levels, and problem formats.

Statistical Research Dataset

- 2,374 proof-based tasks (extracted from 2,719 journal papers across 18 venues).
- 8 research subdomains (e.g., classical inference, high-dimensional modeling, Bayesian nonparametrics).
- 8 theoretical property categories (e.g., asymptotic properties, optimality results, identifiability).

Statistical Research Dataset Composition and Distribution

Composition and distribution of the Statistical Research Dataset, illustrating coverage across research subdomains and theoretical property categories.

Leaderboard

Performance comparison of various models on two datasets across different statistical topics and difficulty levels.

Undergraduate Performance - Foundational Knowledge Dataset

Data Update Time: 2025-10-08
Columns: Rank | Model | Probability | Statistics | Machine Learning | Undergraduate Mean

Overview

A comprehensive breakdown of the first dedicated benchmark for statistical reasoning in LLMs

1. Data Processing Pipeline

The StatEval pipeline extracts and standardizes problems from diverse academic sources using LLMs (GPT-5, Gemini series) combined with human-in-the-loop verification. It consists of five core stages that convert raw documents into high-quality, structured evaluation data.

1. File Conversion

Converts PDFs, scanned files, and LaTeX sources into structured text using multi-modal document-parsing tools such as MinerU.

2. Context Segmentation

Extracts theorems and their surrounding context using LLM-driven extraction (Gemini series), ensuring each fragment is self-contained.

3. Problem Generation

Transforms theorems and context into QA pairs using GPT-5, following rules for difficulty, self-containment, single-answer constraints, and quantitative verifiability.

4. Quality Control

Automated validation with GPT-5 checks rubric compliance and consistency before human review.

5. Human Check & Feedback

Experts verify semantic correctness, difficulty, and dataset classification. Their feedback is used to iteratively improve the pipeline's LLM agents.

This pipeline enables fully automated conversion of scholarly materials into standardized, high-quality evaluation datasets for statistical reasoning tasks.
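To make the five stages concrete, here is a minimal sketch of how such a pipeline could be wired together. It is a hypothetical outline, not the StatEval implementation: the stage functions and the `call_llm` helper are placeholders standing in for the tools named above (MinerU-style conversion, Gemini-based segmentation, GPT-5 generation and automated quality control), and the prompts are illustrative only.

```python
# Hypothetical sketch of the five-stage pipeline described above (not the StatEval codebase).
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    answer: str
    source: str
    passed_auto_qc: bool = False  # set by stage 4; human review (stage 5) happens offline


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a GPT-5 or Gemini client)."""
    raise NotImplementedError("plug in your LLM client here")


def convert_file(path: str) -> str:
    """Stage 1: convert a PDF/scan/LaTeX source into structured text.
    In practice this would call a document-parsing tool such as MinerU."""
    raise NotImplementedError("plug in a document parser here")


def segment_context(text: str) -> list[str]:
    """Stage 2: extract self-contained theorem + context fragments."""
    response = call_llm("gemini", f"Extract self-contained theorem fragments:\n{text}")
    return [frag.strip() for frag in response.split("\n---\n") if frag.strip()]


def generate_problem(fragment: str, source: str) -> Problem:
    """Stage 3: turn a fragment into a single-answer, quantitatively verifiable QA pair."""
    qa = call_llm("gpt-5", "Write one self-contained question with a unique, "
                           f"verifiable answer based on:\n{fragment}")
    question, _, answer = qa.partition("\nANSWER:")
    return Problem(question=question.strip(), answer=answer.strip(), source=source)


def auto_quality_check(problem: Problem) -> bool:
    """Stage 4: automated rubric/consistency check before human review."""
    verdict = call_llm("gpt-5", "Does this QA pair satisfy the rubric? Answer YES or NO.\n"
                                f"Q: {problem.question}\nA: {problem.answer}")
    return verdict.strip().upper().startswith("YES")


def run_pipeline(paths: list[str]) -> list[Problem]:
    """Stages 1-4; accepted problems are queued for expert review (stage 5)."""
    problems = []
    for path in paths:
        for fragment in segment_context(convert_file(path)):
            problem = generate_problem(fragment, source=path)
            problem.passed_auto_qc = auto_quality_check(problem)
            if problem.passed_auto_qc:
                problems.append(problem)
    return problems
```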

StatEval data processing pipeline

Illustration of the StatEval data processing pipeline.

2. Evaluation Methodology

Open-ended questions, including those from both the Foundational Knowledge and Statistical Research datasets, are evaluated through a process-based scoring pipeline designed to assess both final correctness and the quality of intermediate reasoning.

1. Reasoning Step Extraction

The model’s response is parsed to identify key reasoning steps, including assumptions, logical transitions, and intermediate derivations. This reconstructs the complete reasoning chain to capture how the final result is obtained.

2. Outcome Extraction

Each reasoning step is further analyzed to extract quantitative or symbolic outcomes (e.g., computed values, derived expressions, identified distributions), ensuring that both logic and intermediate results are available for verification.

3. LLM Judging

A dedicated LLM evaluator compares the extracted reasoning steps and outcomes with the reference solution, verifying correctness, necessity, and logical consistency of each step.

4. Scoring

For each evaluation pass, binary scores are assigned along three dimensions: Reasoning Accuracy, Step Completeness, and Final Answer Correctness. The aggregated score for pass i is computed as:

Sfinal(i) = α Sr(i) + β Ss(i) + (1 - α - β) Sa(i)

where α = 0.4, β = 0.3, and the binary scores Sr(i), Ss(i), Sa(i) ∈ {0, 1}. Scoring is repeated three times, and the final score is:

Sfinal = min{Sfinal(1), Sfinal(2), Sfinal(3)}
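As a concrete illustration of this aggregation, the short sketch below computes the per-pass score and takes the minimum over the three passes. It assumes the binary per-pass scores have already been produced by the LLM judge; the function names are illustrative, not taken from the StatEval code.

```python
# Minimal sketch of the score aggregation described above (not the official implementation).
ALPHA, BETA = 0.4, 0.3  # weights for Reasoning Accuracy and Step Completeness


def pass_score(s_r: int, s_s: int, s_a: int) -> float:
    """S_final(i) = alpha*S_r(i) + beta*S_s(i) + (1 - alpha - beta)*S_a(i), binary inputs."""
    assert all(s in (0, 1) for s in (s_r, s_s, s_a))
    return ALPHA * s_r + BETA * s_s + (1 - ALPHA - BETA) * s_a


def final_score(passes: list[tuple[int, int, int]]) -> float:
    """Final score is the minimum per-pass score over the three evaluation passes."""
    assert len(passes) == 3
    return min(pass_score(*p) for p in passes)


# Example: reasoning and steps judged correct in all passes, final answer wrong in one.
print(final_score([(1, 1, 1), (1, 1, 0), (1, 1, 1)]))  # -> 0.7
```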

This four-step design separates reasoning reconstruction from correctness judgment, enabling fine-grained and interpretable evaluation. The framework outputs two complementary indicators: (1) a final score reflecting overall correctness, and (2) a process score reflecting reasoning quality and stepwise consistency.

StatEval evaluation pipeline

Illustration of the evaluation pipeline for open-ended questions in StatEval.

3. Reproducibility Details

All benchmark resources of StatEval are publicly available to ensure transparency and reproducibility. The two test datasets can be accessed via Hugging Face, while implementation details, evaluation scripts, and prompt templates are available on GitHub.

📘 Foundational Knowledge Dataset

Undergraduate and graduate-level statistical questions from textbooks, exams, and curated teaching materials.

→ View on Hugging Face

📗 Statistical Research Dataset

Proof-based and theoretical problems extracted from peer-reviewed research papers, covering advanced statistical subfields.

→ View on Hugging Face
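For reference, a minimal loading sketch with the Hugging Face `datasets` library is shown below. The repository identifiers are placeholders, since the exact dataset IDs are given only through the links above; substitute the real IDs before running.

```python
# Hypothetical loading sketch; replace the placeholder repo IDs with the
# actual Hugging Face dataset identifiers linked above.
from datasets import load_dataset

foundational = load_dataset("ORG/StatEval-Foundational-Knowledge", split="test")  # placeholder ID
research = load_dataset("ORG/StatEval-Statistical-Research", split="test")        # placeholder ID

print(foundational)  # inspect the question/answer fields before running an evaluation
print(research)
```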

🧩 Implementation & Evaluation Code

The full implementation, evaluation pipeline, and reproducibility instructions can be found on GitHub:

→ Visit GitHub Repository

Code coming soon: our paper is temporarily hosted in the GitHub repository, and the code is currently being prepared for release. For questions or early-access requests, please contact us at zhoufan@mail.shufe.edu.cn.