StatEval

The big picture of the StatEval benchmark

Overall structure of the StatEval benchmark, showing the two datasets — Foundational Knowledge and Statistical Research — along with their hierarchical organization and representative task examples.

Foundational Knowledge Dataset

- 6,336 undergraduate-level questions (e.g., elementary probability, linear regression, basic ML).
- 7,481 graduate-level questions (e.g., advanced probability, empirical processes, causal inference).
- Sourced from 45+ textbooks, preliminary exams, and curated course materials.

Composition and distribution of the Foundational Knowledge Dataset, showing proportions across subfields, educational levels, and problem formats.

Statistical Research Dataset

- 2,374 proof-based tasks (extracted from 2,719 journal papers across 18 venues).
- 8 research subdomains (e.g., classical inference, high-dimensional modeling, Bayesian nonparametrics).
- 8 theoretical property categories (e.g., asymptotic properties, optimality results, identifiability).

Composition and distribution of the Statistical Research Dataset, illustrating coverage across research subdomains and theoretical property categories.

Leaderboard

Performance of the evaluated models on the two datasets, broken down by statistical topic and difficulty level.

Undergraduate Performance - Foundational Knowledge Dataset

Data Update Time: 2025-10-08

Rank	Model	Probability	Statistics	Machine Learning	Undergraduate Mean

Graduate Performance - Foundational Knowledge Dataset

Data Update Time: 2025-10-08

Rank	Model	Probability	Statistics	Machine Learning	Graduate Mean

Overall Performance - Foundational Knowledge Dataset

Data Update Time: 2025-10-08

Rank	Model	Undergraduate Mean	Graduate Mean	Overall Mean

Research Area Performance - Statistical Research Dataset

Data Update Time: 2025-10-08

Rank	Model	Probability	Statistics	Machine Learning	Mean

Theoretical Property Performance - Statistical Research Dataset

Data Update Time: 2025-10-08

Rank	Model	Asymp	Conv	Dist	Gen	Ident	Opt	Struct	Test	Mean

Abbreviations:

Asymp = Asymptotic Properties; Conv = Convergence & Stability; Dist = Distributional Properties; Gen = Generalization & Error Bounds; Ident = Identifiability & Consistency; Opt = Optimality Results; Struct = Structural Guarantees; Test = Testing Validity.

Overview

A comprehensive breakdown of the first dedicated benchmark for statistical reasoning in LLMs

1. Data Processing Pipeline

The StatEval pipeline extracts and standardizes problems from diverse academic sources using LLMs (GPT-5, Gemini series) combined with human-in-the-loop verification. It consists of five core stages that convert raw documents into high-quality, structured evaluation data.

1. File Conversion

Converts PDFs, scanned files, and LaTeX sources into structured text using multimodal document-parsing tools such as MinerU.

2. Context Segmentation

Extracts theorems and relevant context using LLM-driven patterns (Gemini series), ensuring each fragment is self-contained.

3. Problem Generation

Transforms theorems and context into QA pairs using GPT-5, following rules for difficulty, self-containment, single-answer constraints, and quantitative verifiability.

4. Quality Control

Automated validation with GPT-5 checks rubric compliance and consistency before human review.

5. Human Check & Feedback

Experts verify semantic correctness, difficulty, and dataset classification. Their feedback is used to iteratively refine the LLM agents in the earlier stages.

This pipeline automates most of the conversion of scholarly materials into standardized, high-quality evaluation data for statistical reasoning tasks, with expert review as the final check.
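The five stages can be read as a chain of functions, each taking the previous stage's output. The sketch below is only an illustration of that structure, not the StatEval implementation: the `Record` fields, the stage function names, and every stage body are hypothetical stand-ins (simple string operations in place of MinerU, Gemini, GPT-5, and expert review).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """State carried between stages. All field names are hypothetical."""
    raw: str
    text: str = ""
    fragments: List[str] = field(default_factory=list)
    qa_pairs: List[Dict] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


Stage = Callable[[Record], Record]


def file_conversion(r: Record) -> Record:
    # Stand-in for document parsing: here we only trim whitespace.
    r.text = r.raw.strip()
    return r


def context_segmentation(r: Record) -> Record:
    # Stand-in for LLM-driven segmentation: split on blank lines.
    r.fragments = [p.strip() for p in r.text.split("\n\n") if p.strip()]
    return r


def problem_generation(r: Record) -> Record:
    # Stand-in for LLM question generation: one placeholder pair per fragment.
    r.qa_pairs = [
        {"context": f, "question": None, "answer": None, "approved": False}
        for f in r.fragments
    ]
    return r


def quality_control(r: Record) -> Record:
    # Stand-in for automated rubric checks: drop pairs with trivial context.
    r.qa_pairs = [q for q in r.qa_pairs if len(q["context"]) >= 10]
    return r


def human_check(r: Record) -> Record:
    # Experts would set "approved"; this stub leaves every pair pending.
    return r


STAGES: List[Stage] = [
    file_conversion,
    context_segmentation,
    problem_generation,
    quality_control,
    human_check,
]


def run_pipeline(raw: str, stages: List[Stage] = STAGES) -> Record:
    """Apply the stages in order and log which ones ran."""
    record = Record(raw=raw)
    for stage in stages:
        record = stage(record)
        record.history.append(stage.__name__)
    return record
```

Keeping each stage as a plain function over a shared record makes it easy to swap a stub for a real model call, or to insert an extra check, without touching the other stages.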

Illustration of the StatEval data processing pipeline.

2. Evaluation Methodology

Open-ended questions, including those from both the Foundational Knowledge and Statistical Research datasets, are evaluated through a process-based scoring pipeline designed to assess both final correctness and the quality of intermediate reasoning.

1. Reasoning Step Extraction

The model’s response is parsed to identify key reasoning steps, including assumptions, logical transitions, and intermediate derivations. This reconstructs the complete reasoning chain to capture how the final result is obtained.

2. Outcome Extraction

Each reasoning step is further analyzed to extract quantitative or symbolic outcomes (e.g., computed values, derived expressions, identified distributions), ensuring that both logic and intermediate results are available for verification.

3. LLM Judging

A dedicated LLM evaluator compares the extracted reasoning steps and outcomes with the reference solution, verifying correctness, necessity, and logical consistency of each step.

4. Scoring

Each step is assigned binary scores along three dimensions: Reasoning Accuracy, Step Completeness, and Final Answer Correctness. The aggregated score for evaluation pass i is computed as:

S_final⁽ⁱ⁾ = α S_r⁽ⁱ⁾ + β S_s⁽ⁱ⁾ + (1 - α - β) S_a⁽ⁱ⁾

Here α = 0.4 and β = 0.3, so the final-answer weight is 1 - α - β = 0.3, and the binary scores satisfy S_r⁽ⁱ⁾, S_s⁽ⁱ⁾, S_a⁽ⁱ⁾ ∈ {0,1}. Scoring is repeated three times, and the final score is the minimum over the three passes:

S_final = min{S_final⁽¹⁾, S_final⁽²⁾, S_final⁽³⁾}

This four-step design separates reasoning reconstruction from correctness judgment, enabling fine-grained and interpretable evaluation. The framework outputs two complementary indicators: (1) a final score reflecting overall correctness, and (2) a process score reflecting reasoning quality and stepwise consistency.
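The scoring rule above can be sketched in a few lines of Python. The weights and the minimum over passes come from the formulas in this section; the class name `PassScores`, its field names, and the function names are illustrative assumptions, and the sketch takes the three binary judgments per pass as given rather than modelling how the LLM judge produces them.

```python
from dataclasses import dataclass
from typing import Sequence

ALPHA = 0.4  # weight on reasoning accuracy (from the text)
BETA = 0.3   # weight on step completeness (from the text)


@dataclass(frozen=True)
class PassScores:
    """Binary judgments from one evaluation pass (names are illustrative)."""
    reasoning: int     # S_r in {0, 1}
    completeness: int  # S_s in {0, 1}
    answer: int        # S_a in {0, 1}


def pass_score(s: PassScores, alpha: float = ALPHA, beta: float = BETA) -> float:
    """Weighted score for one pass: alpha*S_r + beta*S_s + (1-alpha-beta)*S_a."""
    for value in (s.reasoning, s.completeness, s.answer):
        if value not in (0, 1):
            raise ValueError("scores must be binary (0 or 1)")
    return (
        alpha * s.reasoning
        + beta * s.completeness
        + (1 - alpha - beta) * s.answer
    )


def final_score(passes: Sequence[PassScores]) -> float:
    """Final score: the minimum pass score across repeated evaluations."""
    if not passes:
        raise ValueError("at least one evaluation pass is required")
    return min(pass_score(p) for p in passes)


# Three passes; the second one judges the final answer incorrect.
example = [PassScores(1, 1, 1), PassScores(1, 1, 0), PassScores(1, 1, 1)]
print(round(final_score(example), 2))
```

With these weights a single pass can only take the values 0, 0.3, 0.4, 0.6, 0.7, or 1.0 (up to floating-point rounding). Taking the minimum makes the final score conservative: in the example, one pass that rejects the final answer pulls the result down to about 0.7 even though the other two passes award full marks.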

Illustration of the evaluation pipeline for open-ended questions in StatEval.

3. Reproducibility Details

All benchmark resources of StatEval are publicly available to ensure transparency and reproducibility. The two test datasets can be accessed via Hugging Face, while implementation details, evaluation scripts, and prompt templates are available on GitHub.

📘 Foundational Knowledge Dataset

Undergraduate and graduate-level statistical questions from textbooks, exams, and curated teaching materials.

→ View on Hugging Face

📗 Statistical Research Dataset

Proof-based and theoretical problems extracted from peer-reviewed research papers, covering advanced statistical subfields.

→ View on Hugging Face

🧩 Implementation & Evaluation Code

The full implementation, evaluation pipeline, and reproducibility instructions can be found on GitHub:

→ Visit GitHub Repository

Code coming soon: our paper is temporarily hosted on our GitHub repository while the code is being prepared. For questions or early-access requests, please contact us at zhoufan@mail.shufe.edu.cn