Datasets for advancing Theoretical Physics and AI

There is a lack of curated datasets in theoretical physics to train better machine learning models. But what exactly is missing and how can we fill the gaps?

13th April 2025

The recent history of deep learning shows the crucial role played by curated datasets. For example, Fei-Fei Li and her collaborators dramatically reshaped computer vision with the creation of ImageNet, a large-scale, labeled image collection, which helped spark the deep learning revolution. Similarly, datasets like CIFAR-10 and MNIST have provided foundational benchmarks essential for algorithmic progress.

Despite these advances in machine learning, theoretical physics still lacks comprehensive, standardized datasets. Developing high-quality datasets specifically tailored for theoretical physics could accelerate progress both in AI—by enabling more powerful models—and in physics itself, by establishing common benchmarks for training and evaluating physics-related models.

In this post, I start by surveying existing physics-related datasets by domain, data type, content level, and availability. I then try to identify the remaining gaps and propose new datasets to create.

Existing Datasets

Theoretical Physics (Knowledge & Simulations)

This includes textual corpora of theory papers, equation datasets, and simulation data of theoretical models.

| Dataset / Source | Domain & Content | Type | Level | Availability |
| --- | --- | --- | --- | --- |
| ArXiv Physics Corpus | All physics subfields (theory & experiment) – 1.2M+ research papers (PhysBERT: A Text Embedding Model for Physics Scientific Literature) | Text (papers, PDFs) | Frontier research | Open-access (arXiv) |
| Physics Journals (e.g. APS) | Broad physics research literature (peer-reviewed journals) | Text (papers) | Frontier research | Restricted (subscription) |
| Feynman Symbolic Regression Dataset | Classical physics formulas (from Feynman Lectures, etc.) – 100+ laws | Symbolic equations + numeric data | Undergrad–Graduate | Open (research dataset) |
| Kreuzer–Skarke Calabi–Yau DB | String theory – 473,800,776 reflexive 4D polytopes (Calabi–Yau manifolds) (Group-invariant machine learning on the Kreuzer-Skarke dataset) | Structured (geometric data) | Frontier research | Open (online database) |
| Lattice QCD Configurations | Quantum Field Theory (lattice QCD) – gauge field samples, correlation functions | Numeric (lattice data) | Frontier research | Partially open (example) |
| SXS Waveforms (Simulating eXtreme Spacetimes) | General Relativity – Numerical relativity waveforms of binary black holes (SXS Gravitational Waveform Database) | Numerical time-series | Frontier research | Open (public catalog) |

Experimental Physics

Datasets from experiments and simulations that test physical theories, often used to train ML models to detect patterns or surrogate models for experiments.

| Dataset / Source | Domain & Content | Type | Level | Availability |
| --- | --- | --- | --- | --- |
| CERN Open Data (LHC) | High-energy physics – Petabytes of LHC collision data (ATLAS, CMS, etc.) | Numerical (events, detector readings) | Frontier research | Open-access (portal) |
| HEP ML Datasets (HIGGS, HEPMASS, etc.) | Particle physics – Simulated collision events labeled as Higgs vs. background | Numerical (tabular features) | Graduate/Research | Open (UCI/Zenodo) |
| LIGO/Virgo GWOSC | Gravitational waves – Time-series signals from interferometers (event strain data) | Numerical (time-series) | Frontier research | Open (GWOSC portal) |
| Quantum Optics Experiments | Quantum optics – e.g. single-photon interference, trapped-ion measurements | Numeric (experimental logs, time-series) | Graduate/Research | Limited open (lab repositories, e.g. QDataSet) |
| Fluid Dynamics/CFD Simulations | Classical mechanics – CFD simulation outputs (e.g. flow fields, turbulence) | Numerical (grids, images) | Graduate/Research | Partially open (benchmarks, e.g. NASA CFD data) |
| Graph Network Simulations | Multi-body physics – Synthetic trajectories for n-body, fluids, rigid bodies (Physics Simulation With Graph Neural Networks Targeting Mobile, Arm Community blog) | Numeric (graph-based, trajectories) | Undergrad–Graduate | Partially open (code to generate; DeepMind GNS data) |

Mathematics for Physics

Datasets of mathematical problems, proofs, and symbolic computations relevant to physics problem-solving and theory.

| Dataset / Source | Domain & Content | Type | Level | Availability |
| --- | --- | --- | --- | --- |
| MATH Dataset (Hendrycks et al.) | 12,500 competition math problems with step-by-step solutions ([2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset) | Text (problem ⇒ solution) | Undergrad (contest) | Open (public dataset) |
| PhysQA | 1,008 physics word problems (mechanics, etc.) with annotated solutions | Text (word problems Q&A) | High school | Open (research dataset) |
| GPT-4 Physics Q&A (Camel Physics) | 20,000 physics problem–solution pairs generated by GPT-4 (camel-ai/physics · Datasets at Hugging Face) | Text (QA, synthetic) | Undergrad–Grad (mixed) | Open (Hugging Face) |
| Formal Theorem Libraries | Proofs and theorems (Isabelle, Lean, Coq libraries) – e.g. analysis, algebra used in physics | Formal text (logic) | Graduate–Research | Open (MIT/BSD licenses) |
| Symbolic Integration & ODE Sets | Large sets of integrals and differential equations for symbolic solving (e.g. 27M integration pairs) | Symbolic (expressions) | Undergrad–Grad | Open (research, SIRD dataset) |
| PINN Benchmark (PINNacle) | 20+ distinct physics PDEs (heat eq., Navier-Stokes, etc.) with solution data for PINNs (PINNacle: A Comprehensive Benchmark of Physics-Informed Neural …) | Numerical (PDE solutions) | Undergrad–Grad | Open (benchmark dataset) |
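
To give a feel for what "PDE solution data" means in the last row, here is a minimal sketch that samples the analytic solution of the 1D heat equation and checks it against the PDE with finite differences. It is not taken from PINNacle. The diffusivity `ALPHA`, the unit domain, the initial condition sin(πx), and all function names are assumptions chosen for illustration.

```python
import math
import random

ALPHA = 0.1  # assumed diffusivity, chosen only for illustration


def heat_solution(x: float, t: float, alpha: float = ALPHA) -> float:
    """Analytic solution of u_t = alpha * u_xx on [0, 1]
    with u(x, 0) = sin(pi x) and u(0, t) = u(1, t) = 0."""
    return math.exp(-alpha * math.pi**2 * t) * math.sin(math.pi * x)


def make_samples(n: int, seed: int = 0) -> list[tuple[float, float, float]]:
    """Draw n random (x, t) points in the unit square and label them with u(x, t)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x, t = rng.random(), rng.random()
        samples.append((x, t, heat_solution(x, t)))
    return samples


def pde_residual(x: float, t: float, alpha: float = ALPHA, h: float = 1e-3) -> float:
    """Central-difference estimate of u_t - alpha * u_xx (should be close to zero)."""
    u_t = (heat_solution(x, t + h) - heat_solution(x, t - h)) / (2 * h)
    u_xx = (
        heat_solution(x + h, t) - 2 * heat_solution(x, t) + heat_solution(x - h, t)
    ) / h**2
    return u_t - alpha * u_xx


samples = make_samples(1000)
max_residual = max(abs(pde_residual(x, t)) for x, t, _ in samples)
```

Each `(x, t, u)` triple is one labeled training point. A physics-informed model would typically be trained to keep a residual like `pde_residual` small rather than only fitting the labels.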

Multimodal Physics Data

Datasets that combine text, equations, and visuals.

The tables above show that many open-access datasets exist, especially for high-energy physics, mathematical problems, and certain simulations. Some datasets, however, are commercial or restricted, such as proprietary textbook problem banks, paywalled journal corpora, and private experimental data (e.g. active experimental runs not yet released).
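
One way to make this survey queryable is to encode each table row as a record and filter on its fields. The sketch below does this for a handful of rows. The `DatasetEntry` class, its field names, and the `fully_open` helper are hypothetical, and the entries are shortened versions of the rows listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the survey: hypothetical schema mirroring the table columns."""
    name: str
    domain: str
    data_type: str
    level: str
    availability: str


# A few rows copied (in shortened form) from the survey tables.
CATALOG = [
    DatasetEntry("ArXiv Physics Corpus", "All physics subfields",
                 "Text", "Frontier research", "Open-access"),
    DatasetEntry("Physics Journals (e.g. APS)", "Broad physics research literature",
                 "Text", "Frontier research", "Restricted"),
    DatasetEntry("Feynman Symbolic Regression Dataset", "Classical physics formulas",
                 "Symbolic equations + numeric data", "Undergrad-Graduate", "Open"),
    DatasetEntry("Lattice QCD Configurations", "Quantum Field Theory",
                 "Numeric", "Frontier research", "Partially open"),
    DatasetEntry("CERN Open Data (LHC)", "High-energy physics",
                 "Numerical", "Frontier research", "Open-access"),
    DatasetEntry("PhysQA", "Physics word problems",
                 "Text", "High school", "Open"),
]


def fully_open(entries):
    """Keep entries whose availability starts with 'Open' (excludes 'Partially open')."""
    return [e for e in entries if e.availability.startswith("Open")]


open_names = [e.name for e in fully_open(CATALOG)]
```

With a structure like this, the same catalog can be grouped by `level` or `data_type` to see where coverage is thin.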

These datasets have been used to train a variety of AI models: large language models (using text corpora of physics papers and Q&A), graph neural networks (using simulation or detector data structured as graphs), symbolic regression models (using formula datasets like Feynman), and physics-informed neural networks (using synthetic PDE solution datasets).
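
As a toy illustration of the "formula plus numeric data" setting used for symbolic regression, the sketch below generates samples from the small-angle pendulum law T = 2π√(L/g) and recovers the exponent of L with a least-squares fit in log space. This is only a fixed-form power-law fit, not a full symbolic regression search, and it does not use the Feynman dataset itself. The function names and the choice of law are assumptions.

```python
import math


def pendulum_period(length_m: float, g: float = 9.81) -> float:
    """Small-angle pendulum period T = 2 * pi * sqrt(L / g)."""
    return 2 * math.pi * math.sqrt(length_m / g)


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k in log space; returns (c, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    k = sxy / sxx                      # slope in log space = exponent
    c = math.exp(mean_y - k * mean_x)  # intercept in log space = log(prefactor)
    return c, k


# Noise-free synthetic "measurements": lengths from 0.1 m to 5.0 m.
lengths = [0.1 * i for i in range(1, 51)]
periods = [pendulum_period(L) for L in lengths]

prefactor, exponent = fit_power_law(lengths, periods)
```

On this noise-free data the fitted `exponent` should match the square-root dependence on length, and `prefactor` should match 2π/√g. Real symbolic regression systems search over many candidate functional forms instead of assuming a power law up front.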

Gap Analysis: Missing or Underrepresented Data

Despite the above resources, I believe that several important gaps remain:

Each of these gaps points to an opportunity for new dataset creation.

Bridging the Data Gap in Theoretical Physics

The datasets reviewed here illustrate both the progress already made and the room left for advancing theoretical physics. Filling the identified gaps could catalyze breakthroughs. Just as ImageNet revolutionized computer vision, well-crafted physics datasets could drive transformative developments in physics and AI.

I think the task is clear: physicists and data scientists need to collaborate to create accessible, comprehensive datasets that address these gaps. Such datasets will not only enhance AI’s capability to understand and predict physics but also foster innovation, potentially pushing forward the frontiers of science itself.