Datasets for advancing Theoretical Physics and AI

There is a lack of curated datasets in theoretical physics to train better machine learning models. But what exactly is missing and how can we fill the gaps?

13th April 2025

The recent history of deep learning shows how crucial curated datasets are. For example, Fei-Fei Li and her collaborators reshaped computer vision by creating ImageNet, a large-scale collection of labeled images that is widely credited with helping to spark the deep learning revolution. Similarly, datasets like CIFAR-10 and MNIST have provided foundational benchmarks essential for algorithmic progress.

Despite these advances in machine learning, theoretical physics still lacks comprehensive, standardized datasets. Developing high-quality datasets specifically tailored for theoretical physics could accelerate progress both in AI—by enabling more powerful models—and in physics itself, by establishing common benchmarks for training and evaluating physics-related models.

In this post, I start by surveying existing physics-related datasets by domain, data type, level of content, and availability. I then identify the main gaps and propose new datasets to fill them.

Existing Datasets

Theoretical Physics (Knowledge & Simulations)

This includes textual corpora of theory papers, equation datasets, and simulation data of theoretical models.

Dataset / Source	Domain & Content	Type	Level	Availability
ArXiv Physics Corpus	All physics subfields (theory & experiment) – 1.2M+ research papers (PhysBERT: A Text Embedding Model for Physics Scientific Literature)	Text (papers, PDFs)	Frontier research	Open-access (arXiv)
Physics Journals (e.g. APS)	Broad physics research literature (peer-reviewed journals)	Text (papers)	Frontier research	Restricted (subscription)
Feynman Symbolic Regression Dataset	Classical physics formulas (from Feynman Lectures, etc.) – 100+ laws	Symbolic equations + numeric data	Undergrad–Graduate	Open (research dataset)
Kreuzer–Skarke Calabi–Yau DB	String theory – 473,800,776 reflexive 4D polytopes (Calabi–Yau manifolds) (Group-invariant machine learning on the Kreuzer-Skarke dataset - paywalled version: sciencedirect.com)	Structured (geometric data)	Frontier research	Open (online database)
Lattice QCD Configurations	Quantum Field Theory (lattice QCD) – gauge field samples, correlation functions	Numeric (lattice data)	Frontier research	Partially open (example)
SXS Waveforms (Simulating eXtreme Spacetimes)	General Relativity – Numerical relativity waveforms of binary black holes (SXS Gravitational Waveform Database)	Numerical time-series	Frontier research	Open (public catalog)

Experimental Physics

Datasets from experiments and simulations that test physical theories, often used to train ML models that detect patterns or act as surrogates for experiments.

Dataset / Source	Domain & Content	Type	Level	Availability
CERN Open Data (LHC)	High-energy physics – Petabytes of LHC collision data (ATLAS, CMS, etc.)	Numerical (events, detector readings)	Frontier research	Open-access (portal)
HEP ML Datasets (HIGGS, HEPMASS, etc.)	Particle physics – Simulated collision events labeled as Higgs vs. background	Numerical (tabular features)	Graduate/Research	Open (UCI/Zenodo)
LIGO/Virgo GWOSC	Gravitational waves – Time-series signals from interferometers (event strain data)	Numerical (time-series)	Frontier research	Open (GWOSC portal)
Quantum Optics Experiments	Quantum optics – e.g. single-photon interference, trapped-ion measurements	Numeric (experimental logs, time-series)	Graduate/Research	Limited open (lab repositories, e.g. QDataSet)
Fluid Dynamics/CFD Simulations	Classical mechanics – CFD simulation outputs (e.g. flow fields, turbulence)	Numerical (grids, images)	Graduate/Research	Partially open (benchmarks, e.g. NASA CFD data)
Graph Network Simulations	Multi-body physics – Synthetic trajectories for n-body, fluids, rigid bodies (Physics Simulation With Graph Neural Networks Targeting Mobile, Arm Community blog)	Numeric (graph-based, trajectories)	Undergrad–Graduate	Partially open (code to generate; DeepMind GNS data)

Mathematics for Physics

Datasets of mathematical problems, proofs, and symbolic computations relevant to physics problem-solving and theory.

Dataset / Source	Domain & Content	Type	Level	Availability
MATH Dataset (Hendrycks et al.)	12,500 competition math problems with step-by-step solutions ([2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset)	Text (problem ⇒ solution)	High school (contest)	Open (public dataset)
PhysQA	1,008 physics word problems (mechanics, etc.) with annotated solutions	Text (word problems Q&A)	High school	Open (original: paperswithcode.com)
GPT-4 Physics Q&A (Camel Physics)	20,000 physics problem–solution pairs generated by GPT-4 (camel-ai/physics · Datasets at Hugging Face)	Text (QA, synthetic)	Undergrad–Grad (mixed)	Open (Hugging Face)
Formal Theorem Libraries	Proofs and theorems (Isabelle, Lean, Coq libraries) – e.g. analysis, algebra used in physics	Formal text (logic)	Graduate–Research	Open (MIT/BSD licenses)
Symbolic Integration & ODE Sets	Large sets of integrals and differential equations for symbolic solving (e.g. 27M integration pairs)	Symbolic (expressions)	Undergrad–Grad	Open (research, SIRD dataset)
PINN Benchmark (PINNacle)	20+ distinct physics PDEs (heat eq., Navier-Stokes, etc.) with solution data for PINNs (PINNacle: A Comprehensive Benchmark of Physics-Informed Neural …)	Numerical (PDE solutions)	Undergrad–Grad	Open (benchmark dataset)

Multimodal Physics Data

Combining text, equations, and visuals.

MM-PhyQA (Multimodal Physics QA): High-school physics questions each with multiple related images and diagrams (MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting). Type: Text + images; Level: High school; Availability: Open (research).
Physics StackExchange Q&A: Community Q&A with conceptual explanations (text, some diagrams). Type: Text (informal); Level: Undergraduate+; Availability: Open (CC license).
Laboratory Video/Imagery: E.g. cloud chamber images, astronomical images with annotations. Type: Visual + metadata; Level: Graduate; Availability: Partially open (scattered repositories).

The tables above show that many open-access datasets exist, especially for high-energy physics, mathematical problems, and certain simulations. Commercial or restricted datasets also exist, such as proprietary textbook problem banks, paywalled journal corpora, and private experimental data (e.g. active experimental runs not yet released).

These datasets have been used to train a variety of AI models: large language models (using text corpora of physics papers and Q&A), graph neural networks (using simulation or detector data structured as graphs), symbolic regression models (using formula datasets like Feynman), and physics-informed neural networks (using synthetic PDE solution datasets).
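
To make the symbolic regression case concrete, here is a minimal sketch of how a formula dataset turns into training data: inputs are sampled, a known law is evaluated, and a model then tries to recover the law from the numbers alone. The choice of Newton's law of gravitation, the sampling ranges, and the function names are my own assumptions for illustration, not the actual Feynman dataset pipeline. The "model" is also only a toy: it fits a single exponent with a log-log least-squares fit, whereas real symbolic regression searches over whole expression forms.

```python
import math
import random

G = 6.674e-11  # gravitational constant in SI units


def newton_gravity(m1, m2, r):
    """Known law used to label the samples: F = G * m1 * m2 / r**2."""
    return G * m1 * m2 / r**2


def make_rows(n, seed=0):
    """Sample inputs uniformly (ranges are arbitrary) and return (m1, m2, r, F) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m1 = rng.uniform(1.0, 5.0)
        m2 = rng.uniform(1.0, 5.0)
        r = rng.uniform(1.0, 5.0)
        rows.append((m1, m2, r, newton_gravity(m1, m2, r)))
    return rows


def fit_exponent(rows):
    """Least-squares slope of log(F / (m1 * m2)) against log(r)."""
    xs = [math.log(r) for _, _, r, _ in rows]
    ys = [math.log(f / (m1 * m2)) for m1, m2, _, f in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


rows = make_rows(200)
exponent = fit_exponent(rows)  # recovers the inverse-square dependence on r
```

Because the rows are noise-free, the fitted exponent comes out at -2 up to floating-point error. With measurement noise or an unknown functional form, the problem becomes much harder, which is exactly why benchmark datasets of this kind are useful.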

Gap Analysis: Missing or Underrepresented Data

Despite the above resources, I believe that several important gaps remain:

We lack large, well-annotated datasets of physics problems at advanced graduate level, with step-by-step solutions. Existing collections like MATH or PhysQA cover contests or high-school problems, but few cover the multi-step derivations typical in university physics courses (e.g. electromagnetism, quantum mechanics problem sets) with detailed solutions.
There is an absence of curated datasets of theoretical physics knowledge beyond raw text in papers. For example, there is no database of all important equations/derivations in quantum field theory or general relativity with context, proofs, etc. Similarly, while formal math libraries exist, they rarely cover physics-specific theorems or derivations (e.g. proofs of Noether’s theorem, derivations of field equations…).
Niche but important domains like string theory, quantum gravity, or high-dimensional theoretical constructs are underrepresented in accessible data. For instance, the Kreuzer–Skarke dataset (Calabi–Yau spaces) exists but lacks labels connecting to physical phenomenology.
Physics understanding often requires linking equations, diagrams, and natural language. Few datasets integrate multiple modalities – for example, pairing physics textbook figures or experimental plots with explanatory text and underlying equations. The lack of such unified multimodal datasets means AI struggles with tasks like interpreting a diagram alongside text or deriving equations from experimental graphs.
There is a gap in datasets that directly connect experimental data with theoretical predictions in a structured way. While experimental data (like LHC events or LIGO signals) exist, they are not commonly packaged with the corresponding simulated or theoretical model outputs for the same conditions. This makes it difficult for AI to learn how theory parameters influence data and vice versa. A benchmark that pairs raw experimental data with the expected outcomes from theory (or simulation) is largely missing.
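
As a rough picture of what the last gap is asking for, the sketch below stores each measurement next to the theoretical prediction for the same conditions. Everything in it is an assumption made for illustration: a simple pendulum stands in for a real experiment such as an LHC or LIGO run, the field names are hypothetical, and the two "measured" values are invented placeholders rather than real data.

```python
import math
from dataclasses import dataclass


@dataclass
class PairedRecord:
    """One measurement stored with the theory prediction for the same conditions."""
    length_m: float            # experimental condition: pendulum length
    measured_period_s: float   # value reported by the (here: invented) experiment
    predicted_period_s: float  # value the theoretical model gives for that length

    @property
    def residual_s(self):
        """Measured minus predicted period."""
        return self.measured_period_s - self.predicted_period_s


def small_angle_period(length_m, g=9.81):
    """Small-angle pendulum period, T = 2 * pi * sqrt(L / g)."""
    return 2.0 * math.pi * math.sqrt(length_m / g)


def pair(length_m, measured_period_s):
    """Attach the theory prediction to a measurement taken at the same length."""
    return PairedRecord(length_m, measured_period_s, small_angle_period(length_m))


# Placeholder measurements, invented for illustration only.
records = [pair(0.50, 1.42), pair(1.00, 2.01)]
residuals = [rec.residual_s for rec in records]
```

The point of the layout is that conditions, data, and prediction share one record, so a model can be trained or evaluated on the residuals directly instead of having to reconstruct which simulation belongs to which run.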

Each of these gaps points to an opportunity for new dataset creation.
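
For the first gap (problems with step-by-step solutions), a record in such a dataset might look like the sketch below. The schema is purely hypothetical: the class names, the fields, and the choice of one JSON object per problem are my assumptions, not an existing standard. The example problem is the textbook infinite square well.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SolutionStep:
    explanation: str  # natural-language reasoning for this step
    equation: str     # LaTeX for the equation reached at this step


@dataclass
class ProblemRecord:
    subject: str
    level: str
    statement: str
    steps: list = field(default_factory=list)
    final_answer: str = ""


record = ProblemRecord(
    subject="quantum mechanics",
    level="undergraduate",
    statement=(
        "Find the ground-state energy of a particle of mass m "
        "in an infinite square well of width L."
    ),
    steps=[
        SolutionStep(
            "Inside the well the potential vanishes, so solve the "
            "time-independent Schrodinger equation for a free particle.",
            r"-\frac{\hbar^2}{2m}\psi'' = E\psi",
        ),
        SolutionStep(
            "The wavefunction must vanish at both walls, which quantizes k.",
            r"k_n = \frac{n\pi}{L}",
        ),
        SolutionStep(
            "Insert k_n into E = hbar^2 k^2 / (2m) and set n = 1.",
            r"E_1 = \frac{\pi^2\hbar^2}{2mL^2}",
        ),
    ],
    final_answer=r"E_1 = \frac{\pi^2\hbar^2}{2mL^2}",
)

# One JSON object per problem, so a dataset can be stored as one record per line.
line = json.dumps(asdict(record))
```

Keeping the explanation and the equation as separate fields for each step is a deliberate choice: it lets a model be scored on intermediate reasoning and not only on the final answer.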

Bridging the Data Gap in Theoretical Physics

The datasets reviewed here show both how much progress has been made and how much room remains for advancing theoretical physics. Filling the identified gaps could catalyze breakthroughs. Just as ImageNet revolutionized computer vision, well-crafted physics datasets could drive transformative developments in physics and AI.

I think the task is clear: physicists and data scientists need to collaborate to create accessible, comprehensive datasets that address these gaps. Such datasets will not only enhance AI’s ability to understand and predict physics but also foster innovation, potentially accelerating progress at the frontiers of science itself.