There is a lack of curated datasets in theoretical physics to train better machine learning models. But what exactly is missing and how can we fill the gaps?
13th April 2025
The history of recent developments in deep learning shows the crucial role played by curated datasets. For example, Fei-Fei Li and her collaborators dramatically reshaped computer vision with the creation of ImageNet, a large-scale, labeled image collection. This sparked the start of the deep learning revolution. Similarly, datasets like CIFAR-10 and MNIST have provided foundational benchmarks essential for algorithmic progress.
Despite these advances in machine learning, theoretical physics still lacks comprehensive, standardized datasets. Developing high-quality datasets specifically tailored for theoretical physics could accelerate progress both in AI—by enabling more powerful models—and in physics itself, by establishing common benchmarks for training and evaluating physics-related models.
In this post, I first survey existing physics-related datasets by domain, data type, content level, and availability. I then identify the gaps that remain and propose new datasets to fill them.
The first group covers theoretical physics itself: textual corpora of theory papers, equation datasets, and simulation data of theoretical models.
Dataset / Source | Domain & Content | Type | Level | Availability |
---|---|---|---|---|
ArXiv Physics Corpus | All physics subfields (theory & experiment) – 1.2M+ research papers (PhysBERT: A Text Embedding Model for Physics Scientific Literature) | Text (papers, PDFs) | Frontier research | Open-access (arXiv) |
Physics Journals (e.g. APS) | Broad physics research literature (peer-reviewed journals) | Text (papers) | Frontier research | Restricted (subscription) |
Feynman Symbolic Regression Dataset | Classical physics formulas (from Feynman Lectures, etc.) – 100+ laws | Symbolic equations + numeric data | Undergrad–Graduate | Open (research dataset) |
Kreuzer–Skarke Calabi–Yau DB | String theory – 473,800,776 reflexive 4D polytopes (Calabi–Yau manifolds) (Group-invariant machine learning on the Kreuzer-Skarke dataset) | Structured (geometric data) | Frontier research | Open (online database) |
Lattice QCD Configurations | Quantum Field Theory (lattice QCD) – gauge field samples, correlation functions | Numeric (lattice data) | Frontier research | Partially open (example) |
SXS Waveforms (Simulating eXtreme Spacetimes) | General Relativity – Numerical relativity waveforms of binary black holes (SXS Gravitational Waveform Database) | Numerical time-series | Frontier research | Open (public catalog) |
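
To make the "symbolic equations + numeric data" type in the table concrete, here is a minimal sketch of how one entry of that kind could be built: sample inputs, evaluate a known law, and store the rows. The choice of Newton's law of gravitation, the sampling range, and all function names are illustrative assumptions; this is not the actual generation procedure of the Feynman dataset.

```python
import math
import random

G = 6.674e-11  # gravitational constant in SI units

def newton_gravity(m1, m2, r):
    """Ground-truth law used to label the samples: F = G * m1 * m2 / r^2."""
    return G * m1 * m2 / r**2

def make_samples(n, seed=0):
    """Return n rows (m1, m2, r, F) with inputs drawn uniformly from [1, 5]."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m1, m2, r = (rng.uniform(1.0, 5.0) for _ in range(3))
        rows.append((m1, m2, r, newton_gravity(m1, m2, r)))
    return rows

def fit_r_exponent(rows):
    """Least-squares slope of log(F / (m1 * m2)) against log(r)."""
    xs = [math.log(r) for _, _, r, _ in rows]
    ys = [math.log(f / (m1 * m2)) for m1, m2, _, f in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

rows = make_samples(1000)
exponent = fit_r_exponent(rows)  # should recover the inverse-square exponent
```

A symbolic regression model sees only the numeric rows and has to rediscover the formula. The log-log fit here is just a sanity check that the data encodes the inverse-square dependence on `r`.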
Datasets from experiments and simulations that test physical theories, often used to train ML models to detect patterns or surrogate models for experiments.
Dataset / Source | Domain & Content | Type | Level | Availability |
---|---|---|---|---|
CERN Open Data (LHC) | High-energy physics – Petabytes of LHC collision data (ATLAS, CMS, etc.) | Numerical (events, detector readings) | Frontier research | Open-access (portal) |
HEP ML Datasets (HIGGS, HEPMASS, etc.) | Particle physics – Simulated collision events labeled as Higgs vs. background | Numerical (tabular features) | Graduate/Research | Open (UCI/Zenodo) |
LIGO/Virgo GWOSC | Gravitational waves – Time-series signals from interferometers (event strain data) | Numerical (time-series) | Frontier research | Open (GWOSC portal) |
Quantum Optics Experiments | Quantum optics – e.g. single-photon interference, trapped-ion measurements | Numeric (experimental logs, time-series) | Graduate/Research | Limited open (lab repositories, e.g. QDataSet) |
Fluid Dynamics/CFD Simulations | Classical mechanics – CFD simulation outputs (e.g. flow fields, turbulence) | Numerical (grids, images) | Graduate/Research | Partially open (benchmarks, e.g. NASA CFD data) |
Graph Network Simulations | Multi-body physics – Synthetic trajectories for n-body, fluids, rigid bodies (Physics Simulation With Graph Neural Networks Targeting Mobile, Arm Community blog) | Numeric (graph-based, trajectories) | Undergrad–Graduate | Partially open (code to generate; DeepMind GNS data) |
Datasets of mathematical problems, proofs, and symbolic computations relevant to physics problem-solving and theory.
Dataset / Source | Domain & Content | Type | Level | Availability |
---|---|---|---|---|
MATH Dataset (Hendrycks et al.) | 12,500 competition math problems with step-by-step solutions (Measuring Mathematical Problem Solving With the MATH Dataset, arXiv:2103.03874) | Text (problem ⇒ solution) | High school (competition) | Open (public dataset) |
PhysQA | 1,008 physics word problems (mechanics, etc.) with annotated solutions | Text (word problems Q&A) | High school | Open (research dataset) |
GPT-4 Physics Q&A (Camel Physics) | 20,000 physics problem–solution pairs generated by GPT-4 (camel-ai/physics on Hugging Face) | Text (QA, synthetic) | Undergrad–Grad (mixed) | Open (Hugging Face) |
Formal Theorem Libraries | Proofs and theorems (Isabelle, Lean, Coq libraries) – e.g. analysis, algebra used in physics | Formal text (logic) | Graduate–Research | Open (MIT/BSD licenses) |
Symbolic Integration & ODE Sets | Large sets of integrals and differential equations for symbolic solving (e.g. 27M integration pairs) | Symbolic (expressions) | Undergrad–Grad | Open (research, SIRD dataset) |
PINN Benchmark (PINNacle) | 20+ distinct physics PDEs (heat eq., Navier-Stokes, etc.) with solution data for PINNs (PINNacle: A Comprehensive Benchmark of Physics-Informed Neural …) | Numerical (PDE solutions) | Undergrad–Grad | Open (benchmark dataset) |
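
As an illustration of what "PDE with solution data" means in the last row, the sketch below generates reference data for the 1D heat equation u_t = α u_xx on [0, 1] with zero boundary values, using a plain explicit finite-difference scheme, and compares it with the known analytic solution. The grid size, the initial condition sin(πx), and the function names are assumptions made for this example; this is not PINNacle's data format or pipeline.

```python
import math

def exact_solution(x, t, alpha=1.0):
    """Analytic solution for the initial condition u(x, 0) = sin(pi * x)."""
    return math.exp(-alpha * math.pi**2 * t) * math.sin(math.pi * x)

def solve_heat_1d(nx=21, alpha=1.0, t_final=0.1, cfl=0.4):
    """Explicit (forward-time, centred-space) solver; stable for cfl <= 0.5."""
    dx = 1.0 / (nx - 1)
    # Round the step count up so the actual time step never exceeds the limit.
    steps = math.ceil(t_final * alpha / (cfl * dx * dx))
    dt = t_final / steps
    r = alpha * dt / (dx * dx)
    xs = [i * dx for i in range(nx)]
    u = [math.sin(math.pi * x) for x in xs]
    u[0] = u[-1] = 0.0  # Dirichlet boundary values
    for _ in range(steps):
        new = u[:]  # boundaries are copied unchanged
        for i in range(1, nx - 1):
            new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
        u = new
    return xs, u

xs, u_num = solve_heat_1d()
max_err = max(abs(u - exact_solution(x, 0.1)) for x, u in zip(xs, u_num))
```

A physics-informed network trained on this problem would be scored against `(x, t, u)` values like these. Having an analytic solution makes it possible to separate the network's error from the discretisation error of the reference data.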
Combining text, equations, and visuals.
MM-PhyQA (Multimodal Physics QA): High-school physics questions each with multiple related images and diagrams (MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting). Type: Text + images; Level: High school; Availability: Open (research).
Physics StackExchange Q&A: Community Q&A with conceptual explanations (text, some diagrams). Type: Text (informal); Level: Undergraduate+; Availability: Open (CC license).
Laboratory Video/Imagery: E.g. cloud chamber images, astronomical images with annotations. Type: Visual + metadata; Level: Graduate; Availability: Partially open (scattered repositories).
The tables above show that many open-access datasets exist, especially for high-energy physics, mathematical problems, and certain simulations. Some relevant data remains commercial or restricted, however: proprietary textbook problem banks, paywalled journal corpora, and private experimental data (e.g. active experimental runs not yet released).
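The columns used in the tables (domain, type, level, availability) map naturally onto a small record type, which makes questions like "what is fully open?" easy to answer programmatically. The sketch below is one possible encoding; the class name, field names, and the rule that "fully open" means the availability string starts with "Open" are assumptions for illustration, and only a handful of rows from the tables are included.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetEntry:
    name: str
    domain: str
    data_type: str
    level: str
    availability: str

# A few rows copied from the tables above.
CATALOG = [
    DatasetEntry("ArXiv Physics Corpus", "All physics subfields", "Text",
                 "Frontier research", "Open-access (arXiv)"),
    DatasetEntry("Physics Journals (e.g. APS)", "Broad physics research literature",
                 "Text", "Frontier research", "Restricted (subscription)"),
    DatasetEntry("Lattice QCD Configurations", "Quantum Field Theory", "Numeric",
                 "Frontier research", "Partially open (example)"),
    DatasetEntry("CERN Open Data (LHC)", "High-energy physics", "Numerical",
                 "Frontier research", "Open-access (portal)"),
    DatasetEntry("PhysQA", "Physics word problems", "Text",
                 "High school", "Open (research dataset)"),
]

def fully_open(entries):
    """Keep entries whose availability starts with 'Open' (assumed rule)."""
    return [e for e in entries if e.availability.lower().startswith("open")]

def count_by_level(entries):
    """Count how many entries target each content level."""
    return Counter(e.level for e in entries)

open_names = [e.name for e in fully_open(CATALOG)]
level_counts = count_by_level(CATALOG)
```

Even on this small sample, counting by level shows how heavily the catalogue leans toward frontier research.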
These datasets have been used to train a variety of AI models: large language models (using text corpora of physics papers and Q&A), graph neural networks (using simulation or detector data structured as graphs), symbolic regression models (using formula datasets like Feynman), and physics-informed neural networks (using synthetic PDE solution datasets).
Despite the above resources, I believe that several important gaps remain:
Each of these gaps points to an opportunity for new dataset creation.
The datasets reviewed illustrate both the progress made and the potential for advancing theoretical physics. Filling the identified gaps could catalyze breakthroughs. Just as ImageNet revolutionized computer vision, well-crafted physics datasets could similarly drive transformative developments in physics and AI.
I think the task is clear: physicists and data scientists need to collaborate to create accessible, comprehensive datasets addressing these gaps. Such datasets will not only enhance AI’s capability to understand and predict physics but also foster innovation, potentially pushing forward the frontiers of science itself.