If we want AI models to reason about physics, we first need to give them physics they can actually read.
25th May 2025
We are launching the TheorIA dataset (Theoretical Physics Intelligent Anthology), a growing collection of physics equations, step-by-step derivations and plain-language explanations, written entirely as self-contained JSON. It fills a gap identified in my earlier post Datasets for advancing Theoretical Physics & AI: the absence of curated, machine-readable data that goes beyond raw PDFs.
We are aiming for something robust and high-quality: built-in CI validation, explicit assumptions, programmatic proofs (SymPy), and arXiv-style domain tags to keep every entry reproducible and searchable.
We currently have 15 entries. Many were written with AI and some have already been curated by physicists, but we need hundreds. Your favourite derivation is probably still missing.
All code and content are open source under the CC-BY 4.0 license on GitHub. Pull requests are welcome!
ImageNet rewired computer-vision research. In NLP, the Pile, C4 and friends did the same. Theoretical physics, on the other hand, still asks language models to learn Maxwell’s equations by going through paper PDFs.
TheorIA is an attempt to raise the bar:
Pain point | TheorIA’s answer |
---|---|
Equations are locked inside prose | Each equation is a first-class JSON object, plus symbol table and AsciiMath rendering |
Derivations are opaque | Step-by-step derivations, automatically verified with SymPy |
Reproducibility headaches | CI on GitHub validates every PR against the schema and proofs before merge |
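
To make the idea of a programmatic proof concrete, here is a minimal sketch of what a SymPy check on a derivation can look like, using the Lorentz transformations as the example. It is an illustration only, not the verification code from the repository: the helper `is_identity` and all variable names are assumptions made for this post.

```python
import sympy as sp

# Symbols for one spatial dimension; names are illustrative assumptions.
x, t, v, c = sp.symbols("x t v c", positive=True)

# Lorentz factor and the standard boost along the x axis.
gamma = 1 / sp.sqrt(1 - v**2 / c**2)
x_prime = gamma * (x - v * t)
t_prime = gamma * (t - v * x / c**2)

# Spacetime interval before and after the boost.
interval = c**2 * t**2 - x**2
interval_prime = c**2 * t_prime**2 - x_prime**2


def is_identity(lhs, rhs):
    """Return True if SymPy can simplify lhs - rhs to exactly zero."""
    return sp.simplify(lhs - rhs) == 0


# A correct step passes; a deliberately wrong one does not.
print(is_identity(interval_prime, interval))
print(is_identity(interval_prime, interval + x))
```

A check like this only confirms that two expressions agree algebraically. It says nothing about whether the stated assumptions are physically sensible, which is where expert review comes in.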
Many of the entries you’ll find in TheorIA are currently in draft form, built with the help of AI tooling to bootstrap content at scale. Hence, they often contain typos, notation inconsistencies or even subtle mathematical errors. If they were perfect, this dataset would not be useful for training models.
This is a feature, not a bug: by crowd-sourcing expert review and inviting physicists, mathematicians and educators to correct each derivation, we hope to rapidly turn these drafts into rock-solid reference materials. Your contributions will ensure that TheorIA remains both rigorous and reliable. We will very clearly mark the entries that are still not ready for use.
For now, we have built a simple web viewer for browsing the dataset entry by entry, which makes it easy to spot typos.
The main repository is on GitHub, and you can see an example of a raw entry here, the Lorentz transformations. We also have a comprehensive contributing guide.
If you are not a software developer but want to contribute by correcting or adding an entry, just follow the guidelines, create a JSON file and attach it to an issue in the repo. Remember to add your name and/or ORCID to the entry's author field!
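
If you would like a quick sanity check before opening an issue, something like the following sketch may help. Every field name here is a hypothetical placeholder chosen for illustration, as are the author name and ORCID; the contributing guide defines the real schema, and the CI remains the authoritative check.

```python
import json

# Hypothetical field names, for illustration only; the real schema is
# defined in the contributing guide.
REQUIRED_FIELDS = ("title", "equations", "derivation", "assumptions", "domain", "author")

entry = {
    "title": "Lorentz factor",
    "equations": [{"id": "eq1", "asciimath": "gamma = 1/sqrt(1 - v^2/c^2)"}],
    "derivation": [{"step": 1, "asciimath": "gamma^2 (1 - v^2/c^2) = 1"}],
    "assumptions": ["Inertial frames in relative motion along one axis"],
    "domain": "physics.class-ph",
    # Placeholder name and ORCID, not a real person.
    "author": {"name": "Jane Doe", "orcid": "0000-0000-0000-0000"},
}


def missing_fields(candidate):
    """List required fields that are absent or empty in the candidate entry."""
    return [field for field in REQUIRED_FIELDS if not candidate.get(field)]


# Serialize the entry the way it would be pasted into an issue,
# then parse it back to confirm it is valid JSON.
serialized = json.dumps(entry, indent=2, ensure_ascii=False)
roundtrip = json.loads(serialized)

print(missing_fields(roundtrip))
```

An empty list means none of the fields in this illustrative set is missing. It does not mean the entry would pass the repository's own validation.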
The project is still early, and it requires significant work to make it useful and meaningful. You can follow the status in the TheorIA project board.
The general steps we have in mind are:
We’re especially looking for:
I hope that TheorIA will graduate from “neat GitHub repo” to a reference for physicists, educators and AI researchers. Join us in turning raw drafts into high-quality derivations, and let’s build the data foundation that physics and AI have been waiting for.