Launching TheorIA: A Machine-Readable Atlas of Theoretical Physics

If we want AI models to reason about physics, we first need to give them physics they can actually read.

25th May 2025

We are launching the TheorIA dataset (Theoretical Physics Intelligent Anthology), a growing collection of physics equations, step-by-step derivations and plain-language explanations, all written as self-contained JSON. It fills a gap identified in my earlier post Datasets for advancing Theoretical Physics & AI: the absence of curated, machine-readable data that goes beyond raw PDFs.

We are trying to build something robust and high-quality: built-in CI validation, explicit assumptions, programmatic proofs (SymPy), and arXiv-style domain tags keep every entry reproducible and searchable.
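To make "self-contained JSON" more concrete, here is a minimal sketch of what a single entry could look like. Every field name below (`result_name`, `domain`, `equations`, `definitions`, `assumptions`, `derivation`, `contributors`) is an assumption made for illustration and may differ from the actual schema in the repository.

```python
import json

# Hypothetical entry layout: all field names are illustrative assumptions,
# not the authoritative TheorIA schema.
entry = {
    "result_name": "Lorentz factor",
    "domain": "physics.class-ph",  # arXiv-style domain tag
    "equations": [
        {"id": "eq1", "asciimath": "gamma = 1/sqrt(1 - v^2/c^2)"},
    ],
    "definitions": [
        {"symbol": "v", "definition": "Relative speed between two inertial frames"},
        {"symbol": "c", "definition": "Speed of light in vacuum"},
    ],
    "assumptions": ["v < c"],
    "derivation": [
        {"step": 1, "asciimath": "gamma^2 (1 - v^2/c^2) = 1"},
    ],
    "contributors": [{"full_name": "Jane Doe", "identifier": "ORCID goes here"}],
}

# The entry is plain data, so it serialises to JSON and back without loss.
text = json.dumps(entry, indent=2)
assert json.loads(text) == entry
print(text)
```

Keeping equations, symbols and assumptions as separate fields is what would let a script or a model consume an entry without parsing prose.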

We currently have 15 entries. Many were written with AI and some have already been curated by physicists, but we need hundreds. Your favourite derivation is probably still missing.

All code and content is open source under the CC-BY 4.0 license on GitHub. Pull requests are welcome!

Why bother?

ImageNet rewired computer-vision research. In NLP, the Pile, C4 and friends did the same. Theoretical physics, on the other hand, still asks language models to learn Maxwell’s equations by going through paper PDFs.

TheorIA is an attempt to raise the bar:

Pain point | TheorIA’s answer
Equations are locked inside prose | Each equation is a first-class JSON object, with a symbol table and AsciiMath rendering
Derivations are opaque | Straightforward step-by-step derivations, automatically verified with SymPy
Reproducibility headaches | CI on GitHub validates all PRs against the schema and the proofs before merge
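As an illustration of what automated verification with SymPy can mean in practice, the sketch below checks that a Lorentz boost along x leaves the interval c^2 t^2 - x^2 unchanged. It is an example written for this post, under the assumption that a proof is a script whose assertions must pass; it is not copied from a TheorIA entry.

```python
import sympy as sp

# Coordinates in the unprimed frame, plus the boost speed and speed of light.
x, t = sp.symbols("x t", real=True)
v, c = sp.symbols("v c", positive=True)

# Lorentz boost along the x axis.
gamma = 1 / sp.sqrt(1 - v**2 / c**2)
x_prime = gamma * (x - v * t)
t_prime = gamma * (t - v * x / c**2)

# Spacetime interval before and after the boost.
interval = c**2 * t**2 - x**2
interval_prime = c**2 * t_prime**2 - x_prime**2

# The derivation is accepted only if the difference simplifies to zero.
difference = sp.simplify(interval_prime - interval)
assert difference == 0
```

A check of this kind fails loudly when a sign or a factor of c is wrong, which is exactly the class of subtle error that is hard to catch by eye.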

TheorIA is a work in progress

Many of the entries you’ll find in TheorIA are currently in draft form, built with the help of AI tooling to bootstrap content at scale. Hence, they often contain typos, notation inconsistencies or even subtle mathematical errors. If they were perfect, this dataset would not be useful for training models.

This is a feature, not a bug: by crowd-sourcing expert review and inviting physicists, mathematicians and educators to correct each derivation, we hope to rapidly turn these drafts into rock-solid reference materials. Your contributions will ensure that TheorIA remains both rigorous and reliable. We will very clearly mark the entries that are still not ready for use.

A quick tour

For now, we have built a simple web viewer for browsing the dataset entry by entry, which makes it easy to spot typos.

The main repository is on GitHub, and you can see an example of a raw entry here, the Lorentz transformations. We also have a comprehensive contributing guide.

If you are not a software developer but want to contribute by correcting or adding an entry, just follow the guidelines, create a JSON file and attach it to an issue in the repo. Remember to add your name and/or ORCID to the entry's author field!
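Before opening an issue, it can help to confirm that a draft is at least well-formed JSON and has the main sections filled in. The sketch below does that with the standard library only. The list of required fields is a hypothetical stand-in, and the real checks are whatever the repository's schema and CI define.

```python
import json

# Hypothetical required top-level fields; the authoritative list lives in the
# repository's schema and contributing guide.
REQUIRED_FIELDS = ["result_name", "equations", "assumptions", "derivation", "contributors"]


def missing_fields(entry_text: str) -> list[str]:
    """Parse a JSON entry and return the required fields that are absent or empty."""
    entry = json.loads(entry_text)  # raises json.JSONDecodeError on malformed JSON
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


draft = '{"result_name": "Lorentz factor", "equations": [{"id": "eq1"}], "assumptions": []}'
print(missing_fields(draft))
```

An empty list means the draft passes this rough check. It says nothing about whether the physics is right, which is where expert review comes in.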

Roadmap

The project is still early, and it requires significant work to make it useful and meaningful. You can follow the status in the TheorIA project board.

The general steps we have in mind are:

  1. Build some critical mass: Have 100 AI-generated entries (judging from the ones generated already, I expect them to contain many errors) and at least 20 curated by physicists.
  2. Test LLM performance on the curated examples and compare the models' outputs.
  3. Reduce contributor friction: Give users an easy way to modify or add entries to the dataset from a user-friendly web interface.
  4. Automate output formats. Provide “one-click” scripts (JSON→LaTeX/Markdown/API) so adopters can plug TheorIA into documentation, teaching materials or model workflows without a learning curve.
  5. Once we have enough entries, deliver a demo. Fine-tune an LLM on the dataset and publicly compare its derivation-explanation quality against baselines.
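To give a feel for the JSON-to-Markdown conversion mentioned in step 4, here is a minimal sketch of such an export. The function name `entry_to_markdown` and the entry fields it reads are assumptions made for illustration. No such script is claimed to exist in the repository yet.

```python
def entry_to_markdown(entry: dict) -> str:
    """Render a (hypothetical) entry dict as a small Markdown fragment."""
    lines = [f"## {entry['result_name']}", ""]
    lines.append("Equations:")
    for eq in entry["equations"]:
        lines.append(f"- {eq['id']}: `{eq['asciimath']}`")
    lines.append("")
    lines.append("Assumptions:")
    for assumption in entry["assumptions"]:
        lines.append(f"- {assumption}")
    return "\n".join(lines)


# Illustrative entry using assumed field names.
example = {
    "result_name": "Lorentz factor",
    "equations": [{"id": "eq1", "asciimath": "gamma = 1/sqrt(1 - v^2/c^2)"}],
    "assumptions": ["v < c"],
}
print(entry_to_markdown(example))
```

Because entries are structured data, other targets such as LaTeX would only need a different rendering function over the same fields.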

Call for collaborators

We’re especially looking for:

Final thoughts

I hope that TheorIA will graduate from “neat GitHub repo” to a reference for physicists, educators and AI researchers. Join us in turning raw drafts into high-quality derivations, and let’s build the data foundation that physics and AI have been waiting for.