
Case study

TheoremPath: a graph-backed ML theory learning system.

Last reviewed: May 3, 2026

ML theory is a dependency structure, not a flat list of topics. TheoremPath turns that structure into typed data: a prerequisite graph, claim-level source records, Lean-verified theorem wrappers, diagnostic skill mappings, and browser-native labs. The same objects serve the article, the diagnostic, the source trail, and the proof record.

Proof points

Counts are read from the build artifacts: data/graph/snapshot.json, data/audits/question-claims.json, and data/verification/lean-manifest.json.

Topics in the curriculum graph: 600
Prerequisite edges between topics: 4,395
Question-bank claim records: 16,096
Lean-verified theorem wrappers: 37
Recorded sorry / admit markers: 0

The 597 published topic pages, 600 graph nodes, and 616 MDX files in the repository each measure a different scope: published-on-site, registered-in-the-graph, and on-disk-including-drafts. The numbers in this strip use the graph-node count because the graph is the object the system is built around.

The six primitives

The site is six primitive object types and the relationships between them. Everything else — pages, diagnostics, evidence badges, study paths — is a view over those primitives.

Topic
A node in the prerequisite graph. Has slug, layer, tier, role, prerequisites, next topics, key theorems.
Claim
An auditable statement on a topic page. Has assumptions, failure modes, source roles, and (sometimes) a Lean mapping.
Source role
What a citation does for the claim — canonical, current, frontier, critique. Not all references count as canonical support.
Diagnostic item
A multiple-choice question linked to the skills it tests. Wrong answers route to specific prerequisite pages, not generic remediation.
Lean mapping
A theorem-declaration name plus the exact proof scope the wrapper verifies. Marked false if the page-level claim is broader than the wrapped statement.
Lab
A browser-native demo with a handwritten implementation: induction-heads, optimizer race, attention dynamics, diffusion, factor graphs, and others.

System design

Five engineering decisions shape the rest of the system. Each has a corresponding script, schema, or runtime path that anyone can read.

  1. Content as data

    MDX with strict frontmatter, validated by Zod at build time

    Topic pages are MDX with typed frontmatter (slug, prerequisites, next topics, layer, tier, role, key theorems, must-know-cold). The build refuses any page whose frontmatter does not match the schema, so the graph cannot drift behind the prose. A minimal schema sketch follows this list.

  2. Graph at build time

    Single snapshot generated from the MDX corpus

    scripts/build-graph-snapshot.ts walks the topics, extracts prerequisite edges, computes the layer DAG, and writes data/graph/snapshot.json. All prerequisite traversals at runtime read this snapshot — there is no database call on the hot path. A traversal sketch over the snapshot follows this list.

  3. Diagnostics as data

    Skills, Q-matrix, and gold sets are versioned files

    Diagnostic items live in YAML with an ID, mapped skills, mapped topics, and a stable content hash. The hash is regenerated on commit and a CI check rejects PRs whose hash artifacts drift from source. A sketch of that drift check follows this list.

  4. Evidence as records

    Per-claim source roles and Lean mappings

    A claim record carries its statement, assumptions, failure modes, source roles, and (where it exists) a Lean theorem name plus scope. The site renders different evidence badges depending on which combination is present.

  5. Labs as code

    Pure TypeScript, no tensor library, Web Worker for compute

    Labs implement training loops by hand with finite-difference gradient checks. Heavy work runs in a Web Worker so the main thread stays responsive. No GPU, no Python runtime — the browser is the only environment. A sketch of such a gradient check follows this list.
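
The sketches below illustrate decisions 01, 02, 03, and 05 under stated assumptions; they show the technique each decision names, not the repository's code. First, decision 01: a minimal Zod frontmatter schema. The field names follow the list above, but the tier values and exact shapes are guesses.

```ts
// Hypothetical Zod schema for topic-page frontmatter (decision 01).
// Field names mirror the list above; enum values and shapes are assumptions.
import { z } from "zod";

export const topicFrontmatter = z.object({
  slug: z.string().regex(/^[a-z0-9-]+$/),
  layer: z.number().int().nonnegative(),
  tier: z.enum(["core", "elective", "frontier"]), // assumed tier names
  role: z.string(),
  prerequisites: z.array(z.string()).default([]), // slugs of prerequisite topics
  nextTopics: z.array(z.string()).default([]),
  keyTheorems: z.array(z.string()).default([]),
  mustKnowCold: z.boolean().default(false),
});

export type TopicFrontmatter = z.infer<typeof topicFrontmatter>;

// At build time, parse() throws on any page whose frontmatter drifts,
// which is what "the build refuses the page" means in practice.
export function validateFrontmatter(raw: unknown): TopicFrontmatter {
  return topicFrontmatter.parse(raw);
}
```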
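
For decision 02, a sketch of the runtime read path over the snapshot. The node and edge shape is assumed; only the file path data/graph/snapshot.json comes from the text.

```ts
// Hypothetical read path for data/graph/snapshot.json (decision 02).
// Assumes a bundler or tsconfig with resolveJsonModule for the JSON import.
import snapshot from "../data/graph/snapshot.json";

interface GraphSnapshot {
  nodes: { slug: string; layer: number }[];
  edges: { from: string; to: string }[]; // "from" is a prerequisite of "to"
}

const { edges } = snapshot as unknown as GraphSnapshot;

// Build an in-memory prerequisite index once; no database call on the hot path.
const prereqsOf = new Map<string, string[]>();
for (const { from, to } of edges) {
  const list = prereqsOf.get(to) ?? [];
  list.push(from);
  prereqsOf.set(to, list);
}

// Transitive prerequisites of a topic; the graph is a DAG, so a visited-set walk terminates.
export function allPrerequisites(slug: string): string[] {
  const seen = new Set<string>();
  const stack = [...(prereqsOf.get(slug) ?? [])];
  while (stack.length > 0) {
    const next = stack.pop()!;
    if (seen.has(next)) continue;
    seen.add(next);
    stack.push(...(prereqsOf.get(next) ?? []));
  }
  return [...seen];
}
```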
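
For decision 03, a sketch of what the drift check could look like, assuming the stable content hash is SHA-256 over a normalized serialization of the item. The repository's actual field names, normalization, and hash algorithm are not specified above and may differ.

```ts
// Hypothetical drift check for diagnostic items (decision 03).
import { createHash } from "node:crypto";

interface DiagnosticItem {
  id: string;
  skills: string[];
  topics: string[];
  stem: string;
  choices: string[];
  contentHash: string; // committed alongside the item
}

function computeContentHash(item: Omit<DiagnosticItem, "contentHash">): string {
  // Sort keys so formatting-only edits do not change the hash.
  const normalized = JSON.stringify(item, Object.keys(item).sort());
  return createHash("sha256").update(normalized).digest("hex");
}

// CI-style check: fail when the committed hash no longer matches the content.
export function assertNoDrift(item: DiagnosticItem): void {
  const { contentHash, ...rest } = item;
  const actual = computeContentHash(rest);
  if (actual !== contentHash) {
    throw new Error(
      `content hash drift for ${item.id}: expected ${contentHash}, got ${actual}`,
    );
  }
}
```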
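
For decision 05, a generic finite-difference gradient check of the kind the labs rely on. The labs' own check code is not reproduced here; this is a minimal central-difference version for a scalar loss.

```ts
// Hypothetical finite-difference gradient check (decision 05).
// Verifies an analytic gradient against central differences on a scalar loss.
type Loss = (params: Float64Array) => number;
type Grad = (params: Float64Array) => Float64Array;

export function gradCheck(
  loss: Loss,
  grad: Grad,
  params: Float64Array,
  eps = 1e-5,
  tol = 1e-4,
): boolean {
  const analytic = grad(params);
  for (let i = 0; i < params.length; i++) {
    const saved = params[i];
    params[i] = saved + eps;
    const up = loss(params);
    params[i] = saved - eps;
    const down = loss(params);
    params[i] = saved; // restore the parameter before moving on
    const numeric = (up - down) / (2 * eps);
    const denom = Math.max(1, Math.abs(numeric), Math.abs(analytic[i]));
    if (Math.abs(numeric - analytic[i]) / denom > tol) return false;
  }
  return true;
}

// Example: loss(w) = 0.5 * ||w||^2 has gradient w.
const ok = gradCheck(
  (w) => 0.5 * w.reduce((s, x) => s + x * x, 0),
  (w) => w.slice(),
  new Float64Array([0.3, -1.2, 2.0]),
);
console.log(ok); // true
```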

Verification boundary

"Verified" is not one thing. The audit page splits question-bank surfaces into machine-checkable and human-reviewed, and the evidence page splits page-level claims by which kind of evidence supports them. This page surfaces the same split.

Lean-verified

37 wrappers across probability, statistics, learning theory, optimization, linear algebra, and calculus.

Each wrapper names an exact theorem declaration and a proof scope. The full page-level claim is not always identical to the wrapped statement — when the gap matters it is marked fullPageClaimVerified: false in the manifest.
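
As a rough illustration of what one manifest entry might record, a hypothetical TypeScript shape follows. Only fullPageClaimVerified is named above; every other field name is an assumption.

```ts
// Hypothetical shape of one entry in data/verification/lean-manifest.json.
// Only fullPageClaimVerified is taken from the text; the other fields are
// assumptions about what a wrapper record would need to carry.
interface LeanWrapperEntry {
  theoremName: string;            // exact Lean declaration the wrapper proves
  topicSlug: string;              // topic page the claim lives on
  claimId: string;                // claim record the wrapper is attached to
  proofScope: string;             // statement of exactly what is verified
  fullPageClaimVerified: boolean; // false when the page claim is broader than the wrapper
  sorryCount: number;             // expected to be 0 per the counts above
}
```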

Computation-verified

1,192 question-bank surfaces with closed-form answers checked by a small evaluator.

Reserved for compute-type questions where the expected answer is a number or a closed-form expression. Catches arithmetic errors mechanically.

Source-reviewed

Most published topic pages.

References cite specific chapters or papers, with explicit role tags (canonical, current, frontier, critique). Source-reviewed is real review; it is not Lean evidence.

Pending audit annotation

14,904 of 16,096 question-bank surfaces.

These surfaces are author-written, with mapped topics and skills but no separate verification annotation. Most are too lightweight to need one (multiple-choice stems, distractor explanations). The audit page reports the exact split.

See the public audit page for the full breakdown by claim surface, and the evidence page for representative Lean wrappers and the rules they follow.

Tradeoffs

Some things were deliberately not built. Each decision below has a reason; the system is more legible because of what it leaves out.

Why not formalize every theorem in Lean?
Lean evidence is one mode of verification, not the universal standard. Empirical claims about training behavior, historical claims about model releases, and frontier-model claims do not have Lean analogues. Forcing every page through Lean would either fail or weaken the evidence concept.
Why not a database for content?
Public content is files. Static export is fast and inspectable from git history. The database is reserved for user-bound state — quiz history, mastery estimates, review schedules — that genuinely needs writes.
Why graph traversal instead of a recommender?
A learned recommender would optimize for engagement on a corpus where engagement is not the goal. The prerequisite graph is the editorial product; routing follows the graph deterministically and is therefore inspectable.
Why no chatbot on every page?
A generic Ask-AI box that ignores the graph would be table stakes and would also dilute the trust the audit / Lean / source-grounding layer is built to earn. A graph-native tutor that cites topic IDs, claim IDs, and prerequisite paths is a separate, narrower problem worth doing well.

Capabilities the system demonstrates

  • Knowledge-graph data modeling — 600 nodes, 4,395 typed edges, layered DAG, prerequisite traversal API.
  • Content pipelines — Zod-validated MDX, build-time graph snapshot, content-hash artifacts, schema-strict frontmatter.
  • Assessment design — Q-matrix linking skills to items, gold sets with per-set difficulty curves, length-bias gates on choice text.
  • Formal verification integration — Lean theorem wrappers with named declarations and proof-scope metadata, manifest of axioms / sorry / admit counts.
  • Browser-native ML labs — handwritten transformer training, optimizer dynamics, diffusion sampling, attention visualizations, all in TypeScript without a tensor library.
  • Operational hygiene — pre-push hooks that catch audit-graph drift and currency-brace MDX bugs, CI gates on lint / typecheck / test, nightly digest of merged PRs and stale artifacts.

What is queued next

The static-content base is largely settled. The next three phases turn the graph and the diagnostics into an adaptive loop instead of a passive atlas.

  1. Phase 1: Graph-native tutor over existing content

    Add /api/tutor/message and a tutor drawer on topic pages. Constrain answers to topic IDs, claim IDs, and prerequisite paths. Diagnose-then-explain rather than chat-then-pray. An illustrative request and response sketch follows this list.

  2. Phase 2: Per-learner skill state and routing

    Mirror the graph and diagnostic data into a queryable store, attach per-user mastery estimates and review schedules, and route study suggestions from the same data the tutor uses.

  3. Phase 3: Companion app for habit-grade review

    Daily diagnostic, review cards, target-path progress, offline reading queue. The phone is the right surface for daily practice; the web stays the canonical content surface.
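
Phase 1 is still queued, so the sketch below is purely illustrative: a hypothetical request and response contract for /api/tutor/message that keeps each answer grounded in topic IDs, claim IDs, and prerequisite paths.

```ts
// Hypothetical contract for the planned /api/tutor/message endpoint.
// Only the route and the grounding idea (topic IDs, claim IDs, prerequisite
// paths) come from the text above; every name here is illustrative.
interface TutorRequest {
  topicSlug: string; // page the learner is asking from
  message: string;
}

interface TutorCitation {
  kind: "topic" | "claim" | "prerequisite-path";
  id: string; // topic slug, claim ID, or path identifier
}

interface TutorResponse {
  answer: string;
  citations: TutorCitation[];       // every answer cites graph objects
  suggestedPrerequisites: string[]; // slugs to review before continuing
}
```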

Read on