Why This Matters
Testing can show the presence of bugs but never their absence. For safety-critical systems (aircraft control, medical devices, cryptographic protocols, financial infrastructure), "it worked on the test cases" is not sufficient. Formal verification provides mathematical proof that a system satisfies its specification for ALL possible inputs, not just the ones you tested.
The connection to ML: as AI systems are deployed in high-stakes domains, the question shifts from "does it work on the benchmark?" to "can we prove it satisfies certain properties?" Verified neural network properties (robustness certificates, output bounds), formal specifications for RL agents, and machine-checked proofs of mathematical theorems are active research areas where formal methods intersect with ML.
Proof assistants (Lean, Coq, Isabelle) also matter for mathematics itself. The Kepler conjecture was proved with computer-assisted verification (Hales, Flyspeck project). Large language models are now being trained to generate formal proofs, with Lean 4 as the primary target language. Understanding what formal verification can and cannot do is necessary context for this line of research.
Prerequisites
This page assumes familiarity with basic logic and proof techniques (propositional logic, predicate logic, induction) and type theory (types as propositions, the Curry-Howard correspondence).
Two Paradigms: Model Checking and Theorem Proving
Model Checking
Model checking is an automated verification technique that exhaustively explores the state space of a finite-state system to check whether a property (expressed in temporal logic) holds. Given a model $M$ (a finite-state transition system) and a specification $\varphi$ (a temporal logic formula), the model checker determines whether $M \models \varphi$ (the model satisfies the specification).
The inputs are:
- A model: a finite-state transition system $M = (S, S_0, R, L)$ with states $S$, initial states $S_0 \subseteq S$, transition relation $R \subseteq S \times S$, and labeling function $L : S \to 2^{AP}$ mapping states to sets of atomic propositions
- A specification: a formula $\varphi$ in temporal logic (CTL, LTL, CTL*)
The output is either "yes" (with a proof) or "no" (with a counterexample trace).
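To make these inputs concrete, here is a minimal explicit-state safety check in Python. The two-process mutex model and its entry guard are made up for illustration; a real model checker would of course handle far richer models and specifications.

```python
from collections import deque

# Toy transition system (hypothetical two-process mutex): a state is a
# pair of program counters; "crit" means the process is in its critical
# section. Safety property to check: AG !(both in "crit").
initial = [("idle", "idle")]

def step(pc):  # one process's moves: idle -> wait -> crit -> idle
    return {"idle": "wait", "wait": "crit", "crit": "idle"}[pc]

def successors(s):
    a, b = s
    # interleaving semantics: one process moves per step, and a process
    # may enter "crit" only if the other process is not already there
    out = []
    na = step(a)
    if not (na == "crit" and b == "crit"):
        out.append((na, b))
    nb = step(b)
    if not (nb == "crit" and a == "crit"):
        out.append((a, nb))
    return out

def check_safety(bad):
    """Explicit-state check of AG !bad: BFS over all reachable states."""
    seen, frontier = set(initial), deque(initial)
    while frontier:
        s = frontier.popleft()
        if bad(s):
            return False, s  # counterexample state (witness trace omitted)
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True, None

ok, witness = check_safety(lambda s: s == ("crit", "crit"))
print(ok)  # True: the guard makes joint entry unreachable
```

The BFS visits each reachable state once, which is exactly the "linear in the model" cost discussed below; the hard part in practice is that the reachable set can be astronomically large.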
Theorem Proving (Interactive)
Interactive theorem proving uses a proof assistant (Lean, Coq, Isabelle/HOL) to construct machine-checked proofs of mathematical statements. The human provides the proof strategy; the proof assistant checks every logical step. The result is a formal proof object that can be independently verified by a small trusted kernel.
Unlike model checking, theorem proving handles infinite-state systems and arbitrary mathematical statements. The cost: proofs require significant human effort (or, increasingly, AI assistance).
Temporal Logic (CTL and LTL)
Temporal logic extends propositional logic with operators that express properties over time.

LTL (Linear Temporal Logic): reasons about single execution paths.
- $\mathbf{G}\,p$: "always $p$" (safety: something bad never happens)
- $\mathbf{F}\,p$: "eventually $p$" (liveness: something good eventually happens)
- $p \mathbin{\mathbf{U}} q$: "$p$ holds until $q$ holds"

CTL (Computation Tree Logic): reasons about branching computation trees.
- $\mathbf{AG}\,p$: "for all paths, always $p$"
- $\mathbf{EF}\,p$: "there exists a path where eventually $p$"
- $\mathbf{AF}\,p$: "for all paths, eventually $p$"

LTL specifications express properties of traces. CTL specifications express properties of the full computation tree.
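As a concrete illustration, the path operators can be evaluated over a finite trace in Python. Note the assumption: standard LTL semantics is over infinite traces; finite-trace evaluation (as used in runtime verification) is an approximation.

```python
# Finite-trace evaluation of the LTL path operators G, F, and U.
# Caveat: true LTL is defined over infinite traces; this finite-trace
# reading matches the runtime-verification convention.

def always(trace, p):          # G p: p holds at every position
    return all(p(s) for s in trace)

def eventually(trace, p):      # F p: p holds at some position
    return any(p(s) for s in trace)

def until(trace, p, q):        # p U q: q eventually holds, p holds until then
    for s in trace:
        if q(s):
            return True
        if not p(s):
            return False
    return False               # q never held within the trace

trace = ["req", "req", "grant", "idle"]
print(always(trace, lambda s: s != "error"))                        # True
print(until(trace, lambda s: s == "req", lambda s: s == "grant"))   # True
```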
The Curry-Howard Correspondence
The bridge between logic and programming. Under the Curry-Howard correspondence:
| Logic | Programming |
|---|---|
| Propositions | Types |
| Proofs | Programs (terms) |
| Implication | Function type |
| Conjunction | Product type |
| Disjunction | Sum type |
| Universal | Dependent function |
| Existential | Dependent pair |
A proof of $P \to Q$ is a function that takes a proof of $P$ and produces a proof of $Q$. This is not a metaphor: in Lean 4, the statement `theorem foo : P → Q` declares a function from proofs of `P` to proofs of `Q`. Writing the function body IS the proof.
This correspondence is the foundation of all modern proof assistants. It means that type checking (which compilers do automatically) is proof checking. A well-typed program in a dependently typed language is a verified proof.
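A minimal Lean 4 sketch of the correspondence (the names `foo`, `h`, `hp` are illustrative):

```lean
-- A proof of P → Q is literally a function from proofs of P to proofs of Q:
theorem foo (P Q : Prop) (h : P → Q) (hp : P) : Q :=
  h hp          -- function application = modus ponens

-- Conjunction is a product type: a proof of P ∧ Q is a pair,
-- and we can project out and reassemble its components.
example (P Q : Prop) (h : P ∧ Q) : Q ∧ P :=
  ⟨h.right, h.left⟩
```

The Lean kernel accepts these declarations by type checking them, which is exactly proof checking.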
Main Theorems
Rice's Theorem (Verification is Generally Undecidable)
Statement
Let $P$ be any nontrivial semantic property of programs (i.e., a property of the function computed by the program, not of the program text). Then it is undecidable whether an arbitrary program satisfies $P$.
Formally: let $\mathcal{F}$ be a nontrivial set of partial recursive functions. Then the set $\{e : \varphi_e \in \mathcal{F}\}$ is not recursive, where $\varphi_e$ is the function computed by program $e$.
Intuition
If you could decide any nontrivial property of programs, you could in particular decide whether a program halts (by checking the property "computes a total function"). Since the halting problem is undecidable, no nontrivial property can be decidable in general.
This does not mean verification is impossible. It means there is no single algorithm that correctly verifies ALL programs. For specific programs with specific structures (finite-state systems, programs with bounded loops, programs that terminate by construction), verification is often feasible.
Proof Sketch
By reduction from the halting problem. Given a Turing machine $M$ and input $w$, construct a program $p$ such that $\varphi_p \in \mathcal{F}$ if and only if $M$ halts on $w$. Specifically: assume without loss of generality that the everywhere-undefined function is not in $\mathcal{F}$ (otherwise argue with the complement of $\mathcal{F}$), and pick some $g \in \mathcal{F}$. Define $p$ to first simulate $M$ on $w$; if the simulation halts, $p$ computes $g$; otherwise $p$ diverges. If we could decide membership in $\{e : \varphi_e \in \mathcal{F}\}$, we could decide the halting problem. Contradiction.
Why It Matters
Rice's theorem sets the fundamental limits of verification. It explains why we need both automated tools (model checking for finite-state systems) and human-guided proof (theorem proving for infinite-state systems). It also explains why verification tools use abstractions, approximations, and restricted program models: they sidestep the undecidability by working with subsets of programs where the property IS decidable.
For ML: verifying that a ReLU network satisfies a property (e.g., robustness to perturbations) is NP-complete in general (Katz et al., 2017). Practical verifiers trade completeness for scalability and target specific architectures with bounded depth and width.
Failure Mode
The theorem applies to Turing-complete languages. For restricted computational models (finite automata, pushdown automata, bounded programs), many properties ARE decidable. Model checking exploits this: by restricting to finite-state systems, exhaustive state exploration becomes possible (though potentially expensive).
CTL Model Checking Complexity
Statement
The CTL model checking problem (given $M$ and $\varphi$, does $M \models \varphi$?) can be solved in time $O(|\varphi| \cdot (|S| + |R|))$. That is, model checking is linear in the size of the specification and linear in the size of the model.
For LTL model checking, the complexity is $O((|S| + |R|) \cdot 2^{O(|\varphi|)})$: linear in the model size but exponential in the specification length.
Intuition
CTL model checking works by labeling states bottom-up through the formula structure. For each subformula, you mark which states satisfy it. The fixpoint computations (for operators like $\mathbf{EU}$ and $\mathbf{EG}$) require traversing the transition relation, taking $O(|S| + |R|)$ per subformula. The total work is proportional to the number of subformulas times the model size.
LTL model checking is harder because LTL specifications can express properties that require tracking the full execution path, not just the current state. The specification is compiled into a Büchi automaton (potentially exponentially larger), and the model is composed with this automaton.
Proof Sketch
For CTL: by structural induction on $\varphi$. Atomic propositions are checked in $O(|S|)$ using the labeling function. Boolean connectives are checked in $O(|S|)$. The temporal operators $\mathbf{EX}$, $\mathbf{EU}$, and $\mathbf{EG}$ require fixpoint computation. $\mathbf{EX}\,p$ labels all states with a successor satisfying $p$: one pass over $R$, costing $O(|R|)$. $\mathbf{E}[p \mathbin{\mathbf{U}} q]$ uses a backward reachability computation from states satisfying $q$, costing $O(|S| + |R|)$. $\mathbf{EG}\,p$ uses a greatest fixpoint computation on the subgraph of states satisfying $p$, also $O(|S| + |R|)$.
The full proof is in Clarke, Grumberg, and Peled, Model Checking, Chapter 4.
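The backward-reachability step for $\mathbf{E}[p \mathbin{\mathbf{U}} q]$ can be sketched in a few lines of Python, assuming the transition relation is given explicitly as a set of pairs (a simplification: real checkers use symbolic representations):

```python
# Labeling step for E[p U q]: backward reachability from the q-states
# along p-states. `R` is the transition relation as (source, target)
# pairs; `sat_p` and `sat_q` are the state sets already labeled with
# the subformulas p and q by earlier induction steps.

def check_EU(states, R, sat_p, sat_q):
    preds = {s: set() for s in states}   # predecessor map, built once: O(|R|)
    for (s, t) in R:
        preds[t].add(s)
    labeled = set(sat_q)                 # every q-state satisfies E[p U q]
    frontier = list(sat_q)
    while frontier:                      # each state enters the frontier once
        t = frontier.pop()
        for s in preds[t]:
            if s in sat_p and s not in labeled:
                labeled.add(s)
                frontier.append(s)
    return labeled

# Tiny example: chain 0 -> 1 -> 2 with a self-loop on 2,
# p true in {0, 1} and q true in {2}.
states = {0, 1, 2}
R = {(0, 1), (1, 2), (2, 2)}
print(check_EU(states, R, sat_p={0, 1}, sat_q={2}))  # {0, 1, 2}
```

Each state and edge is touched a constant number of times, giving the $O(|S| + |R|)$ bound cited above.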
Why It Matters
The linear complexity (in the model) is what makes model checking practical. The exponential factor, for LTL, is in the specification size, which is typically small (tens of operators). The model can have billions of states and still be checkable, especially with symbolic model checking (using BDDs or SAT solvers to represent state sets compactly).
This result underpins the practical success of model checking in hardware verification (Intel uses model checking for chip design), protocol verification (TLS, consensus algorithms), and software verification (SLAM, CEGAR).
Failure Mode
The state explosion problem: the model size can be exponential in the number of system components. A system with $n$ boolean variables has $2^n$ states. Concurrent systems with $k$ processes, each with $m$ states, have $m^k$ states. Symbolic methods (BDDs, SAT-based BMC) and abstraction (CEGAR) mitigate but do not eliminate this problem.
Proof Assistants in Practice
Lean 4
Lean 4 is both a programming language and a proof assistant. Its type system is based on the Calculus of Inductive Constructions (CIC). Key features:
- Dependent types: types can depend on values, enabling precise specifications
- Tactic mode: proofs can be written as sequences of proof tactics (automated proof steps)
- Metaprogramming: tactics are written in Lean itself, making the system extensible
- Mathlib: a large mathematical library (hundreds of thousands of theorems)
Lean is the primary target for AI-assisted theorem proving — see the AlphaProof section below.
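A small taste of the two proof styles in Lean 4 (assuming only core Lean; the goals are deliberately trivial):

```lean
-- Term mode: supply the core lemma directly as the proof term.
example (n : Nat) : n + 0 = n := Nat.add_zero n

-- Tactic mode: the same goal discharged by the simplifier.
example (n : Nat) : n + 0 = n := by simp

-- `decide` proves decidable propositions by direct computation.
example : 2 + 2 = 4 := by decide
```

In both styles the kernel ultimately checks the same kind of artifact: a well-typed proof term.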
Coq
Coq is the oldest widely used proof assistant based on CIC. CompCert (a verified C compiler) and CertiKOS (a verified OS kernel) were developed in Coq. Its tactic language (Ltac) is mature but less ergonomic than Lean 4's.
Isabelle/HOL
Isabelle uses higher-order logic (simpler than CIC) and has strong automation. The seL4 verified microkernel was developed in Isabelle.
Verified Systems
CompCert (Leroy, 2006): a C compiler verified in Coq to preserve the semantics of every compiled program. For any source program $S$ and its compiled program $C$: if $S$ has well-defined behavior, then $C$ has the same observable behavior. The proof covers all optimization passes. This means compiler bugs cannot cause verified-source programs to misbehave.
seL4 (Klein et al., 2009): a microkernel verified in Isabelle/HOL. The proof establishes that the C implementation correctly implements the abstract specification, which includes memory isolation, capability-based access control, and interrupt handling. The verification took approximately 20 person-years.
Flyspeck (Hales et al., 2017): a formal proof of the Kepler conjecture (optimal sphere packing in 3D). The original proof by Hales (1998) relied on extensive computation that referees could not fully check. The Flyspeck project formalized the entire proof in HOL Light, producing a machine-verified certificate.
SMT Solvers, Bounded Model Checking, and CEGAR
Two infrastructure pieces sit underneath both verifiers and modern proof assistants.
SMT solvers decide quantifier-free formulas over combined theories (booleans, linear arithmetic, bitvectors, arrays, uninterpreted functions). The dominant tools are Z3 (de Moura & Bjørner 2008) and CVC5 (Barbosa et al. 2022). They underpin Boogie/Dafny, the Lean decide tactic, symbolic execution engines (KLEE, Manticore), and most program-analysis tooling.
Bounded model checking (BMC), due to Biere, Cimatti, Clarke & Zhu (1999), unrolls a transition system to depth $k$ and asks a SAT/SMT solver whether any counterexample of length at most $k$ exists. BMC sidesteps the state explosion of explicit-state model checking and is the workhorse of hardware bug-hunting. It is bug-finding rather than full verification unless the diameter of the state graph is known.
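The unrolling idea in miniature, with explicit enumeration standing in for the SAT solver (the saturating-counter system is a made-up example; real BMC encodes the unrolled transitions as one propositional formula):

```python
from itertools import product

# BMC sketch: search for a counterexample path of length <= k by
# unrolling the transition relation. Here we enumerate input sequences
# explicitly; a real BMC tool would hand the unrolling to a SAT/SMT solver.

def bmc(init, step, inputs, bad, k):
    """Return a counterexample trace of length <= k, or None."""
    for depth in range(k + 1):
        for seq in product(inputs, repeat=depth):
            s, trace = init, [init]
            for i in seq:
                s = step(s, i)
                trace.append(s)
            if bad(s):
                return trace
    return None

# 3-bit saturating counter; the "bug" is reaching value 7.
step = lambda s, i: min(s + i, 7)
cex = bmc(init=0, step=step, inputs=[0, 1], bad=lambda s: s == 7, k=8)
print(cex)  # prints [0, 1, 2, 3, 4, 5, 6, 7]: seven increments
```

Note the bounded-completeness caveat from the paragraph above: with `k=5` this search reports no bug, even though one exists at depth 7.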
CEGAR (Counter-Example Guided Abstraction Refinement; Clarke, Grumberg, Jha, Lu & Veith JACM 2003) iterates: model-check an abstraction, get either a proof or a candidate counterexample, replay the counterexample on the concrete system. If real, report it; if spurious, refine the abstraction by adding the predicates needed to rule it out. CEGAR drives SLAM, BLAST, and many software model checkers.
ML for Theorem Proving — AlphaProof and the 2024 Frontier
The most consequential 2024 development bridging formal verification and ML is DeepMind's AlphaProof, which achieved silver-medal performance on the International Mathematical Olympiad (IMO) 2024 by generating Lean 4 proofs for olympiad problems. AlphaProof combines a Gemini-based informal reasoner that translates natural-language problems into Lean conjectures with a reinforcement-learning prover that searches for proofs against the Lean kernel. Combined with AlphaGeometry 2 for geometry problems, the system solved 4 of 6 IMO problems — within one point of gold.
This sits on a research stack that matured rapidly between 2020 and 2024:
- GPT-f (Polu & Sutskever 2020): first generative LM trained to produce Metamath proofs.
- PACT (Han et al. 2022): co-training on proof-state and proof-step prediction in Lean.
- HyperTree Proof Search (Lample et al. 2022): Meta's MCTS-style search for theorem proving.
- LeanDojo (Yang et al. NeurIPS 2023): open-source framework giving programmatic access to Lean and a retrieval-augmented baseline (ReProver); the training/eval substrate for most subsequent academic work.
- AlphaGeometry (Trinh, Wu, Le, He & Luong, Nature 2024): neuro-symbolic system that solved 25 of 30 IMO geometry problems by combining a symbolic deduction engine with a transformer that proposes auxiliary constructions.
- AlphaProof (DeepMind 2024): IMO 2024 silver via Lean 4 + RL.
Standard benchmarks: miniF2F (Zheng, Han & Polu, ICLR 2022) and PutnamBench (Tsoukalas et al. 2024) measure LM theorem-proving on olympiad and Putnam problems. Progress on these benchmarks went from < 30% to > 75% on miniF2F-test between 2021 and 2024.
The foundational interpretation: under Curry-Howard, training an LM to produce Lean proofs is training it to produce well-typed terms in a dependently typed language. Every accepted output is a machine-checked theorem — there is no hallucination problem at the proof level, only at the conjecture level. This is the closest the field has come to a "verifier-in-the-loop" training signal for mathematical reasoning.
Connections to ML
Verified neural network properties. Given a neural network $f$ and an input region $\mathcal{X}$, verify that $f(x) \in \mathcal{Y}$ for all $x \in \mathcal{X}$. For example: prove that a classifier's output does not change for any perturbation within an $\ell_\infty$ ball (adversarial robustness). Tools like alpha-beta-CROWN and MN-BaB use bound propagation and branch-and-bound to verify ReLU networks. These are sound (a verified property truly holds) but incomplete (they may fail to verify a property that holds).
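A sketch of the bound-propagation idea: interval bound propagation (IBP) through a tiny ReLU network with made-up weights. The computed output interval is sound (every output reachable from the input box lies inside it) but generally loose; tools like alpha-beta-CROWN use much tighter linear relaxations.

```python
# Interval bound propagation through y = ReLU(W1 x + b1), z = W2 y + b2.
# All weights below are illustrative, not from any real model.

def affine_bounds(W, b, lo, hi):
    """Sound interval arithmetic through an affine layer y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, xl, xh in zip(row, lo, hi):
            # a positive weight maps [xl, xh] monotonically; a negative
            # weight swaps which endpoint gives the min and the max
            l += w * xl if w >= 0 else w * xh
            h += w * xh if w >= 0 else w * xl
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

# 2 inputs -> 2 hidden ReLU units -> 1 output
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]
W2, b2 = [[1.0, -2.0]], [0.0]

lo, hi = [0.0, 0.0], [1.0, 1.0]   # input box: the unit square
lo, hi = relu_bounds(*affine_bounds(W1, b1, lo, hi))
lo, hi = affine_bounds(W2, b2, lo, hi)
print(lo, hi)  # prints [-1.5] [1.0]: certified bounds over the whole box
```

If the certified interval for a robustness margin stays on one side of zero, the property is verified for the entire input region at once, something no amount of point-wise testing can establish.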
Autoformalization. Use language models to translate informal mathematical proofs into formal proofs in Lean or Coq. This is an active research area: if successful, it would dramatically accelerate formalization of mathematics and enable AI systems to produce verifiable mathematical reasoning.
Agent specifications. For an RL agent deployed in a safety-critical setting, specify properties like "the agent never takes action X in state Y" as temporal logic formulas and verify them against the agent's policy (if the policy is a finite-state controller) or check them empirically with formal guarantees (if the policy is a neural network with verified bounds).
Common Confusions
Formal verification is not testing
Testing checks specific inputs. Verification proves a property for all inputs (or all reachable states). A test suite with 99.9% coverage can miss the one edge case that causes a disaster. Verification guarantees no such edge case exists, relative to the formal specification.
Verification does not mean the system is correct in every sense
Verification proves that the implementation matches the specification. If the specification is wrong, the system is verified but still does the wrong thing. The specification problem (formally capturing what you actually want) is the formal-methods analog of the reward design problem in RL.
Rice's theorem does not make verification useless
Rice's theorem says you cannot verify arbitrary programs for arbitrary properties. It does not say you cannot verify specific programs for specific properties. Model checking works for finite-state systems. Theorem proving works for programs where humans (or AI) can construct proofs. Abstraction and approximation handle many intermediate cases. The practical question is always: can we verify THIS system for THIS property at acceptable cost?
Proof assistants do not generate proofs automatically
Proof assistants check proofs, not find them. The human (or AI) provides the proof structure, and the assistant verifies that every step is logically valid. Some automation exists (tactic solvers, SMT integration), but for nontrivial theorems, significant human guidance is required. The gap between "the theorem is true" and "we have a machine-checked proof" can be years of effort.
Summary
- Model checking exhaustively explores finite-state systems; CTL checking is linear in both model and specification size, while LTL checking is exponential in the specification size
- Theorem proving constructs machine-checked proofs for arbitrary mathematical statements, including properties of infinite-state systems
- The Curry-Howard correspondence equates proofs with programs, enabling proof assistants based on type theory
- Rice's theorem sets fundamental limits: general verification is undecidable, but specific verification is often feasible
- CompCert (verified compiler) and seL4 (verified microkernel) are landmark verified systems
- Neural network verification, autoformalization, and agent specification are active ML-formal methods intersections
- Verification proves implementation matches specification; it does not prove the specification is correct
Exercises
Problem
A traffic light controller has 3 states (red, yellow, green) for each of two directions (North-South and East-West), giving $3 \times 3 = 9$ combined states. Write a CTL safety property stating that both directions are never green simultaneously. How many states does the model checker need to explore?
Problem
In Lean 4, the statement `theorem add_comm (n m : Nat) : n + m = m + n` declares that natural number addition is commutative. Explain, using the Curry-Howard correspondence, what this declaration asks you to produce.
Problem
A ReLU neural network defines a piecewise linear function $f$. For a given input $x_0$ and perturbation radius $\epsilon$, the robustness verification problem asks: does $f(x)$ yield the same classification as $f(x_0)$ for all $x$ with $\|x - x_0\|_\infty \le \epsilon$? Explain why this is NP-hard in general and describe the approach used by bound-propagation verifiers.
References
Canonical:
- Clarke, Grumberg, Peled, Model Checking (2nd ed., 2018), Chapters 1-4
- Baier & Katoen, Principles of Model Checking (2008), Chapters 5-6
- Pierce, Software Foundations (online textbook for Coq-based verification)
Verified Systems:
- Leroy, "Formal Verification of a Realistic Compiler" (CompCert, CACM 2009)
- Klein et al., "seL4: Formal Verification of an OS Kernel" (SOSP 2009)
ML Connections, Theorem Proving:
- Polu & Sutskever, "Generative Language Modeling for Automated Theorem Proving" (2020): GPT-f
- Zheng, Han & Polu, "miniF2F: a cross-system benchmark for formal Olympiad-level mathematics" (ICLR 2022)
- Yang et al., "LeanDojo: Theorem Proving with Retrieval-Augmented Language Models" (NeurIPS 2023)
- Trinh, Wu, Le, He & Luong, "Solving olympiad geometry without human demonstrations" (Nature 2024): AlphaGeometry
- DeepMind AlphaProof / AlphaGeometry team, "AI achieves silver-medal standard solving International Mathematical Olympiad problems" (2024)
- First, Rabe, Bansal, "Baldur: Whole-Proof Generation and Repair with Large Language Models" (2023)
ML Connections, Network Verification:
- Katz et al., "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks" (CAV 2017)
- Wang et al., "Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification" (NeurIPS 2021)
SMT and Model Checking Infrastructure:
- Biere, Cimatti, Clarke & Zhu, "Symbolic Model Checking without BDDs" (1999): bounded model checking
- Clarke, Grumberg, Jha, Lu & Veith, "Counterexample-guided abstraction refinement for symbolic model checking" (JACM 2003): CEGAR
- de Moura & Bjørner, "Z3: An Efficient SMT Solver" (TACAS 2008)
Next Topics
- Computability theory: the theoretical foundations underlying Rice's theorem and the limits of automated reasoning