
Foundations

Formal Verification and Proof Assistants

Some behaviors should be proven, not merely hoped for. Model checking, theorem proving, Lean/Coq, verified compilers, and why verification matters for infrastructure, protocols, and safety-critical AI.


Why This Matters

Testing can show the presence of bugs but never their absence. For safety-critical systems (aircraft control, medical devices, cryptographic protocols, financial infrastructure), "it worked on the test cases" is not sufficient. Formal verification provides mathematical proof that a system satisfies its specification for ALL possible inputs, not just the ones you tested.

The connection to ML: as AI systems are deployed in high-stakes domains, the question shifts from "does it work on the benchmark?" to "can we prove it satisfies certain properties?" Verified neural network properties (robustness certificates, output bounds), formal specifications for RL agents, and machine-checked proofs of mathematical theorems are active research areas where formal methods intersect with ML.

Proof assistants (Lean, Coq, Isabelle) also matter for mathematics itself. The Kepler conjecture was proved with computer-assisted verification (Hales, Flyspeck project). Large language models are now being trained to generate formal proofs, with Lean 4 as the primary target language. Understanding what formal verification can and cannot do is necessary context for this line of research.

Prerequisites

This page assumes familiarity with basic logic and proof techniques (propositional logic, predicate logic, induction) and type theory (types as propositions, the Curry-Howard correspondence).

Two Paradigms: Model Checking and Theorem Proving

Definition

Model Checking

Model checking is an automated verification technique that exhaustively explores the state space of a finite-state system to check whether a property (expressed in temporal logic) holds. Given a model $M$ (a finite-state transition system) and a specification $\phi$ (a temporal logic formula), the model checker determines whether $M \models \phi$ (the model satisfies the specification).

The inputs are:

  • A model: a finite-state transition system $M = (S, S_0, T, L)$ with states $S$, initial states $S_0$, transition relation $T \subseteq S \times S$, and labeling function $L: S \to 2^{AP}$ mapping states to sets of atomic propositions
  • A specification: a formula in temporal logic (CTL, LTL, CTL*)

The output is either "yes" (with a proof) or "no" (with a counterexample trace).
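As a toy illustration (not any particular tool's API; all names are our own), an explicit-state checker for the simplest kind of specification, a safety invariant $AG\,p$, is just a breadth-first reachability search that returns a counterexample trace on failure:

```python
from collections import deque

def check_invariant(initial, transitions, prop):
    """Explicit-state check of the safety property AG prop.

    Explores every reachable state; returns (True, None) if prop holds
    everywhere, else (False, trace) where trace is a shortest path from
    an initial state to a violating state (the counterexample)."""
    parent = {s: None for s in initial}   # also serves as the visited set
    frontier = deque(initial)
    while frontier:
        s = frontier.popleft()
        if not prop(s):
            # reconstruct the counterexample trace back to an initial state
            trace = []
            while s is not None:
                trace.append(s)
                s = parent[s]
            return False, list(reversed(trace))
        for t in transitions(s):
            if t not in parent:
                parent[t] = s
                frontier.append(t)
    return True, None

# Toy model: a counter that wraps modulo 8.
ok, cex = check_invariant(
    initial=[0],
    transitions=lambda s: [(s + 1) % 8],
    prop=lambda s: s != 9,
)
# ok is True: state 9 is unreachable, so AG (s != 9) holds
```

Real model checkers differ mainly in how they represent the (often astronomically large) set of visited states, e.g. symbolically with BDDs rather than as an explicit hash set.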

Definition

Theorem Proving (Interactive)

Interactive theorem proving uses a proof assistant (Lean, Coq, Isabelle/HOL) to construct machine-checked proofs of mathematical statements. The human provides the proof strategy; the proof assistant checks every logical step. The result is a formal proof object that can be independently verified by a small trusted kernel.

Unlike model checking, theorem proving handles infinite-state systems and arbitrary mathematical statements. The cost: proofs require significant human effort (or, increasingly, AI assistance).

Definition

Temporal Logic (CTL and LTL)

Temporal logic extends propositional logic with operators that express properties over time:

  • LTL (Linear Temporal Logic): reasons about single execution paths.

    • $\square p$: "always $p$" (safety: something bad never happens)
    • $\lozenge p$: "eventually $p$" (liveness: something good eventually happens)
    • $p \; \mathcal{U} \; q$: "$p$ holds until $q$ holds"
  • CTL (Computation Tree Logic): reasons about branching computation trees.

    • $AG \, p$: "for all paths, always $p$"
    • $EF \, p$: "there exists a path where eventually $p$"
    • $AF \, p$: "for all paths, eventually $p$"

LTL specifications express properties of traces. CTL specifications express properties of the full computation tree.
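The genuine semantics of LTL is over infinite executions (which is why model checkers compile formulas into automata), but the meaning of the operators is easy to see on a finite trace. A minimal sketch, restricted to finite-trace semantics, with helper names of our own:

```python
def always(p, trace):
    """□p over a finite trace: p holds at every position."""
    return all(p(s) for s in trace)

def eventually(p, trace):
    """◇p over a finite trace: p holds at some position."""
    return any(p(s) for s in trace)

def until(p, q, trace):
    """p U q over a finite trace: q holds at some position i,
    and p holds at every position strictly before i."""
    for i, s in enumerate(trace):
        if q(s):
            return all(p(t) for t in trace[:i])
    return False

trace = [0, 1, 2, 3, 3]
# "the counter stays below 3 until it reaches 3" holds on this trace
assert until(lambda s: s < 3, lambda s: s == 3, trace)
```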

The Curry-Howard Correspondence

The bridge between logic and programming. Under the Curry-Howard correspondence:

| Logic | Programming |
| --- | --- |
| Propositions | Types |
| Proofs | Programs (terms) |
| Implication $A \Rightarrow B$ | Function type $A \to B$ |
| Conjunction $A \wedge B$ | Product type $A \times B$ |
| Disjunction $A \vee B$ | Sum type $A + B$ |
| Universal $\forall x.\, P(x)$ | Dependent function $\Pi_{x:A}\, P(x)$ |
| Existential $\exists x.\, P(x)$ | Dependent pair $\Sigma_{x:A}\, P(x)$ |

A proof of $A \Rightarrow B$ is a function that takes a proof of $A$ and produces a proof of $B$. This is not a metaphor: in Lean 4, the statement theorem foo : P -> Q declares a function from proofs of $P$ to proofs of $Q$. Writing the function body IS the proof.

This correspondence is the foundation of all modern proof assistants. It means that type checking (which compilers do automatically) is proof checking. A well-typed program in a dependently typed language is a verified proof.
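As a small illustration of propositions-as-types, here are two Lean 4 proofs written directly as programs (the theorem names are our own, not from Mathlib):

```lean
-- Implication is a function type: a proof of P → Q, applied to
-- a proof of P, yields a proof of Q. The body is ordinary
-- function application.
theorem modus_ponens (P Q : Prop) (h : P → Q) (hp : P) : Q :=
  h hp

-- Conjunction is a product type: a proof of P ∧ Q is a pair of
-- proofs, and swapping the components proves Q ∧ P.
theorem and_swap (P Q : Prop) (h : P ∧ Q) : Q ∧ P :=
  ⟨h.right, h.left⟩
```

Type checking these definitions is exactly proof checking: if either body were logically wrong, it would fail to type-check.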

Main Theorems

Theorem

Rice's Theorem (Verification is Generally Undecidable)

Statement

Let $P$ be any nontrivial semantic property of programs (i.e., a property of the function computed by the program, not of the program text). Then it is undecidable whether an arbitrary program satisfies $P$.

Formally: let $\mathcal{P}$ be a nontrivial set of partial recursive functions. Then the set $\{e : \phi_e \in \mathcal{P}\}$ is not recursive, where $\phi_e$ is the function computed by program $e$.

Intuition

If you could decide any nontrivial property of programs, you could in particular decide whether a program halts (by checking the property "computes a total function"). Since the halting problem is undecidable, no nontrivial property can be decidable in general.

This does not mean verification is impossible. It means there is no single algorithm that correctly verifies ALL programs. For specific programs with specific structures (finite-state systems, programs with bounded loops, programs that terminate by construction), verification is often feasible.

Proof Sketch

By reduction from the halting problem. Given a Turing machine $M$ and input $w$, construct a program $e$ such that $\phi_e \in \mathcal{P}$ if and only if $M$ halts on $w$. Specifically: let $f \in \mathcal{P}$ and $g \notin \mathcal{P}$. Define $e$ to simulate $M$ on $w$; if $M$ halts, $e$ computes $f$; otherwise, $e$ computes $g$ (or diverges). If we could decide membership in $\mathcal{P}$, we could decide the halting problem. Contradiction.

Why It Matters

Rice's theorem sets the fundamental limits of verification. It explains why we need both automated tools (model checking for finite-state systems) and human-guided proof (theorem proving for infinite-state systems). It also explains why verification tools use abstractions, approximations, and restricted program models: they sidestep the undecidability by working with subsets of programs where the property IS decidable.

For ML: verifying that a neural network satisfies a property (e.g., robustness to perturbations) is undecidable in general for ReLU networks with arbitrary architectures. Practical verifiers work on specific architectures with bounded depth and width.

Failure Mode

The theorem applies to Turing-complete languages. For restricted computational models (finite automata, pushdown automata, bounded programs), many properties ARE decidable. Model checking exploits this: by restricting to finite-state systems, exhaustive state exploration becomes possible (though potentially expensive).

Theorem

CTL Model Checking Complexity

Statement

The CTL model checking problem (given $M$ and $\phi$, does $M \models \phi$?) can be solved in time $O(|\phi| \cdot (|S| + |T|))$. That is, model checking is linear in the size of the specification and linear in the size of the model.

For LTL model checking, the complexity is $O(|S| \cdot 2^{|\phi|})$: linear in the model size but exponential in the specification length.

Intuition

CTL model checking works by labeling states bottom-up through the formula structure. For each subformula, you mark which states satisfy it. The fixpoint computations (for operators like $AG$ and $EU$) require traversing the transition relation, taking $O(|S| + |T|)$ per subformula. The total work is proportional to the number of subformulas times the model size.

LTL model checking is harder because LTL specifications can express properties that require tracking the full execution path, not just the current state. The specification is compiled into a Büchi automaton (potentially exponentially larger), and the model is composed with this automaton.

Proof Sketch

For CTL: by structural induction on $\phi$. Atomic propositions are checked in $O(|S|)$ using the labeling function. Boolean connectives are checked in $O(|S|)$. The temporal operators $EX$, $EU$, and $EG$ require fixpoint computation. $EX\,\phi$ labels all states with a successor satisfying $\phi$: one pass over $T$, costing $O(|T|)$. $E[\phi_1 \, U \, \phi_2]$ uses a backward reachability computation from states satisfying $\phi_2$, costing $O(|S| + |T|)$. $EG\,\phi$ uses a greatest fixpoint computation on the subgraph of states satisfying $\phi$, also $O(|S| + |T|)$.

The full proof is in Clarke, Grumberg, and Peled, Model Checking, Chapter 4.
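The $EX$ and $EU$ cases of this induction fit in a few lines of code. A sketch under illustrative names (not from any particular tool), where a model is a set of states plus a successor function:

```python
def label_EX(states, trans, sat_phi):
    """EX phi: states with at least one successor satisfying phi.
    One pass over the transition relation."""
    return {s for s in states if any(t in sat_phi for t in trans(s))}

def label_EU(states, trans, sat_p, sat_q):
    """E[p U q]: least fixpoint via backward reachability from sat_q
    through states satisfying p. Each edge is examined O(1) times,
    giving the O(|S| + |T|) bound from the theorem."""
    # build the predecessor map once
    preds = {s: set() for s in states}
    for s in states:
        for t in trans(s):
            preds[t].add(s)
    result = set(sat_q)
    worklist = list(sat_q)
    while worklist:
        t = worklist.pop()
        for s in preds[t]:
            if s in sat_p and s not in result:
                result.add(s)
                worklist.append(s)
    return result

# Toy model: a chain 0 -> 1 -> 2 -> 3.
states = {0, 1, 2, 3}
trans = lambda s: [s + 1] if s < 3 else []
# E[(s < 3) U (s == 3)] holds in every state of the chain
assert label_EU(states, trans, {0, 1, 2}, {3}) == {0, 1, 2, 3}
```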

Why It Matters

The linear complexity (in the model) is what makes model checking practical. The exponential part is in the specification size, which is typically small (tens of operators). The model can have billions of states and still be checkable, especially with symbolic model checking (using BDDs or SAT solvers to represent state sets compactly).

This result underpins the practical success of model checking in hardware verification (Intel uses model checking for chip design), protocol verification (TLS, consensus algorithms), and software verification (SLAM, CEGAR).

Failure Mode

The state explosion problem: the model size $|S|$ can be exponential in the number of system components. A system with $n$ boolean variables has $2^n$ states. Concurrent systems with $k$ processes, each with $m$ states, have $m^k$ states. Symbolic methods (BDDs, SAT-based BMC) and abstraction (CEGAR) mitigate but do not eliminate this problem.

Proof Assistants in Practice

Lean 4

Lean 4 is both a programming language and a proof assistant. Its type system is based on the Calculus of Inductive Constructions (CIC). Key features:

  • Dependent types: types can depend on values, enabling precise specifications
  • Tactic mode: proofs can be written as sequences of proof tactics (automated proof steps)
  • Metaprogramming: tactics are written in Lean itself, making the system extensible
  • Mathlib: a large mathematical library (hundreds of thousands of theorems)

Lean is the primary target for AI-assisted theorem proving (DeepMind's AlphaProof, Meta's work on formal mathematics).

Coq

Coq is the oldest widely used proof assistant based on CIC. CompCert (a verified C compiler) and CertiKOS (a verified OS kernel) were developed in Coq. Its tactic language (Ltac) is mature but less ergonomic than Lean 4's.

Isabelle/HOL

Isabelle uses higher-order logic (simpler than CIC) and has strong automation. The seL4 verified microkernel was developed in Isabelle.

Verified Systems

CompCert (Leroy, 2006): a C compiler verified in Coq to preserve the semantics of every compiled program. For any source program $P$ and compiled program $C$: if $P$ has well-defined behavior, then $C$ has the same observable behavior. The proof covers all optimization passes. This means compiler bugs cannot cause verified-source programs to misbehave.

seL4 (Klein et al., 2009): a microkernel verified in Isabelle/HOL. The proof establishes that the C implementation correctly implements the abstract specification, which includes memory isolation, capability-based access control, and interrupt handling. The verification took approximately 20 person-years.

Flyspeck (Hales et al., 2017): a formal proof of the Kepler conjecture (optimal sphere packing in 3D). The original proof by Hales (1998) relied on extensive computation that referees could not fully check. The Flyspeck project formalized the entire proof in HOL Light, producing a machine-verified certificate.

Connections to ML

Verified neural network properties. Given a neural network $f$ and an input region $X$, verify that $f(x) \in Y$ for all $x \in X$. For example: prove that a classifier's output does not change for any perturbation within an $\ell_\infty$ ball (adversarial robustness). Tools like alpha-beta-CROWN and MN-BaB use bound propagation and branch-and-bound to verify ReLU networks. These are sound (a verified property truly holds) but incomplete (they may fail to verify a property that holds).
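The simplest instance of bound propagation is interval bound propagation (IBP), which pushes an axis-aligned input box through the network layer by layer. The sketch below (names our own; much looser than the bounds real verifiers compute) illustrates why the result is sound: every inequality holds for all points in the box.

```python
import numpy as np

def interval_bounds(weights, biases, lo, hi):
    """Propagate the input box [lo, hi] through a fully connected
    ReLU network, returning sound elementwise output bounds.

    For an affine layer y = W x + b, the extreme values of each output
    are reached at box corners: splitting W into positive and negative
    parts, W+ @ lo + W- @ hi is a valid lower bound (and symmetrically
    for the upper bound). ReLU is monotone, so it maps bounds to bounds."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lo = W_pos @ lo + W_neg @ hi + b
        new_hi = W_pos @ hi + W_neg @ lo + b
        lo, hi = new_lo, new_hi
        if i < len(weights) - 1:  # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

# Toy network f(x) = relu(x) + relu(-x) + 1 on the ball |x| <= 0.5:
W1, b1 = np.array([[1.0], [-1.0]]), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.array([1.0])
lo, hi = interval_bounds([W1, W2], [b1, b2],
                         np.array([-0.5]), np.array([0.5]))
# lo[0] == 1.0 > 0, so f(x) > 0 is certified on the entire ball
```

If the computed lower bound of the output is positive, the property $f(x) > 0$ is verified; if not, the verifier learns nothing (incompleteness), and branch-and-bound tools split the box or the ReLU activation patterns to tighten the bounds.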

Autoformalization. Use language models to translate informal mathematical proofs into formal proofs in Lean or Coq. This is an active research area: if successful, it would dramatically accelerate formalization of mathematics and enable AI systems to produce verifiable mathematical reasoning.

Agent specifications. For an RL agent deployed in a safety-critical setting, specify properties like "the agent never takes action X in state Y" as temporal logic formulas and verify them against the agent's policy (if the policy is a finite-state controller) or check them empirically with formal guarantees (if the policy is a neural network with verified bounds).
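For the finite-state-controller case, checking "the agent never takes action X in state Y" reduces to reachability over the closed-loop system (policy composed with environment). A hypothetical sketch, with all names our own:

```python
def policy_safe(initial, policy, step, forbidden):
    """Check that no reachable state of the closed-loop system makes
    the policy choose a forbidden (state, action) pair.

    policy:    maps a state to the action the agent takes there
    step:      maps (state, action) to the possible successor states
    forbidden: set of (state, action) pairs the spec rules out"""
    stack, visited = list(initial), set(initial)
    while stack:
        s = stack.pop()
        a = policy(s)
        if (s, a) in forbidden:
            return False  # counterexample: forbidden pair is reachable
        for t in step(s, a):
            if t not in visited:
                visited.add(t)
                stack.append(t)
    return True

# Toy agent walking up a chain of states 0..4, then staying put.
policy = lambda s: "inc" if s < 4 else "stay"
step = lambda s, a: [s + 1] if a == "inc" else [s]
assert policy_safe([0], policy, step, forbidden={(2, "inc")}) is False
```

Note that the check is sound only because the closed-loop state space is finite; for neural policies, the same specification must instead be discharged with verified output bounds on the network.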

Common Confusions

Watch Out

Formal verification is not testing

Testing checks specific inputs. Verification proves a property for all inputs (or all reachable states). A test suite with 99.9% coverage can miss the one edge case that causes a disaster. Verification guarantees no such edge case exists, relative to the formal specification.

Watch Out

Verification does not mean the system is correct in every sense

Verification proves that the implementation matches the specification. If the specification is wrong, the system is verified but still does the wrong thing. The specification problem (formally capturing what you actually want) is the formal-methods analog of the reward design problem in RL.

Watch Out

Rice's theorem does not make verification useless

Rice's theorem says you cannot verify arbitrary programs for arbitrary properties. It does not say you cannot verify specific programs for specific properties. Model checking works for finite-state systems. Theorem proving works for programs where humans (or AI) can construct proofs. Abstraction and approximation handle many intermediate cases. The practical question is always: can we verify THIS system for THIS property at acceptable cost?

Watch Out

Proof assistants do not generate proofs automatically

Proof assistants check proofs, not find them. The human (or AI) provides the proof structure, and the assistant verifies that every step is logically valid. Some automation exists (tactic solvers, SMT integration), but for nontrivial theorems, significant human guidance is required. The gap between "the theorem is true" and "we have a machine-checked proof" can be years of effort.

Key Takeaways

  • Model checking exhaustively explores finite-state systems; CTL checking is linear in model size, exponential in specification size
  • Theorem proving constructs machine-checked proofs for arbitrary mathematical statements, including properties of infinite-state systems
  • The Curry-Howard correspondence equates proofs with programs, enabling proof assistants based on type theory
  • Rice's theorem sets fundamental limits: general verification is undecidable, but specific verification is often feasible
  • CompCert (verified compiler) and seL4 (verified microkernel) are landmark verified systems
  • Neural network verification, autoformalization, and agent specification are active ML-formal methods intersections
  • Verification proves implementation matches specification; it does not prove the specification is correct

Exercises

ExerciseCore

Problem

A traffic light controller has states $\{R, Y, G\}$ (red, yellow, green) for two directions (North-South and East-West), giving $9$ combined states. Write a CTL safety property stating that both directions are never green simultaneously. How many states does the model checker need to explore?

ExerciseCore

Problem

In Lean 4, the statement theorem add_comm (n m : Nat) : n + m = m + n declares that natural number addition is commutative. Explain, using the Curry-Howard correspondence, what this declaration asks you to produce.

ExerciseAdvanced

Problem

A ReLU neural network $f: \mathbb{R}^d \to \mathbb{R}$ defines a piecewise linear function. For a given input $x_0$ and perturbation radius $\epsilon$, the robustness verification problem asks: is $f(x) > 0$ for all $x$ with $\|x - x_0\|_\infty \leq \epsilon$? Explain why this is NP-hard in general and describe the approach used by bound-propagation verifiers.

References

Canonical:

  • Clarke, Grumberg, Peled, Model Checking (2nd ed., 2018), Chapters 1-4
  • Baier & Katoen, Principles of Model Checking (2008), Chapters 5-6
  • Pierce, Software Foundations (online textbook for Coq-based verification)

Verified Systems:

  • Leroy, "Formal Verification of a Realistic Compiler" (CompCert, CACM 2009)
  • Klein et al., "seL4: Formal Verification of an OS Kernel" (SOSP 2009)

ML Connections:

  • Katz et al., "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks" (CAV 2017)
  • First, Rabe, Bansal, "Baldur: Whole-Proof Generation and Repair with Large Language Models" (2023)
  • Wu et al., "An Integer Linear Programming Framework for Mining Constraints from Data" (verified ML properties, 2022)

Next Topics

  • Computability theory: the theoretical foundations underlying Rice's theorem and the limits of automated reasoning

Last reviewed: April 2026
