
Modern Generalization

PAC-Bayes Bounds

Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail.

Advanced · Tier 2 · ~60 min

Why This Matters

Classical generalization bounds (VC dimension, Rademacher complexity) scale with the number of parameters. For modern neural networks with millions or billions of parameters, these bounds are vacuous: they predict generalization error larger than 1 even for networks that generalize well in practice.

PAC-Bayes bounds offer a different approach. Instead of measuring the size of the hypothesis class, they measure how much the learning algorithm's output (posterior) differs from a prior chosen before seeing data. If the learned solution is close to the prior in KL divergence, the bound is tight. This provides a path to non-vacuous generalization bounds for deep networks.

Formal Setup

Let $\mathcal{H}$ be a hypothesis space. Let $P$ be a prior distribution over $\mathcal{H}$, chosen before seeing data. Let $Q$ be a posterior distribution over $\mathcal{H}$, which may depend on the training data $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $\mathcal{D}$.

Definition

Stochastic Classifier

A stochastic classifier draws $h \sim Q$ and predicts using $h$. The population risk of $Q$ is:

$$R(Q) = \mathbb{E}_{h \sim Q}[R(h)] = \mathbb{E}_{h \sim Q}\left[\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]\right]$$

The empirical risk of $Q$ is:

$$\hat{R}_n(Q) = \mathbb{E}_{h \sim Q}\left[\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)\right]$$
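Since $\hat{R}_n(Q)$ is an expectation over draws from $Q$, it is typically estimated by Monte Carlo: sample classifiers from $Q$ and average their empirical losses. A minimal sketch with an illustrative toy dataset and an isotropic Gaussian posterior over linear-classifier weights (all names and values here are made up for illustration):

```python
# Monte Carlo estimate of the empirical risk R_hat_n(Q) of a stochastic
# linear classifier h(x) = sign(w . x). Dataset and posterior are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n points in R^2, labels defined by a fixed linear rule.
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Posterior Q = N(w_hat, sigma^2 I) centered on "trained" weights w_hat.
w_hat = np.array([1.0, 0.5])
sigma = 0.1

def empirical_risk(q_samples=1000):
    """Average 0-1 loss over h ~ Q, i.e. E_{h~Q}[(1/n) sum_i 0-1 loss]."""
    losses = []
    for _ in range(q_samples):
        w = w_hat + sigma * rng.normal(size=2)  # draw h ~ Q
        preds = np.sign(X @ w)
        losses.append(np.mean(preds != y))
    return float(np.mean(losses))

print(f"empirical risk of Q ~= {empirical_risk():.3f}")
```

Because the posterior here is centered on weights that label the toy data perfectly, the estimate is small but nonzero: sampled perturbations occasionally flip points near the decision boundary.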

Definition

KL Divergence

The Kullback-Leibler divergence from $Q$ to $P$ is:

$$\mathrm{KL}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\log \frac{Q(h)}{P(h)}\right]$$

This measures how far $Q$ deviates from the prior $P$. It is zero when $Q = P$ and infinite when $Q$ places mass where $P$ has none.
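For isotropic Gaussians, the family used in the canonical example later on this page, the KL divergence has a simple closed form, which makes these properties easy to check numerically (dimensions and parameters below are illustrative):

```python
# Closed-form KL(Q || P) for isotropic Gaussians with scalar std deviations.
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I))."""
    d = len(mu_q)
    ratio = sigma_q**2 / sigma_p**2
    return 0.5 * (d * (ratio - 1.0 - np.log(ratio))
                  + np.sum((mu_q - mu_p)**2) / sigma_p**2)

# KL is zero when Q = P ...
print(kl_gaussians(np.zeros(3), 1.0, np.zeros(3), 1.0))  # -> 0.0
# ... and grows with the squared distance between the means.
print(kl_gaussians(np.ones(3), 1.0, np.zeros(3), 1.0))   # -> 1.5
```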

Main Theorems

Theorem

McAllester PAC-Bayes Bound

Statement

For any prior $P$ over $\mathcal{H}$ chosen independently of $S$, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of $S$, simultaneously for all posteriors $Q$:

$$R(Q) \leq \hat{R}_n(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(n/\delta)}{2n}}$$
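The bound is a one-line computation once the empirical risk, KL term, sample size, and confidence level are fixed. A sketch with illustrative numbers, showing both a tight and a vacuous regime:

```python
# Numerical evaluation of McAllester's bound for illustrative inputs.
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """R(Q) <= R_hat_n(Q) + sqrt((KL(Q||P) + log(n/delta)) / (2n))."""
    return emp_risk + math.sqrt((kl + math.log(n / delta)) / (2 * n))

# KL much smaller than n -> tight, non-vacuous bound.
print(mcallester_bound(emp_risk=0.02, kl=100.0, n=50_000, delta=0.05))
# KL comparable to n -> the penalty dominates and the bound exceeds 1.
print(mcallester_bound(emp_risk=0.02, kl=200_000.0, n=50_000, delta=0.05))
```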

Intuition

The bound has two terms. The first is the training performance of the stochastic classifier. The second is a complexity penalty that grows with $\mathrm{KL}(Q \| P)$. If the posterior concentrates near the prior, the penalty is small. The bound holds simultaneously for all posteriors $Q$, so you can optimize $Q$ after seeing data without invalidating the guarantee.

Proof Sketch

The proof uses a change-of-measure argument. Start from the moment generating function: for any fixed $h$, $\mathbb{E}_S[e^{n(R(h) - \hat{R}_n(h))^2}]$ is bounded via Hoeffding's lemma. Then use the Donsker-Varadhan variational formula: $\mathbb{E}_{h \sim Q}[f(h)] \leq \mathrm{KL}(Q \| P) + \log \mathbb{E}_{h \sim P}[e^{f(h)}]$ for any measurable $f$. Apply this with $f(h) = n(R(h) - \hat{R}_n(h))^2$, take expectations over $S$, and apply Markov's inequality.

Why It Matters

Unlike VC and Rademacher bounds, the PAC-Bayes bound depends on the relationship between the learned solution and a reference prior, not on the raw capacity of the hypothesis class. For a network with 10 million parameters, if the learned weights are close to initialization (a natural prior), the KL term can be small enough to give non-vacuous bounds.

Failure Mode

The bound is only useful when $\mathrm{KL}(Q \| P)$ is much smaller than $n$. If the prior is poorly chosen (far from the learned solution), the KL term dominates and the bound becomes vacuous. In practice, choosing the prior is the main difficulty. The bound also requires bounded loss; for unbounded losses, modified versions with sub-Gaussian assumptions are needed.

Theorem

Catoni PAC-Bayes Bound

Statement

For any prior $P$, any $\delta > 0$, and any fixed $\lambda > 0$, with probability at least $1 - \delta$ over $S$, for all posteriors $Q$:

$$R(Q) \leq \frac{1}{1 - e^{-\lambda}}\left(1 - \exp\left(-\lambda \hat{R}_n(Q) - \frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{n}\right)\right)$$
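A sketch evaluating Catoni's bound for the same illustrative numbers used above, with $\lambda$ minimized over a grid. (Strictly, the theorem holds for each fixed $\lambda$; optimizing over a pre-chosen grid costs an extra union-bound term, which this sketch omits for simplicity.)

```python
# Catoni's PAC-Bayes bound, minimized over a grid of lambda values.
# Inputs are illustrative; the union-bound penalty for the grid is omitted.
import math

def catoni_bound(emp_risk, kl, n, delta, lam):
    """(1/(1 - e^-lam)) * (1 - exp(-lam*R_hat - (KL + log(1/delta))/n))."""
    inner = -lam * emp_risk - (kl + math.log(1 / delta)) / n
    return (1 - math.exp(inner)) / (1 - math.exp(-lam))

emp_risk, kl, n, delta = 0.02, 100.0, 50_000, 0.05
best = min(catoni_bound(emp_risk, kl, n, delta, 0.1 * k)
           for k in range(1, 100))
print(f"optimized Catoni bound ~= {best:.4f}")
```

For these inputs the optimized value comes out below the corresponding McAllester bound, consistent with the intuition that avoiding the square root tightens the result when both the empirical risk and the KL term are small.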

Intuition

Catoni's bound is tighter than McAllester's because it avoids the square root. The parameter $\lambda$ can be optimized (over a fixed grid, at the cost of a small union-bound penalty). For small empirical risk and small KL, Catoni gives substantially better numerical bounds.

Proof Sketch

Uses the exponential moment method directly. For a loss $\ell$ bounded in $[0, 1]$ with mean $R$, convexity of $x \mapsto e^{-\lambda x}$ gives $\log \mathbb{E}[e^{-\lambda \ell}] \leq \log(1 - R + Re^{-\lambda})$. Combine with Donsker-Varadhan, take expectations over $S$, and apply Markov's inequality.

Why It Matters

In the race to produce non-vacuous bounds for deep networks, Catoni's bound consistently produces tighter numerical values than McAllester's. Dziugaite and Roy (2017) used optimized Catoni bounds to get the first non-vacuous generalization bound for a neural network on MNIST.

Failure Mode

Same prior-dependence issue as McAllester. The $\lambda$ parameter must be chosen or optimized, adding another degree of freedom. For very small training error, the bound approaches the KL term divided by $n$, which is the irreducible cost of choosing the posterior.

Why PAC-Bayes Works for Deep Learning

The key insight: overparameterized networks have many parameters, but the effective complexity of the learned solution may be low. PAC-Bayes captures this through three mechanisms.

Compression view. If the posterior $Q$ is a narrow Gaussian centered on the learned weights and the prior $P$ is a broader Gaussian centered on initialization, then $\mathrm{KL}(Q \| P)$ measures how many bits are needed to describe the learned weights relative to initialization. Networks that learn "simple" functions (expressible with low-precision weights) have small KL.
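The effect of centering the prior at initialization can be made concrete. With equal posterior and prior widths, the Gaussian KL reduces to a squared-distance term, so a prior at initialization pays only for the distance training actually moved the weights. All values below are illustrative:

```python
# Compression view in numbers: a prior centered at initialization yields a
# much smaller KL than a prior at the origin when training moves the weights
# only a short distance. All scales here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d = 10_000
w_init = rng.normal(scale=0.1, size=d)               # initialization
w_trained = w_init + rng.normal(scale=0.01, size=d)  # small training update

sigma = 0.05  # equal prior/posterior widths: KL is a pure distance term

def kl_mean_shift(mu_q, mu_p, sigma):
    """KL between N(mu_q, sigma^2 I) and N(mu_p, sigma^2 I)."""
    return np.sum((mu_q - mu_p)**2) / (2 * sigma**2)

print("prior at origin:        ", kl_mean_shift(w_trained, np.zeros(d), sigma))
print("prior at initialization:", kl_mean_shift(w_trained, w_init, sigma))
```

The second KL is orders of magnitude smaller because it only measures the (small) displacement from initialization, not the magnitude of the weights themselves.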

Flat minima. The posterior width relates to the flatness of the loss landscape. If the loss is insensitive to perturbations of the weights (a flat minimum), then $Q$ can be broad, the KL can be small, and the bound is tight. Sharp minima require narrow $Q$ and yield larger KL.

Data-dependent priors. Using part of the training data to choose the prior (with the remaining data for the bound) can dramatically tighten PAC-Bayes bounds. Ambroladze, Parrado-Hernandez, and Shawe-Taylor (2007) formalized this data-splitting approach.

Comparison with Classical Bounds

| Bound type | Depends on | Deep net behavior |
| --- | --- | --- |
| VC dimension | Number of parameters | Vacuous (billions of params) |
| Rademacher complexity | Norm constraints on weights | Often vacuous for practical norms |
| PAC-Bayes | KL(posterior, prior) | Non-vacuous with good prior |
| Algorithmic stability | Sensitivity to one sample | Bounded for SGD with early stopping |

Common Confusions

Watch Out

PAC-Bayes does not require Bayesian inference

Despite the name, PAC-Bayes bounds are frequentist. The prior $P$ is not a belief; it is a reference distribution chosen for the bound. The posterior $Q$ is not computed by Bayes rule; it is any distribution, typically optimized to minimize the bound. You can apply PAC-Bayes to a deterministic network by setting $Q$ to be a Gaussian centered on the trained weights.

Watch Out

Non-vacuous does not mean tight

The first non-vacuous bound for MNIST (Dziugaite and Roy, 2017) was about 0.16 for a network with test error around 0.02. Non-vacuous means the bound is less than 1 (better than random guessing), which is a milestone but not a precise characterization of generalization.

Watch Out

The prior must be independent of training data

If you set $P$ after looking at the training data, the bound is invalid. The data-splitting trick (use half the data for $P$, half for the bound) is the standard workaround. Forgetting this independence requirement is a common error.

Canonical Examples

Example

Gaussian prior and posterior

Let $P = \mathcal{N}(0, \sigma_P^2 I_d)$ and $Q = \mathcal{N}(w, \sigma_Q^2 I_d)$, where $w$ are the trained weights and $d$ is the number of parameters. Then:

$$\mathrm{KL}(Q \| P) = \frac{d}{2}\left(\frac{\sigma_Q^2}{\sigma_P^2} - 1 - \log\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\|w\|^2}{2\sigma_P^2}$$

For $\sigma_Q \ll \sigma_P$ (narrow posterior, broad prior), this is approximately $\frac{\|w\|^2}{2\sigma_P^2} + \frac{d}{2}\log\frac{\sigma_P^2}{\sigma_Q^2}$. The first term penalizes large weights; the second penalizes many parameters that must be specified precisely. If $\sigma_Q$ can be large (the network is insensitive to weight perturbations), the second term shrinks.
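A sketch comparing the exact KL with the narrow-posterior approximation for illustrative values of $d$, $\sigma_P$, $\sigma_Q$, and $w$; the approximation drops the $\frac{d}{2}\left(1 - \frac{\sigma_Q^2}{\sigma_P^2}\right)$ term, so it overshoots by roughly $d/2$:

```python
# Exact Gaussian KL versus the sigma_Q << sigma_P approximation
# ||w||^2/(2 sigma_P^2) + (d/2) log(sigma_P^2/sigma_Q^2). Illustrative values.
import numpy as np

d, sigma_p, sigma_q = 1000, 1.0, 0.01
w = np.full(d, 0.1)  # ||w||^2 = 10

ratio = sigma_q**2 / sigma_p**2
kl_exact = (0.5 * d * (ratio - 1 - np.log(ratio))
            + np.sum(w**2) / (2 * sigma_p**2))
kl_approx = np.sum(w**2) / (2 * sigma_p**2) + 0.5 * d * np.log(1 / ratio)

print(f"exact KL  = {kl_exact:.1f}")
print(f"approx KL = {kl_approx:.1f}")  # larger than exact by about d/2
```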

Key Takeaways

  • PAC-Bayes bounds depend on $\mathrm{KL}(Q \| P)$, not on the raw size of the hypothesis class
  • The prior $P$ must be chosen before seeing data (or via data-splitting)
  • Non-vacuous bounds for deep networks exist when the prior is chosen well (e.g., initialization)
  • Catoni's bound is tighter than McAllester's in practice
  • The connection to compression and flat minima explains why PAC-Bayes captures generalization better than VC/Rademacher for overparameterized models

Exercises

Exercise (Core)

Problem

Let $P = \mathcal{N}(0, I_d)$ and $Q = \mathcal{N}(\mu, I_d)$ for some $\mu \in \mathbb{R}^d$. Compute $\mathrm{KL}(Q \| P)$. How does it scale with $d$ and $\|\mu\|$?

Exercise (Advanced)

Problem

A network has $d = 10^6$ parameters trained on $n = 50{,}000$ MNIST images with 0-1 loss. You set $P = \mathcal{N}(0, 0.1 \cdot I_d)$ and $Q = \mathcal{N}(w, 0.01 \cdot I_d)$ where $\|w\|^2 = 100$. The empirical risk of $Q$ (measured by sampling from $Q$ and averaging) is 0.02. Using McAllester's bound with $\delta = 0.05$, is the bound non-vacuous?

Exercise (Research)

Problem

Explain why data-dependent priors (using part of the training set to choose $P$) can dramatically improve PAC-Bayes bounds. What is the formal justification, and what fraction of data should you allocate to the prior?

References

Canonical:

  • McAllester, "PAC-Bayesian Model Averaging" (1999), Sections 1-3
  • Catoni, PAC-Bayesian Supervised Classification (2007), Chapters 1-2

Current:

  • Dziugaite & Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks" (2017)
  • Perez-Ortiz, Rivasplata, Shawe-Taylor, Szepesvari, "Tighter Risk Certificates for Neural Networks" (JMLR 2021)

Next Topics

  • Algorithmic stability: an alternative path to generalization bounds that does not require a prior
  • Implicit bias and modern generalization: why SGD finds solutions with small PAC-Bayes complexity

Last reviewed: April 2026
