
Modern Generalization

PAC-Bayes Bounds

Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail.

Advanced · Tier 2 · ~60 min

Why This Matters

Classical generalization bounds (VC dimension, Rademacher complexity) scale with the number of parameters. For modern neural networks with millions or billions of parameters, these bounds are vacuous: they predict generalization error larger than 1 even for networks that generalize well in practice.

PAC-Bayes bounds offer a different approach. Instead of measuring the size of the hypothesis class, they measure how much the learning algorithm's output (posterior) differs from a prior chosen before seeing data. If the learned solution is close to the prior in KL divergence, the bound is tight. This provides a path to non-vacuous generalization bounds for deep networks.

Formal Setup

Let $\mathcal{H}$ be a hypothesis space. Let $P$ be a prior distribution over $\mathcal{H}$, chosen before seeing data. Let $Q$ be a posterior distribution over $\mathcal{H}$, which may depend on the training data $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $\mathcal{D}$.

Definition

Stochastic Classifier

A stochastic classifier draws $h \sim Q$ and predicts using $h$. The population risk of $Q$ is:

$$R(Q) = \mathbb{E}_{h \sim Q}[R(h)] = \mathbb{E}_{h \sim Q}\left[\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]\right]$$

The empirical risk of $Q$ is:

$$\hat{R}_n(Q) = \mathbb{E}_{h \sim Q}\left[\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)\right]$$
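Since $\hat{R}_n(Q)$ is an expectation over draws from $Q$, it is typically estimated by Monte Carlo: sample classifiers from $Q$ and average their empirical losses. A minimal sketch with an illustrative toy dataset and an isotropic Gaussian posterior over linear-classifier weights (all names and values here are made up for illustration):

```python
# Monte Carlo estimate of the empirical risk R_hat_n(Q) of a stochastic
# linear classifier h(x) = sign(w . x). Dataset and posterior are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n points in R^2, labels defined by a fixed linear rule.
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Posterior Q = N(w_hat, sigma^2 I) centered on "trained" weights w_hat.
w_hat = np.array([1.0, 0.5])
sigma = 0.1

def empirical_risk(q_samples=1000):
    """Average 0-1 loss over h ~ Q, i.e. E_{h~Q}[(1/n) sum_i 0-1 loss]."""
    losses = []
    for _ in range(q_samples):
        w = w_hat + sigma * rng.normal(size=2)  # draw h ~ Q
        preds = np.sign(X @ w)
        losses.append(np.mean(preds != y))
    return float(np.mean(losses))

print(f"empirical risk of Q ~= {empirical_risk():.3f}")
```

Because the posterior here is centered on weights that label the toy data perfectly, the estimate is small but nonzero: sampled perturbations occasionally flip points near the decision boundary.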

Definition

KL Divergence

The Kullback-Leibler divergence from $Q$ to $P$ is:

$$\mathrm{KL}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\log \frac{Q(h)}{P(h)}\right]$$

This measures how far $Q$ deviates from the prior $P$. It is zero when $Q = P$ and infinite when $Q$ places mass where $P$ has none.
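For isotropic Gaussians, the family used in the canonical example later on this page, the KL divergence has a simple closed form, which makes these properties easy to check numerically (dimensions and parameters below are illustrative):

```python
# Closed-form KL(Q || P) for isotropic Gaussians with scalar std deviations.
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I))."""
    d = len(mu_q)
    ratio = sigma_q**2 / sigma_p**2
    return 0.5 * (d * (ratio - 1.0 - np.log(ratio))
                  + np.sum((mu_q - mu_p)**2) / sigma_p**2)

# KL is zero when Q = P ...
print(kl_gaussians(np.zeros(3), 1.0, np.zeros(3), 1.0))  # -> 0.0
# ... and grows with the squared distance between the means.
print(kl_gaussians(np.ones(3), 1.0, np.zeros(3), 1.0))   # -> 1.5
```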

Main Theorems

Theorem

McAllester PAC-Bayes Bound

Statement

For any prior $P$ over $\mathcal{H}$ chosen independently of $S$, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of $S$, simultaneously for all posteriors $Q$:

$$R(Q) \leq \hat{R}_n(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(n/\delta)}{2n}}$$
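The bound is a one-line computation once the empirical risk, KL term, sample size, and confidence level are fixed. A sketch with illustrative numbers, showing both a tight and a vacuous regime:

```python
# Numerical evaluation of McAllester's bound for illustrative inputs.
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """R(Q) <= R_hat_n(Q) + sqrt((KL(Q||P) + log(n/delta)) / (2n))."""
    return emp_risk + math.sqrt((kl + math.log(n / delta)) / (2 * n))

# KL much smaller than n -> tight, non-vacuous bound.
print(mcallester_bound(emp_risk=0.02, kl=100.0, n=50_000, delta=0.05))
# KL comparable to n -> the penalty dominates and the bound exceeds 1.
print(mcallester_bound(emp_risk=0.02, kl=200_000.0, n=50_000, delta=0.05))
```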

Intuition

The bound has two terms. The first is the training performance of the stochastic classifier. The second is a complexity penalty that grows with $\mathrm{KL}(Q \| P)$. If the posterior concentrates near the prior, the penalty is small. The bound holds simultaneously for all posteriors $Q$, so you can optimize $Q$ after seeing data without invalidating the guarantee.

Proof Sketch

The proof uses a change-of-measure argument. Start from the moment generating function: for any fixed $h$, $\mathbb{E}_S[e^{n(R(h) - \hat{R}_n(h))^2}]$ is bounded via Hoeffding's lemma. Then use the Donsker-Varadhan variational formula: $\mathbb{E}_{h \sim Q}[f(h)] \leq \mathrm{KL}(Q \| P) + \log \mathbb{E}_{h \sim P}[e^{f(h)}]$ for any measurable $f$. Apply this with $f(h) = n(R(h) - \hat{R}_n(h))^2$, take expectations over $S$, and apply Markov's inequality.

Why It Matters

Unlike VC and Rademacher bounds, the PAC-Bayes bound depends on the relationship between the learned solution and a reference prior, not on the raw capacity of the hypothesis class. For a network with 10 million parameters, if the learned weights are close to initialization (a natural prior), the KL term can be small enough to give non-vacuous bounds.

Failure Mode

The bound is only useful when $\mathrm{KL}(Q \| P)$ is much smaller than $n$. If the prior is poorly chosen (far from the learned solution), the KL term dominates and the bound becomes vacuous. In practice, choosing the prior is the main difficulty. The bound also requires bounded loss; for unbounded losses, modified versions with sub-Gaussian assumptions are needed.

Theorem

Catoni PAC-Bayes Bound

Statement

For any prior $P$, any $\delta > 0$, and any fixed $\lambda > 0$, with probability at least $1 - \delta$ over $S$, for all posteriors $Q$:

$$R(Q) \leq \frac{1}{1 - e^{-\lambda}}\left(1 - \exp\left(-\lambda \hat{R}_n(Q) - \frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{n}\right)\right)$$
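A sketch evaluating Catoni's bound for the same illustrative numbers used above, with $\lambda$ minimized over a grid. (Strictly, the theorem holds for each fixed $\lambda$; optimizing over a pre-chosen grid costs an extra union-bound term, which this sketch omits for simplicity.)

```python
# Catoni's PAC-Bayes bound, minimized over a grid of lambda values.
# Inputs are illustrative; the union-bound penalty for the grid is omitted.
import math

def catoni_bound(emp_risk, kl, n, delta, lam):
    """(1/(1 - e^-lam)) * (1 - exp(-lam*R_hat - (KL + log(1/delta))/n))."""
    inner = -lam * emp_risk - (kl + math.log(1 / delta)) / n
    return (1 - math.exp(inner)) / (1 - math.exp(-lam))

emp_risk, kl, n, delta = 0.02, 100.0, 50_000, 0.05
best = min(catoni_bound(emp_risk, kl, n, delta, 0.1 * k)
           for k in range(1, 100))
print(f"optimized Catoni bound ~= {best:.4f}")
```

For these inputs the optimized value comes out below the corresponding McAllester bound, consistent with the intuition that avoiding the square root tightens the result when both the empirical risk and the KL term are small.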

Intuition

Catoni's bound is tighter than McAllester's because it avoids the square root. The parameter $\lambda$ can be optimized (over a fixed grid, at the cost of a small union-bound penalty). For small empirical risk and small KL, Catoni gives substantially better numerical bounds.

Proof Sketch

Uses the exponential moment method directly. For a loss $\ell$ bounded in $[0, 1]$ with mean $R$, convexity of $x \mapsto e^{-\lambda x}$ gives $\log \mathbb{E}[e^{-\lambda \ell}] \leq \log(1 - R + Re^{-\lambda})$. Combine with Donsker-Varadhan, take expectations over $S$, and apply Markov's inequality.

Why It Matters

In the race to produce non-vacuous bounds for deep networks, Catoni's bound consistently produces tighter numerical values than McAllester's. Dziugaite and Roy (2017) used optimized Catoni bounds to get the first non-vacuous generalization bound for a neural network on MNIST.

Failure Mode

Same prior-dependence issue as McAllester. The $\lambda$ parameter must be chosen or optimized, adding another degree of freedom. For very small training error, the bound approaches the KL term divided by $n$, which is the irreducible cost of choosing the posterior.

Why PAC-Bayes Works for Deep Learning

The key insight: overparameterized networks have many parameters, but the effective complexity of the learned solution may be low. PAC-Bayes captures this through three mechanisms.

Compression view. If the posterior $Q$ is a narrow Gaussian centered on the learned weights and the prior $P$ is a broader Gaussian centered on initialization, then $\mathrm{KL}(Q \| P)$ measures how many bits are needed to describe the learned weights relative to initialization. Networks that learn "simple" functions (expressible with low-precision weights) have small KL.
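The effect of centering the prior at initialization can be made concrete. With equal posterior and prior widths, the Gaussian KL reduces to a squared-distance term, so a prior at initialization pays only for the distance training actually moved the weights. All values below are illustrative:

```python
# Compression view in numbers: a prior centered at initialization yields a
# much smaller KL than a prior at the origin when training moves the weights
# only a short distance. All scales here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d = 10_000
w_init = rng.normal(scale=0.1, size=d)               # initialization
w_trained = w_init + rng.normal(scale=0.01, size=d)  # small training update

sigma = 0.05  # equal prior/posterior widths: KL is a pure distance term

def kl_mean_shift(mu_q, mu_p, sigma):
    """KL between N(mu_q, sigma^2 I) and N(mu_p, sigma^2 I)."""
    return np.sum((mu_q - mu_p)**2) / (2 * sigma**2)

print("prior at origin:        ", kl_mean_shift(w_trained, np.zeros(d), sigma))
print("prior at initialization:", kl_mean_shift(w_trained, w_init, sigma))
```

The second KL is orders of magnitude smaller because it only measures the (small) displacement from initialization, not the magnitude of the weights themselves.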

Flat minima. The posterior width relates to the flatness of the loss landscape. If the loss is insensitive to perturbations of the weights (a flat minimum), then $Q$ can be broad, the KL can be small, and the bound is tight. Sharp minima require narrow $Q$ and yield larger KL.

Data-dependent priors. Using part of the training data to choose the prior (with the remaining data for the bound) can dramatically tighten PAC-Bayes bounds. Ambroladze, Parrado-Hernandez, and Shawe-Taylor (2007) formalized this data-splitting approach.

Comparison with Classical Bounds

| Bound type | Depends on | Deep net behavior |
| --- | --- | --- |
| VC dimension | Number of parameters | Vacuous (billions of params) |
| Rademacher complexity | Norm constraints on weights | Often vacuous for practical norms |
| PAC-Bayes | KL(posterior, prior) | Non-vacuous with good prior |
| Algorithmic stability | Sensitivity to one sample | Bounded for SGD with early stopping |

Common Confusions

Watch Out

PAC-Bayes does not require Bayesian inference

Despite the name, PAC-Bayes bounds are frequentist. The prior $P$ is not a belief; it is a reference distribution chosen for the bound. The posterior $Q$ is not computed by Bayes rule; it is any distribution, typically optimized to minimize the bound. You can apply PAC-Bayes to a deterministic network by setting $Q$ to be a Gaussian centered on the trained weights.

Watch Out

Non-vacuous does not mean tight

The first non-vacuous bound for MNIST (Dziugaite and Roy, 2017) was about 0.16 for a network with test error around 0.02. Non-vacuous means the bound is less than 1 (better than random guessing), which is a milestone but not a precise characterization of generalization.

Watch Out

The prior must be independent of training data

If you set $P$ after looking at the training data, the bound is invalid. The data-splitting trick (use half the data for $P$, half for the bound) is the standard workaround. Forgetting this independence requirement is a common error.

Canonical Examples

Example

Gaussian prior and posterior

Let $P = \mathcal{N}(0, \sigma_P^2 I_d)$ and $Q = \mathcal{N}(w, \sigma_Q^2 I_d)$, where $w$ are the trained weights and $d$ is the number of parameters. Then:

$$\mathrm{KL}(Q \| P) = \frac{d}{2}\left(\frac{\sigma_Q^2}{\sigma_P^2} - 1 - \log\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\|w\|^2}{2\sigma_P^2}$$

For $\sigma_Q \ll \sigma_P$ (narrow posterior, broad prior), this is approximately $\frac{\|w\|^2}{2\sigma_P^2} + \frac{d}{2}\log\frac{\sigma_P^2}{\sigma_Q^2}$. The first term penalizes large weights; the second penalizes many parameters that must be specified precisely. If $\sigma_Q$ can be large (the network is insensitive to weight perturbations), the second term shrinks.
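A sketch comparing the exact KL with the narrow-posterior approximation for illustrative values of $d$, $\sigma_P$, $\sigma_Q$, and $w$; the approximation drops the $\frac{d}{2}\left(1 - \frac{\sigma_Q^2}{\sigma_P^2}\right)$ term, so it overshoots by roughly $d/2$:

```python
# Exact Gaussian KL versus the sigma_Q << sigma_P approximation
# ||w||^2/(2 sigma_P^2) + (d/2) log(sigma_P^2/sigma_Q^2). Illustrative values.
import numpy as np

d, sigma_p, sigma_q = 1000, 1.0, 0.01
w = np.full(d, 0.1)  # ||w||^2 = 10

ratio = sigma_q**2 / sigma_p**2
kl_exact = (0.5 * d * (ratio - 1 - np.log(ratio))
            + np.sum(w**2) / (2 * sigma_p**2))
kl_approx = np.sum(w**2) / (2 * sigma_p**2) + 0.5 * d * np.log(1 / ratio)

print(f"exact KL  = {kl_exact:.1f}")
print(f"approx KL = {kl_approx:.1f}")  # larger than exact by about d/2
```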

Key Takeaways

  • PAC-Bayes bounds depend on $\mathrm{KL}(Q \| P)$, not on the raw size of the hypothesis class
  • The prior $P$ must be chosen before seeing data (or via data-splitting)
  • Non-vacuous bounds for deep networks exist when the prior is chosen well (e.g., initialization)
  • Catoni's bound is tighter than McAllester's in practice
  • The connection to compression and flat minima explains why PAC-Bayes captures generalization better than VC/Rademacher for overparameterized models

Exercises

Exercise (Core)

Problem

Let $P = \mathcal{N}(0, I_d)$ and $Q = \mathcal{N}(\mu, I_d)$ for some $\mu \in \mathbb{R}^d$. Compute $\mathrm{KL}(Q \| P)$. How does it scale with $d$ and $\|\mu\|$?

Exercise (Advanced)

Problem

A network has $d = 10^6$ parameters trained on $n = 50{,}000$ MNIST images with 0-1 loss. You set $P = \mathcal{N}(0, 0.1 \cdot I_d)$ and $Q = \mathcal{N}(w, 0.01 \cdot I_d)$ where $\|w\|^2 = 100$. The empirical risk of $Q$ (measured by sampling from $Q$ and averaging) is 0.02. Using McAllester's bound with $\delta = 0.05$, is the bound non-vacuous?

Exercise (Research)

Problem

Explain why data-dependent priors (using part of the training set to choose $P$) can dramatically improve PAC-Bayes bounds. What is the formal justification, and what fraction of data should you allocate to the prior?

References

Canonical:

  • McAllester, "PAC-Bayesian Model Averaging" (1999), Sections 1-3
  • Catoni, PAC-Bayesian Supervised Classification (2007), Chapters 1-2

Current:

  • Dziugaite & Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks" (2017)
  • Perez-Ortiz, Rivasplata, Shawe-Taylor, Szepesvari, "Tighter Risk Certificates for Neural Networks" (JMLR 2021)

Next Topics

  • Algorithmic stability: an alternative path to generalization bounds that does not require a prior
  • Implicit bias and modern generalization: why SGD finds solutions with small PAC-Bayes complexity

Last reviewed: April 2026
