ML Methods
Bayesian Neural Networks
Place a prior over neural network weights and compute the posterior given data. Exact inference is intractable, so we approximate: variational inference, MC dropout, Laplace approximation, SWAG. Principled uncertainty, high cost, limited scaling evidence.
Why This Matters
A standard neural network produces a point prediction. It tells you "the probability of class A is 0.87" but does not tell you how confident the model is in that 0.87. A Bayesian neural network (BNN) maintains a distribution over weights, producing a distribution over predictions. This gives principled uncertainty estimates: the model can distinguish "I am confident" from "I have not seen data like this."
Uncertainty quantification matters in high-stakes applications (medical diagnosis, autonomous driving) where knowing what you do not know is as important as getting the prediction right. BNNs build on Bayesian estimation and feedforward network foundations.
Formal Setup
Bayesian Neural Network
A BNN consists of:
- A prior distribution over weights $p(w)$ (e.g., Gaussian: $p(w) = \mathcal{N}(0, \sigma^2 I)$)
- A likelihood $p(\mathcal{D} \mid w)$ defined by the network
- The posterior $p(w \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$
Prediction for a new input $x^*$:
$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw$$
This integral averages predictions over all plausible weight configurations.
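When posterior samples are available, this integral is approximated by a Monte Carlo average. A minimal sketch, using a hypothetical one-parameter "network" $f(x; w) = \tanh(wx)$ and synthetic posterior samples standing in for real posterior draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    # Toy one-parameter "network": f(x; w) = tanh(w * x).
    return np.tanh(w * x)

# Pretend these are samples from the posterior p(w | D)
# (in practice they come from an approximate-inference method).
posterior_samples = rng.normal(loc=1.0, scale=0.3, size=1000)

x_star = 0.5
# Monte Carlo estimate of the predictive integral:
# p(y* | x*, D) ≈ (1/S) * sum_s f(x*; w_s)
preds = predict(x_star, posterior_samples)
predictive_mean = preds.mean()
predictive_std = preds.std()   # spread across weight samples = uncertainty
```

The spread of `preds` across weight samples is what a point-estimate network cannot provide.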
The Intractability Problem
Posterior Intractability for Neural Networks
Statement
The marginal likelihood $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw$ has no closed-form expression for neural networks with nonlinear activations. Therefore the posterior $p(w \mid \mathcal{D})$ cannot be computed exactly.
Intuition
The likelihood $p(\mathcal{D} \mid w)$ is a complex nonlinear function of $w$ (it involves composing affine maps and nonlinearities). Multiplying by a Gaussian prior and integrating over a high-dimensional weight space yields an integral that cannot be evaluated analytically. This is unlike Bayesian linear regression, where the posterior is Gaussian in closed form.
Proof Sketch
For a single-hidden-layer network with ReLU activations, the network output as a function of $w$ is piecewise polynomial, with exponentially many pieces in the number of hidden units. The integral of the product of such a piecewise function and a Gaussian has no closed form in general when the number of pieces is exponential.
Why It Matters
This intractability is the reason BNNs require approximations. Every BNN method is defined by its approximation to this posterior integral. The quality of the approximation determines the quality of the uncertainty estimates.
Failure Mode
Note the boundary of this result: for linear models with Gaussian priors and likelihoods, the posterior is Gaussian and can be computed exactly. The intractability is specific to nonlinear models.
Approximation Methods
Variational Inference (Bayes by Backprop)
Choose a tractable family (e.g., mean-field Gaussian: each weight has an independent Gaussian $q(w_i) = \mathcal{N}(\mu_i, \sigma_i^2)$). Minimize the KL divergence to the true posterior:
$$q^* = \arg\min_{q} \mathrm{KL}\big(q(w)\,\|\,p(w \mid \mathcal{D})\big)$$
This is equivalent to maximizing the ELBO (Evidence Lower Bound):
$$\mathcal{L}(q) = \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big)$$
Blundell et al. (2015) showed this can be optimized with standard backpropagation using the reparameterization trick: sample $\epsilon \sim \mathcal{N}(0, 1)$, set $w = \mu + \sigma \epsilon$, and differentiate through the sampling.
The cost is roughly 2x a standard forward pass (one pass with sampled $w$, plus evaluating the KL term) plus doubled parameter count (storing both $\mu$ and $\sigma$ for each weight).
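A minimal sketch of the reparameterized ELBO for a single-weight model $y = wx$, assuming a standard-normal prior, a fixed observation-noise standard deviation of 0.1, and synthetic data (all values here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 2x + noise, modeled as y = w*x with one weight.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def elbo_estimate(mu, sigma, n_samples=100):
    """Monte Carlo ELBO for q(w) = N(mu, sigma^2), prior p(w) = N(0, 1)."""
    eps = rng.normal(size=n_samples)
    w = mu + sigma * eps                 # reparameterization: w = mu + sigma*eps
    # Gaussian log-likelihood under each sampled w (noise std 0.1),
    # up to an additive constant.
    resid = y[None, :] - w[:, None] * x[None, :]
    log_lik = (-0.5 * resid**2 / 0.1**2).sum(axis=1).mean()
    # Analytic KL( N(mu, sigma^2) || N(0, 1) ).
    kl = 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)
    return log_lik - kl

# Gradient ascent on (mu, sigma) via autodiff would go here; we just
# evaluate the objective at two candidate means.
elbo_good = elbo_estimate(2.0, 0.3)
elbo_bad = elbo_estimate(0.0, 0.3)
```

Because `w = mu + sigma * eps` is differentiable in `mu` and `sigma`, the same Monte Carlo estimate can be backpropagated through in a real implementation.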
MC Dropout as Approximate Inference
MC Dropout as Variational Inference
Statement
Gal and Ghahramani (2016) showed that a network with dropout applied at test time is equivalent to variational inference with a specific approximate posterior. Running $T$ forward passes with different dropout masks and averaging the predictions approximates the predictive integral:
$$p(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{w}_t)$$
where $\hat{w}_t$ is the weight vector with dropout mask $t$ applied.
Intuition
Dropout randomly zeros out weights, producing a different subnetwork each time. Averaging over many subnetworks is like averaging over a distribution of weights. The variance across the predictions serves as an uncertainty estimate.
Proof Sketch
Define the variational distribution over each weight as a mixture of point masses, $q(w_i) = p\,\delta(w_i - \theta_i) + (1 - p)\,\delta(w_i)$, where $\theta_i$ are the trained weights and $p$ is the dropout keep probability. Gal and Ghahramani showed that the ELBO for this $q$ matches the dropout training objective (up to a constant related to the weight decay coefficient).
Why It Matters
MC dropout is the cheapest way to get uncertainty estimates from an existing network. You do not need to change the training procedure, only apply dropout at test time and run multiple forward passes.
Failure Mode
The variational family is restricted: the approximate posterior is a mixture of delta functions, not a smooth Gaussian. The quality of uncertainty estimates depends on the dropout rate, and the connection to the "correct" posterior is loose for deep networks. Empirically, MC dropout uncertainty tends to be poorly calibrated without additional tuning.
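The procedure can be sketched in a few lines. This assumes a hypothetical trained one-hidden-layer ReLU network (the weights below are random stand-ins) with inverted dropout kept on at test time:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the trained weights of a 1-hidden-layer network.
W1 = rng.normal(size=(16, 1))
W2 = rng.normal(size=(1, 16))

def forward(x, keep_prob=0.9):
    # Dropout applied at *test* time: each hidden unit kept with prob keep_prob.
    h = np.maximum(0.0, W1 @ x)        # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob
    h = h * mask / keep_prob           # inverted-dropout scaling
    return (W2 @ h).item()

x = np.array([[0.7]])
T = 200
preds = np.array([forward(x) for _ in range(T)])

# Predictive mean and variance across dropout masks:
mc_mean = preds.mean()
mc_var = preds.var()   # epistemic part; add observation noise for aleatoric
```

No retraining is needed; the only change from standard inference is keeping dropout active and repeating the forward pass.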
Laplace Approximation
Fit a Gaussian centered at the MAP estimate with covariance equal to the inverse Hessian of the negative log posterior:
$$q(w) = \mathcal{N}\big(w \mid w_{\mathrm{MAP}},\, H^{-1}\big)$$
where $H = -\nabla^2_w \log p(w \mid \mathcal{D})\big|_{w = w_{\mathrm{MAP}}}$.
This is cheap once you have the MAP estimate (which is just a trained network with weight decay). The challenge is computing or approximating $H$ for large networks. Daxberger et al. (2021) showed that applying the Laplace approximation only to the last layer is often sufficient and computationally tractable.
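For a last-layer Laplace approximation, the last layer is a linear model on fixed features, so the Hessian is available in closed form. A sketch under assumed values (synthetic features, Gaussian likelihood with noise variance $0.01$, unit-variance prior):

```python
import numpy as np

rng = np.random.default_rng(3)

# Phi: fixed penultimate-layer features; y: regression targets (synthetic).
Phi = rng.normal(size=(100, 8))
w_true = rng.normal(size=8)
y = Phi @ w_true + 0.1 * rng.normal(size=100)

noise_var, prior_var = 0.1**2, 1.0

# Hessian of the negative log posterior for a Gaussian likelihood + prior.
H = Phi.T @ Phi / noise_var + np.eye(8) / prior_var
# MAP estimate = ridge regression on the last layer.
w_map = np.linalg.solve(H, Phi.T @ y / noise_var)

# Laplace posterior: q(w) = N(w_map, H^{-1}); sample last-layer weights.
cov = np.linalg.inv(H)
w_samples = rng.multivariate_normal(w_map, cov, size=500)
```

Each sample in `w_samples` gives one plausible last layer; averaging their predictions approximates the predictive integral over the last-layer weights only.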
SWAG (Stochastic Weight Averaging Gaussian)
Maddox et al. (2019): collect weight snapshots $w_1, \dots, w_T$ during SGD training (after the learning rate has stabilized). Fit a Gaussian to these snapshots:
$$q(w) = \mathcal{N}(\bar{w}, \Sigma)$$
where $\bar{w}$ is the average weight vector and $\Sigma$ is a low-rank plus diagonal approximation to the covariance of the SGD trajectory. This requires no changes to the training procedure beyond saving checkpoints.
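A sketch of the diagonal part only (full SWAG adds a low-rank term from the last few deviations), using synthetic checkpoints in place of real SGD snapshots:

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend these are 20 weight snapshots (5 weights each) saved along an
# SGD trajectory after the learning rate stabilized.
snapshots = rng.normal(loc=1.0, scale=0.1, size=(20, 5))

w_bar = snapshots.mean(axis=0)     # SWA mean over checkpoints
diag_var = snapshots.var(axis=0)   # diagonal part of Sigma

# Sample approximate-posterior weights: w = w_bar + sqrt(diag_var) * eps.
eps = rng.normal(size=(100, 5))
w_samples = w_bar + np.sqrt(diag_var) * eps
```

The only training-time cost is saving checkpoints; sampling and averaging happen entirely at test time.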
Comparison of Methods
| Method | Extra training cost | Extra test cost | Parameters |
|---|---|---|---|
| Bayes by Backprop | 2x | $T$ forward passes | 2x |
| MC Dropout | None | $T$ forward passes | None |
| Laplace (last layer) | Hessian computation | 1 forward + sampling | Hessian storage |
| SWAG | Checkpoint storage | $T$ forward passes | Low-rank covariance |
Why BNNs Are Not Widely Used
- Computational cost: even cheap approximations (MC dropout) require multiple forward passes at test time.
- Prior specification: $\mathcal{N}(0, \sigma^2 I)$ is standard but has no principled justification for neural networks.
- Limited scaling evidence: BNNs have not been convincingly demonstrated to improve over deep ensembles at the scale of modern LLMs.
- Calibration is hard: the approximate posterior may give overconfident or underconfident uncertainty estimates.
Deep ensembles (Lakshminarayanan et al. 2017), which train independent networks and average predictions, often provide better uncertainty estimates than BNNs despite having no Bayesian interpretation.
Common Confusions
A BNN is not the same as a Bayesian treatment of hyperparameters
A BNN places a distribution over the weights (parameters). Bayesian hyperparameter optimization (e.g., choosing the learning rate) is a different problem. You can do one without the other.
High variance across MC samples does not always mean the model is uncertain about the input
High variance in MC dropout predictions can indicate model uncertainty, but it can also indicate that the dropout approximation is poor. The variance is a property of the approximate posterior, not necessarily of the true posterior.
Exercises
Problem
A BNN uses a Gaussian prior $p(w) = \mathcal{N}(0, \sigma^2 I)$ and is trained by maximizing the ELBO. Show that the KL term acts as a regularizer. What standard regularization technique does it correspond to when $q$ is a delta function $\delta(w - \hat{w})$?
Problem
MC dropout uses $T$ forward passes to estimate the predictive distribution. For a regression problem, the predictive mean is $\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} f_{\hat{w}_t}(x^*)$ and the predictive variance is $\frac{1}{T}\sum_{t=1}^{T} f_{\hat{w}_t}(x^*)^2 - \hat{\mu}^2 + \sigma^2$, where $\sigma^2$ is the observation noise. Explain why the first two terms capture epistemic uncertainty and the last term captures aleatoric uncertainty.
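As a numerical companion to this exercise, the decomposition can be checked on synthetic forward-pass outputs (the values below are illustrative assumptions, not a solution to the "explain why" part):

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2 = 0.05   # assumed aleatoric (observation) noise variance
T = 1000
# Pretend f[t] is the t-th stochastic forward pass at a fixed input x*.
f = rng.normal(loc=2.0, scale=0.3, size=T)

mean = f.mean()
# First two terms: spread of predictions across sampled weights (epistemic).
epistemic = (f**2).mean() - mean**2
# Last term: irreducible observation noise (aleatoric).
total_var = epistemic + sigma2
```

Shrinking the spread of `f` (more confident weights) drives `epistemic` toward zero, while `sigma2` remains as a floor on `total_var`.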
References
Canonical:
- MacKay, "A Practical Bayesian Framework for Backpropagation Networks", Neural Computation 1992
- Neal, Bayesian Learning for Neural Networks, Springer 1996, Chapters 1-4
Current:
- Blundell et al., "Weight Uncertainty in Neural Networks" (Bayes by Backprop), ICML 2015
- Gal and Ghahramani, "Dropout as a Bayesian Approximation", ICML 2016
- Maddox et al., "A Simple Baseline for Bayesian Inference in Deep Learning" (SWAG), NeurIPS 2019
- Daxberger et al., "Laplace Redux", NeurIPS 2021
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)