ML Methods
Bayesian Neural Networks
Place a prior over neural network weights and compute the posterior given data. Exact inference is intractable, so we approximate: variational inference, MC dropout, Laplace approximation, SWAG. Principled uncertainty, high cost, limited scaling evidence.
Why This Matters
A standard neural network produces a point prediction. It tells you "the probability of class A is 0.87" but does not tell you how confident the model is in that 0.87. A Bayesian neural network (BNN) maintains a distribution over weights, producing a distribution over predictions. This gives principled uncertainty estimates: the model can distinguish "I am confident" from "I have not seen data like this."
Uncertainty quantification matters in high-stakes applications (medical diagnosis, autonomous driving) where knowing what you do not know is as important as getting the prediction right. BNNs build on Bayesian estimation and feedforward network foundations.
Formal Setup
Bayesian Neural Network
A BNN consists of:
- A prior distribution over weights $p(w)$ (e.g., Gaussian: $p(w) = \mathcal{N}(0, \sigma^2 I)$)
- A likelihood $p(\mathcal{D} \mid w)$ defined by the network
- The posterior $p(w \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$
Prediction for a new input $x^*$:
$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw$$
This integral averages predictions over all plausible weight configurations.
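When posterior samples are available, this integral is approximated by a Monte Carlo average. A minimal sketch, using a hypothetical one-parameter "network" $f(x; w) = \tanh(wx)$ and synthetic posterior samples standing in for real posterior draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    # Toy one-parameter "network": f(x; w) = tanh(w * x).
    return np.tanh(w * x)

# Pretend these are samples from the posterior p(w | D)
# (in practice they come from an approximate-inference method).
posterior_samples = rng.normal(loc=1.0, scale=0.3, size=1000)

x_star = 0.5
# Monte Carlo estimate of the predictive integral:
# p(y* | x*, D) ≈ (1/S) * sum_s f(x*; w_s)
preds = predict(x_star, posterior_samples)
predictive_mean = preds.mean()
predictive_std = preds.std()   # spread across weight samples = uncertainty
```

The spread of `preds` across weight samples is what a point-estimate network cannot provide.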
The Intractability Problem
Posterior Intractability for Neural Networks
Statement
The marginal likelihood $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw$ has no closed-form expression for neural networks with nonlinear activations. Therefore the posterior $p(w \mid \mathcal{D})$ cannot be computed exactly.
Intuition
The likelihood $p(\mathcal{D} \mid w)$ is a complex nonlinear function of $w$ (it involves composing affine maps and nonlinearities). Multiplying by a Gaussian prior and integrating over a high-dimensional weight space yields an integral that cannot be evaluated analytically. This is unlike Bayesian linear regression, where the posterior is Gaussian in closed form.
Proof Sketch
For a single-hidden-layer network with ReLU activations, the network output as a function of $w$ is piecewise polynomial, with exponentially many pieces in the number of hidden units. The integral of the product of such a piecewise function and a Gaussian has no closed form in general when the number of pieces is exponential.
Why It Matters
This intractability is the reason BNNs require approximations. Every BNN method is defined by its approximation to this posterior integral. The quality of the approximation determines the quality of the uncertainty estimates.
Failure Mode
Note the boundary of this result: for linear models with Gaussian priors and likelihoods, the posterior is Gaussian and can be computed exactly. The intractability is specific to nonlinear models.
Approximation Methods
Variational Inference (Bayes by Backprop)
Choose a tractable family (e.g., mean-field Gaussian: each weight has an independent Gaussian $q(w_i) = \mathcal{N}(\mu_i, \sigma_i^2)$). Minimize the KL divergence to the true posterior:
$$q^* = \arg\min_{q} \mathrm{KL}\big(q(w)\,\|\,p(w \mid \mathcal{D})\big)$$
This is equivalent to maximizing the ELBO (Evidence Lower Bound):
$$\mathcal{L}(q) = \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big)$$
Blundell et al. (2015) showed this can be optimized with standard backpropagation using the reparameterization trick: sample $\epsilon \sim \mathcal{N}(0, 1)$, set $w = \mu + \sigma \epsilon$, and differentiate through the sampling.
The cost is roughly 2x a standard forward pass (one pass with sampled $w$, plus evaluating the KL term) plus doubled parameter count (storing both $\mu$ and $\sigma$ for each weight).
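A minimal sketch of the reparameterized ELBO for a single-weight model $y = wx$, assuming a standard-normal prior, a fixed observation-noise standard deviation of 0.1, and synthetic data (all values here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 2x + noise, modeled as y = w*x with one weight.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def elbo_estimate(mu, sigma, n_samples=100):
    """Monte Carlo ELBO for q(w) = N(mu, sigma^2), prior p(w) = N(0, 1)."""
    eps = rng.normal(size=n_samples)
    w = mu + sigma * eps                 # reparameterization: w = mu + sigma*eps
    # Gaussian log-likelihood under each sampled w (noise std 0.1),
    # up to an additive constant.
    resid = y[None, :] - w[:, None] * x[None, :]
    log_lik = (-0.5 * resid**2 / 0.1**2).sum(axis=1).mean()
    # Analytic KL( N(mu, sigma^2) || N(0, 1) ).
    kl = 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)
    return log_lik - kl

# Gradient ascent on (mu, sigma) via autodiff would go here; we just
# evaluate the objective at two candidate means.
elbo_good = elbo_estimate(2.0, 0.3)
elbo_bad = elbo_estimate(0.0, 0.3)
```

Because `w = mu + sigma * eps` is differentiable in `mu` and `sigma`, the same Monte Carlo estimate can be backpropagated through in a real implementation.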
MC Dropout as Approximate Inference
MC Dropout as Variational Inference
Statement
Gal and Ghahramani (2016) showed that a network with dropout applied at test time is equivalent to variational inference with a specific approximate posterior. Running $T$ forward passes with different dropout masks and averaging the predictions approximates the predictive integral:
$$p(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{w}_t)$$
where $\hat{w}_t$ is the weight vector with dropout mask $t$ applied.
Intuition
Dropout randomly zeros out weights, producing a different subnetwork each time. Averaging over many subnetworks is like averaging over a distribution of weights. The variance across the predictions serves as an uncertainty estimate.
Proof Sketch
Define the variational distribution over each weight as a mixture of point masses, $q(w_i) = p\,\delta(w_i - \theta_i) + (1 - p)\,\delta(w_i)$, where $\theta_i$ are the trained weights and $p$ is the dropout keep probability. Gal and Ghahramani showed that the ELBO for this $q$ matches the dropout training objective (up to a constant related to the weight decay coefficient).
Why It Matters
MC dropout is the cheapest way to get uncertainty estimates from an existing network. You do not need to change the training procedure, only apply dropout at test time and run multiple forward passes.
Failure Mode
The variational family is restricted: the approximate posterior is a mixture of delta functions, not a smooth Gaussian. The quality of uncertainty estimates depends on the dropout rate, and the connection to the "correct" posterior is loose for deep networks. Empirically, MC dropout uncertainty tends to be poorly calibrated without additional tuning.
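The procedure can be sketched in a few lines. This assumes a hypothetical trained one-hidden-layer ReLU network (the weights below are random stand-ins) with inverted dropout kept on at test time:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the trained weights of a 1-hidden-layer network.
W1 = rng.normal(size=(16, 1))
W2 = rng.normal(size=(1, 16))

def forward(x, keep_prob=0.9):
    # Dropout applied at *test* time: each hidden unit kept with prob keep_prob.
    h = np.maximum(0.0, W1 @ x)        # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob
    h = h * mask / keep_prob           # inverted-dropout scaling
    return (W2 @ h).item()

x = np.array([[0.7]])
T = 200
preds = np.array([forward(x) for _ in range(T)])

# Predictive mean and variance across dropout masks:
mc_mean = preds.mean()
mc_var = preds.var()   # epistemic part; add observation noise for aleatoric
```

No retraining is needed; the only change from standard inference is keeping dropout active and repeating the forward pass.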
Laplace Approximation
Fit a Gaussian centered at the MAP estimate with covariance equal to the inverse Hessian of the negative log posterior:
$$q(w) = \mathcal{N}\big(w \mid w_{\mathrm{MAP}},\, H^{-1}\big)$$
where $H = -\nabla^2_w \log p(w \mid \mathcal{D})\big|_{w = w_{\mathrm{MAP}}}$.
This is cheap once you have the MAP estimate (which is just a trained network with weight decay). The challenge is computing or approximating $H$ for large networks. Daxberger et al. (2021) showed that applying the Laplace approximation only to the last layer is often sufficient and computationally tractable.
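For a last-layer Laplace approximation, the last layer is a linear model on fixed features, so the Hessian is available in closed form. A sketch under assumed values (synthetic features, Gaussian likelihood with noise variance $0.01$, unit-variance prior):

```python
import numpy as np

rng = np.random.default_rng(3)

# Phi: fixed penultimate-layer features; y: regression targets (synthetic).
Phi = rng.normal(size=(100, 8))
w_true = rng.normal(size=8)
y = Phi @ w_true + 0.1 * rng.normal(size=100)

noise_var, prior_var = 0.1**2, 1.0

# Hessian of the negative log posterior for a Gaussian likelihood + prior.
H = Phi.T @ Phi / noise_var + np.eye(8) / prior_var
# MAP estimate = ridge regression on the last layer.
w_map = np.linalg.solve(H, Phi.T @ y / noise_var)

# Laplace posterior: q(w) = N(w_map, H^{-1}); sample last-layer weights.
cov = np.linalg.inv(H)
w_samples = rng.multivariate_normal(w_map, cov, size=500)
```

Each sample in `w_samples` gives one plausible last layer; averaging their predictions approximates the predictive integral over the last-layer weights only.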
SWAG (Stochastic Weight Averaging Gaussian)
Maddox et al. (2019): collect weight snapshots $w_1, \dots, w_T$ during SGD training (after the learning rate has stabilized). Fit a Gaussian to these snapshots:
$$q(w) = \mathcal{N}(\bar{w}, \Sigma)$$
where $\bar{w}$ is the average weight vector and $\Sigma$ is a low-rank plus diagonal approximation to the covariance of the SGD trajectory. This requires no changes to the training procedure beyond saving checkpoints.
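A sketch of the diagonal part only (full SWAG adds a low-rank term from the last few deviations), using synthetic checkpoints in place of real SGD snapshots:

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend these are 20 weight snapshots (5 weights each) saved along an
# SGD trajectory after the learning rate stabilized.
snapshots = rng.normal(loc=1.0, scale=0.1, size=(20, 5))

w_bar = snapshots.mean(axis=0)     # SWA mean over checkpoints
diag_var = snapshots.var(axis=0)   # diagonal part of Sigma

# Sample approximate-posterior weights: w = w_bar + sqrt(diag_var) * eps.
eps = rng.normal(size=(100, 5))
w_samples = w_bar + np.sqrt(diag_var) * eps
```

The only training-time cost is saving checkpoints; sampling and averaging happen entirely at test time.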
Comparison of Methods
| Method | Extra training cost | Extra test cost | Parameters |
|---|---|---|---|
| Bayes by Backprop | 2x | $T$ forward passes | 2x |
| MC Dropout | None | $T$ forward passes | None |
| Laplace (last layer) | Hessian computation | 1 forward + sampling | Hessian storage |
| SWAG | Checkpoint storage | $T$ forward passes | Low-rank covariance |
Why BNNs Are Not Widely Used
- Computational cost: even cheap approximations (MC dropout) require multiple forward passes at test time.
- Prior specification: $\mathcal{N}(0, \sigma^2 I)$ is standard but has no principled justification for neural networks.
- Limited scaling evidence: BNNs have not been convincingly demonstrated to improve over deep ensembles at the scale of modern LLMs.
- Calibration is hard: the approximate posterior may give overconfident or underconfident uncertainty estimates.
Deep ensembles (Lakshminarayanan et al. 2017), which train independent networks and average predictions, often provide better uncertainty estimates than BNNs despite having no Bayesian interpretation.
Common Confusions
A BNN is not the same as a Bayesian treatment of hyperparameters
A BNN places a distribution over the weights (parameters). Bayesian hyperparameter optimization (e.g., choosing the learning rate) is a different problem. You can do one without the other.
High variance across MC samples does not always mean the model is uncertain about the input
High variance in MC dropout predictions can indicate model uncertainty, but it can also indicate that the dropout approximation is poor. The variance is a property of the approximate posterior, not necessarily of the true posterior.
Exercises
Problem
A BNN uses a Gaussian prior $p(w) = \mathcal{N}(0, \sigma^2 I)$ and is trained by maximizing the ELBO. Show that the KL term acts as a regularizer. What standard regularization technique does it correspond to when $q$ is a delta function $\delta(w - \hat{w})$?
Problem
MC dropout uses $T$ forward passes to estimate the predictive distribution. For a regression problem, the predictive mean is $\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} f_{\hat{w}_t}(x^*)$ and the predictive variance is $\frac{1}{T}\sum_{t=1}^{T} f_{\hat{w}_t}(x^*)^2 - \hat{\mu}^2 + \sigma^2$, where $\sigma^2$ is the observation noise. Explain why the first two terms capture epistemic uncertainty and the last term captures aleatoric uncertainty.
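As a numerical companion to this exercise, the decomposition can be checked on synthetic forward-pass outputs (the values below are illustrative assumptions, not a solution to the "explain why" part):

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2 = 0.05   # assumed aleatoric (observation) noise variance
T = 1000
# Pretend f[t] is the t-th stochastic forward pass at a fixed input x*.
f = rng.normal(loc=2.0, scale=0.3, size=T)

mean = f.mean()
# First two terms: spread of predictions across sampled weights (epistemic).
epistemic = (f**2).mean() - mean**2
# Last term: irreducible observation noise (aleatoric).
total_var = epistemic + sigma2
```

Shrinking the spread of `f` (more confident weights) drives `epistemic` toward zero, while `sigma2` remains as a floor on `total_var`.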
References
Canonical:
- MacKay, "A Practical Bayesian Framework for Backpropagation Networks", Neural Computation 1992
- Neal, Bayesian Learning for Neural Networks, Springer 1996, Chapters 1-4
Current:
- Blundell et al., "Weight Uncertainty in Neural Networks" (Bayes by Backprop), ICML 2015
- Gal and Ghahramani, "Dropout as a Bayesian Approximation", ICML 2016
- Maddox et al., "A Simple Baseline for Bayesian Inference in Deep Learning" (SWAG), NeurIPS 2019
- Daxberger et al., "Laplace Redux", NeurIPS 2021
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)