Statistical Estimation
Bayesian Estimation
The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
Why This Matters
Maximum likelihood estimation gives you a single point estimate of the parameter. Bayesian estimation gives you a full distribution over the parameter. This distribution tells you not just "what is the best guess?" but "how uncertain am I, and in what directions?"
Bayesian methods dominate when you have small samples, informative prior knowledge, or need to quantify uncertainty. They are the foundation of Gaussian processes, Bayesian neural networks, and modern probabilistic programming. The Bernstein-von Mises theorem provides the bridge back to frequentist theory: as the sample size grows, the posterior converges to a Gaussian centered at the MLE.
Mental Model
You start with a prior belief $\pi(\theta)$ about the parameter $\theta$ before seeing any data. You observe data $x$. Bayes rule updates your belief to the posterior $\pi(\theta \mid x)$. The posterior is a compromise between the prior and the likelihood: with little data, the prior dominates; with lots of data, the likelihood dominates and the prior is washed out.
Think of the prior as your starting position and the data as a force that pulls you toward the truth. The posterior is where you end up after the pull. Ignoring the prior leads to the base rate fallacy, one of the most common reasoning errors in applied probability.
Formal Setup and Notation
Let $\theta$ be a parameter, $\pi(\theta)$ a prior distribution, and $p(x \mid \theta)$ the likelihood of observing data $x$ given $\theta$.
Posterior Distribution
The posterior distribution is given by Bayes rule:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}$$

where $p(x) = \int p(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood (evidence). The posterior is proportional to prior times likelihood:

$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$$
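The proportionality can be made concrete numerically. The sketch below (illustrative values, not from the text) discretizes a coin's heads probability onto a grid, multiplies a uniform prior by the binomial likelihood, and normalizes by the evidence:

```python
# Minimal sketch of "posterior ∝ prior × likelihood" on a grid:
# infer a coin's heads probability theta from 7 heads in 10 flips.
grid = [i / 100 for i in range(1, 100)]      # candidate values of theta
prior = [1.0 / len(grid)] * len(grid)        # uniform prior
like = [t**7 * (1 - t)**3 for t in grid]     # binomial likelihood (up to a constant)

unnorm = [p * l for p, l in zip(prior, like)]
evidence = sum(unnorm)                       # marginal likelihood p(x)
post = [u / evidence for u in unnorm]        # normalized posterior on the grid

theta_map = grid[post.index(max(post))]      # posterior mode, near 7/10
```

Note that the evidence is only needed for normalization; the shape of the posterior is fixed by prior times likelihood.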
MAP Estimation
The maximum a posteriori (MAP) estimator is the mode of the posterior:

$$\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \pi(\theta \mid x) = \arg\max_\theta \left[\log p(x \mid \theta) + \log \pi(\theta)\right]$$

MAP is like MLE with a regularization term $\log \pi(\theta)$. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, MAP is equivalent to $\ell_2$-regularized MLE (ridge regression).
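A one-dimensional sketch of this equivalence, with illustrative numbers: for $x_i \sim \mathcal{N}(\mu, \sigma^2)$ and prior $\mu \sim \mathcal{N}(0, \tau^2)$, the closed-form MAP estimate matches the minimizer of a ridge objective with $\lambda = \sigma^2/\tau^2$:

```python
# Sketch: MAP with a zero-mean Gaussian prior equals ridge-regularized MLE.
# Assumed setup: x_i ~ N(mu, sigma^2), known sigma^2; prior mu ~ N(0, tau^2).
sigma2, tau2 = 1.0, 0.25
data = [2.1, 1.9, 2.3, 2.0]
n, xbar = len(data), sum(data) / len(data)

# Closed-form MAP: precision-weighted shrinkage toward the prior mean 0.
mu_map = (n * xbar / sigma2) / (n / sigma2 + 1 / tau2)

# Ridge view: minimize sum_i (x_i - mu)^2 + lam * mu^2 with lam = sigma^2 / tau^2
# (set the derivative to zero: mu = sum(x_i) / (n + lam)).
lam = sigma2 / tau2
mu_ridge = sum(data) / (n + lam)
```

The effective regularization strength $\lambda = \sigma^2/\tau^2$ makes the tradeoff explicit: a tighter prior (small $\tau^2$) means heavier shrinkage toward zero.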
Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior. Conjugacy makes Bayesian updates analytically tractable.
Conjugate Prior Updates
Statement
The three most important conjugate pairs are:
Beta-Binomial. Prior: $p \sim \mathrm{Beta}(\alpha, \beta)$. Likelihood: $k$ successes in $n$ Bernoulli trials. Posterior: $\mathrm{Beta}(\alpha + k,\ \beta + n - k)$.
Normal-Normal. Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Likelihood: $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Posterior: $\mathcal{N}(\mu_n, \tau_n^2)$ where:

$$\mu_n = \frac{\mu_0/\tau^2 + n\bar{x}/\sigma^2}{1/\tau^2 + n/\sigma^2}, \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2}$$
Gamma-Poisson. Prior: $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$. Likelihood: $x_1, \dots, x_n \sim \mathrm{Poisson}(\lambda)$. Posterior: $\mathrm{Gamma}\left(\alpha + \sum_i x_i,\ \beta + n\right)$.
Intuition
In each case, the prior parameters act as "pseudo-observations." For the beta-binomial, $\alpha$ acts like prior successes and $\beta$ like prior failures. The posterior combines these with the actual observed counts. As $n \to \infty$, the prior contribution vanishes and the posterior concentrates on the MLE.
Why It Matters
Conjugate priors are the workhorses of practical Bayesian analysis. They give closed-form posteriors, making inference fast and interpretable. Even when the true prior is not conjugate, conjugate families serve as tractable approximations.
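The updates above reduce to a few lines of arithmetic. A minimal sketch of two of them, with illustrative function names and numbers:

```python
# Sketch of conjugate updates as pseudo-count arithmetic (names illustrative).
def beta_binomial(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta posterior."""
    return alpha + k, beta + (n - k)

def gamma_poisson(alpha, beta, counts):
    """Gamma(alpha, beta) prior + Poisson observations -> Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)

# Beta(2, 2) prior (2 pseudo-successes, 2 pseudo-failures), then 15 heads in 20 flips.
a_post, b_post = beta_binomial(2, 2, 15, 20)
post_mean = a_post / (a_post + b_post)   # 17 / 24, between prior mean 0.5 and MLE 0.75
```

The posterior mean landing between the prior mean and the MLE is the "compromise" from the mental model made exact.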
The Normal-Normal Update in Detail
The normal-normal case is particularly revealing. The posterior mean is a precision-weighted average of the prior mean and the data mean:

$$\mu_n = \frac{\lambda_0\,\mu_0 + n\lambda\,\bar{x}}{\lambda_0 + n\lambda}$$

where $\lambda_0 = 1/\tau^2$ is the prior precision and $\lambda = 1/\sigma^2$ is the data precision per observation. The posterior precision is the sum $\lambda_0 + n\lambda$. More data means more precision. A tighter prior (large $\lambda_0$) means the prior has more influence.

With $n = 0$ (no data), the posterior equals the prior. As $n \to \infty$, the posterior mean approaches $\bar{x}$ and the posterior variance approaches $\sigma^2/n$, recovering the MLE and its sampling distribution.
Posterior Consistency: Bernstein-von Mises
Bernstein-von Mises Theorem
Statement
Under regularity conditions, as $n \to \infty$, the posterior distribution converges in total variation to a Gaussian centered at the MLE:

$$\pi(\theta \mid x_{1:n}) \approx \mathcal{N}\!\left(\hat\theta_{\mathrm{MLE}},\ \frac{1}{n\,I(\theta_0)}\right)$$

where $I(\theta_0)$ is the Fisher information at the true parameter $\theta_0$. The posterior concentrates around the true parameter at rate $1/\sqrt{n}$ and its width matches the frequentist standard error.
Intuition
With enough data, the likelihood overwhelms the prior and the posterior becomes approximately Gaussian centered at the MLE. The prior does not matter asymptotically (as long as it puts positive mass near the truth). Bayesian and frequentist inference agree in the large-sample limit.
Proof Sketch
Expand the log-posterior around the MLE using a Taylor expansion. The log-likelihood term dominates the log-prior for large $n$. The quadratic approximation to the log-likelihood gives a Gaussian with precision $n\,I(\hat\theta_{\mathrm{MLE}})$. The prior contributes a term of order $O(1)$ that becomes negligible compared to the $O(n)$ likelihood term.
Why It Matters
This theorem bridges Bayesian and frequentist statistics. It justifies using Bayesian credible intervals as approximate confidence intervals in large samples. It also shows that Bayesian inference is consistent: the posterior concentrates on the truth.
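The theorem can be checked numerically in the Bernoulli model, where the posterior is available exactly. A sketch with an assumed flat Beta(1, 1) prior: the exact posterior standard deviation should approach the frequentist standard error $\sqrt{\hat{p}(1-\hat{p})/n} = 1/\sqrt{n\,I(\hat{p})}$ as $n$ grows.

```python
import math, random

# Numerical sketch of Bernstein-von Mises for Bernoulli data with a
# flat Beta(1, 1) prior (illustrative values).
random.seed(0)
p_true, n = 0.3, 5000
k = sum(random.random() < p_true for _ in range(n))   # number of successes

a, b = 1 + k, 1 + n - k                               # exact Beta posterior
post_var = a * b / ((a + b) ** 2 * (a + b + 1))       # Beta variance formula
post_sd = math.sqrt(post_var)

p_hat = k / n
se_mle = math.sqrt(p_hat * (1 - p_hat) / n)           # 1 / sqrt(n * I(p_hat))

rel_gap = abs(post_sd - se_mle) / se_mle              # shrinks like O(1/n)
```

Rerunning with smaller $n$ shows the gap widening, which is exactly where the prior still matters.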
Failure Mode
The theorem requires the model to be correctly specified (the true parameter $\theta_0$ is in the model family). If the model is misspecified, the posterior concentrates on the KL-closest parameter to the truth, not the truth itself. The theorem also fails for non-regular models, infinite-dimensional parameters (nonparametric Bayesian models require separate theory), and improper priors in some cases.
Credible Intervals vs Confidence Intervals
A 95% credible interval $C$ satisfies $P(\theta \in C \mid x) = 0.95$. It is a statement about $\theta$ given the data.
A 95% confidence interval satisfies: if you repeat the experiment many times, 95% of the intervals will contain the true $\theta$. It is a statement about the procedure, not about $\theta$ for the observed data.
These are structurally different interpretations. However, by Bernstein-von Mises, Bayesian credible intervals and frequentist confidence intervals coincide asymptotically. In finite samples they can differ, especially when the prior is informative.
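The asymptotic agreement is easy to see numerically. A sketch (illustrative data; Monte Carlo draws stand in for exact Beta quantiles) comparing a 95% equal-tailed credible interval under a flat prior with the 95% Wald confidence interval for a binomial proportion:

```python
import math, random

# Sketch: for large n, a flat-prior credible interval and a Wald confidence
# interval for a binomial proportion nearly coincide (illustrative data).
random.seed(1)
n, k = 2000, 1200                              # observed: 1200 successes in 2000 trials

# 95% equal-tailed credible interval via draws from the exact Beta posterior.
draws = sorted(random.betavariate(1 + k, 1 + n - k) for _ in range(100_000))
cred = (draws[2_500], draws[97_500])           # empirical 2.5% and 97.5% quantiles

# 95% Wald confidence interval centered at the MLE.
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
conf = (p_hat - 1.96 * se, p_hat + 1.96 * se)

gap = max(abs(cred[0] - conf[0]), abs(cred[1] - conf[1]))   # small for large n
```

With a small $n$ or a strongly informative prior, the same comparison shows the two intervals pulling apart, as the text notes.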
When Bayesian is Better
Bayesian estimation shines in several settings:
- Small samples. The prior regularizes estimation when data is scarce. Without a prior, MLE can overfit or be undefined
- Informative priors. When domain knowledge constrains the parameter (e.g., a probability must be near 0.5, a physical constant is known approximately), the prior encodes this and improves the estimate
- Uncertainty quantification. The full posterior gives calibrated uncertainty bands, not just a point estimate. This is critical for decision making under uncertainty
- Hierarchical models. Bayesian methods naturally handle multi-level structure where parameters at one level serve as priors for the next
Common Confusions
The prior is not arbitrary
A common criticism is that the prior is "subjective." But the prior can be chosen systematically: use domain knowledge, previous studies, or weakly-informative priors that regularize without strongly constraining. By Bernstein-von Mises, reasonable priors all lead to the same posterior with enough data. The choice matters most when data is scarce, which is exactly when you should use prior knowledge. The Monty Hall problem is a classic example where ignoring the prior (uniform over doors) and failing to update correctly leads to the wrong answer.
MAP is not full Bayesian inference
MAP gives a point estimate (the posterior mode). Full Bayesian inference uses the entire posterior distribution. MAP ignores posterior uncertainty and can give misleading results when the posterior is skewed or multimodal. Use the posterior mean or full posterior for uncertainty quantification.
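A skewed posterior makes the gap concrete. Sketch, using an assumed Gamma posterior (closed-form mode and mean):

```python
# Sketch: for a right-skewed Gamma(a, b) posterior, the MAP (mode) and the
# posterior mean disagree, so a MAP point estimate hides the skew.
a, b = 2.0, 1.0
map_est = (a - 1) / b    # mode of Gamma(a, b), valid for a >= 1
mean_est = a / b         # posterior mean
```

Here the mode is half the mean; any decision based on the MAP alone ignores the heavy right tail.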
Flat priors are not always noninformative
A flat (uniform) prior on $\theta$ is not flat on $\log\theta$ or $\theta^2$. The notion of "noninformative" depends on the parameterization. Jeffreys prior is invariant to reparameterization but can be improper (not integrating to 1). Reference priors and weakly-informative priors are more practical alternatives.
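The parameterization-dependence takes three lines to demonstrate by simulation:

```python
import random

# Sketch: a flat prior is not parameterization-invariant. If theta is
# uniform on (0, 1), then phi = theta**2 is NOT uniform: half its mass
# lies below 0.25, because theta < 0.5 half the time.
random.seed(0)
draws = [random.random() ** 2 for _ in range(100_000)]
frac_below = sum(d < 0.25 for d in draws) / len(draws)
# frac_below is near 0.5, not the 0.25 a uniform phi would give.
```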
Summary
- Posterior $\propto$ prior $\times$ likelihood. The posterior is a compromise between prior belief and observed data
- Conjugate priors give closed-form posteriors: beta-binomial, normal-normal, gamma-Poisson
- MAP estimation adds $\log \pi(\theta)$ to the log-likelihood; with a Gaussian prior this is equivalent to $\ell_2$ regularization
- Bernstein-von Mises: the posterior converges to $\mathcal{N}\left(\hat\theta_{\mathrm{MLE}},\ (n\,I(\theta_0))^{-1}\right)$ as $n \to \infty$
- Credible intervals are probability statements about $\theta$; confidence intervals are probability statements about the procedure
- Bayesian methods are best when data is scarce, priors are informative, or you need full uncertainty quantification
Exercises
Problem
You have a coin that you believe is roughly fair. You choose a $\mathrm{Beta}(\alpha, \beta)$ prior for the probability of heads $p$ (symmetric, reflecting rough fairness). You flip the coin 20 times and observe 15 heads. What is the posterior distribution? What is the posterior mean?
Problem
Show that MAP estimation with a Gaussian prior and Gaussian likelihood is equivalent to ridge regression. What is the effective regularization parameter in terms of $\sigma^2$ and $\tau^2$?
Problem
The Bernstein-von Mises theorem requires the model to be correctly specified. What happens to the posterior when the model is misspecified? Give a concrete example where the posterior concentrates on a parameter value that is not the "true" parameter, and explain what that parameter represents.
References
Canonical:
- Berger, Statistical Decision Theory and Bayesian Analysis (1985)
- Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapters 1-3
Current:
- McElreath, Statistical Rethinking (2nd ed., 2020)
- van der Vaart, Asymptotic Statistics (1998), Chapter 10 (Bernstein-von Mises)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
The natural next steps from Bayesian estimation:
- Gaussian processes for ML: nonparametric Bayesian inference over functions
- Variational autoencoders: variational inference when the posterior is intractable
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
Builds on This
- Anthropic Bias and Observation Selection (Layer 3)
- Bayesian Neural Networks (Layer 3)
- Bayesian State Estimation (Layer 2)
- Causal Inference and the Ladder of Causation (Layer 3)
- Decision Theory Foundations (Layer 2)
- Detection Theory (Layer 2)
- Meta-Analysis (Layer 2)
- PAC-Bayes Bounds (Layer 3)
- Small Area Estimation (Layer 3)