
Statistical Estimation

Bayesian Estimation

The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Covers conjugate priors, MAP estimation, and the Bernstein-von Mises theorem, which shows that the posterior concentrates around the true parameter.


Why This Matters

Maximum likelihood estimation gives you a single point estimate of the parameter. Bayesian estimation gives you a full distribution over the parameter. This distribution tells you not just "what is the best guess?" but "how uncertain am I, and in what directions?"

Bayesian methods dominate when you have small samples, informative prior knowledge, or need to quantify uncertainty. They are the foundation of Gaussian processes, Bayesian neural networks, and modern probabilistic programming. The Bernstein-von Mises theorem provides the bridge back to frequentist theory: as the sample size grows, the posterior converges to a Gaussian centered at the MLE.

Mental Model

You start with a prior belief $p(\theta)$ about the parameter before seeing any data. You observe data $\mathbf{x}$. Bayes rule updates your belief to the posterior $p(\theta | \mathbf{x})$. The posterior is a compromise between the prior and the likelihood: with little data, the prior dominates; with lots of data, the likelihood dominates and the prior is washed out.

Think of the prior as your starting position and the data as a force that pulls you toward the truth. The posterior is where you end up after the pull. Ignoring the prior leads to the base rate fallacy, one of the most common reasoning errors in applied probability.

Formal Setup and Notation

Let $\theta \in \Theta$ be a parameter, $p(\theta)$ a prior distribution, and $p(\mathbf{x} | \theta)$ the likelihood of observing data $\mathbf{x}$ given $\theta$.

Definition

Posterior Distribution

The posterior distribution is given by Bayes rule:

$$p(\theta | \mathbf{x}) = \frac{p(\mathbf{x} | \theta) \, p(\theta)}{p(\mathbf{x})}$$

where $p(\mathbf{x}) = \int p(\mathbf{x} | \theta) \, p(\theta) \, d\theta$ is the marginal likelihood (evidence). The posterior is proportional to prior times likelihood:

$$p(\theta | \mathbf{x}) \propto p(\theta) \cdot p(\mathbf{x} | \theta)$$
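This proportionality is all you need to compute a posterior numerically: evaluate prior times likelihood on a grid and normalize. A minimal sketch for a Gaussian mean with known variance (all data and hyperparameter values here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=30)          # data with known sigma = 1

theta = np.linspace(-3.0, 5.0, 1001)       # grid over the parameter
dtheta = theta[1] - theta[0]
prior = stats.norm(0.0, 2.0).pdf(theta)    # N(0, 2^2) prior (illustrative)

# log-likelihood of the whole sample at each grid point
loglik = stats.norm(theta[:, None], 1.0).logpdf(x).sum(axis=1)

# posterior ∝ prior * likelihood; normalizing by the grid sum approximates
# dividing by the evidence p(x)
unnorm = prior * np.exp(loglik - loglik.max())
posterior = unnorm / (unnorm.sum() * dtheta)
```

Subtracting `loglik.max()` before exponentiating avoids underflow; the constant cancels in the normalization.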

Definition

MAP Estimation

The maximum a posteriori (MAP) estimator is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta | \mathbf{x}) = \arg\max_\theta \; [\log p(\mathbf{x} | \theta) + \log p(\theta)]$$

MAP is like MLE with a regularization term $\log p(\theta)$. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$, MAP is equivalent to $\ell_2$-regularized MLE (ridge regression).
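The equivalence can be checked directly for a Gaussian mean: the MAP estimate under a $\mathcal{N}(0, \tau^2)$ prior and the ridge solution with $\lambda = \sigma^2/\tau^2$ coincide. A sketch with illustrative variances:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 0.5                  # likelihood and prior variances (illustrative)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)
n = len(x)

# MAP: maximize sum_i log p(x_i | theta) + log p(theta) for theta ~ N(0, tau2).
# Setting the derivative to zero gives theta (n/sigma2 + 1/tau2) = sum(x)/sigma2:
theta_map = x.sum() / (n + sigma2 / tau2)

# Ridge view: minimize sum_i (x_i - theta)^2 + lam * theta^2 with lam = sigma2/tau2
lam = sigma2 / tau2
theta_ridge = x.sum() / (n + lam)
```

Both estimates shrink the sample mean toward the prior mean 0; the shrinkage strength is controlled by $\sigma^2/\tau^2$.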

Conjugate Priors

A prior $p(\theta)$ is conjugate to a likelihood $p(\mathbf{x}|\theta)$ if the posterior $p(\theta|\mathbf{x})$ belongs to the same family as the prior. Conjugacy makes Bayesian updates analytically tractable.

Proposition

Conjugate Prior Updates

Statement

The three most important conjugate pairs are:

Beta-Binomial. Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $k$ successes in $n$ Bernoulli trials. Posterior: $\theta | k \sim \text{Beta}(\alpha + k, \beta + n - k)$.

Normal-Normal. Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Likelihood: $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Posterior: $\mu | \mathbf{x} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ where:

$$\mu_n = \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2}, \qquad \sigma_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}$$

Gamma-Poisson. Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$. Likelihood: $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$. Posterior: $\lambda | \mathbf{x} \sim \text{Gamma}(\alpha + \sum_i x_i, \beta + n)$.

Intuition

In each case, the prior parameters act as "pseudo-observations." For the beta-binomial, $\alpha$ acts like prior successes and $\beta$ like prior failures. The posterior combines these with the actual observed counts. As $n \to \infty$, the prior contribution vanishes and the posterior concentrates on the MLE.

Why It Matters

Conjugate priors are the workhorses of practical Bayesian analysis. They give closed-form posteriors, making inference fast and interpretable. Even when the true prior is not conjugate, conjugate families serve as tractable approximations.
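The closed-form updates above are easy to verify with scipy. A sketch of the beta-binomial and gamma-Poisson updates (all counts and hyperparameters illustrative):

```python
import numpy as np
from scipy import stats

# Beta-Binomial: Beta(a, b) prior, k successes in n trials
a, b = 2.0, 2.0
n, k = 20, 15
post = stats.beta(a + k, b + n - k)
# posterior mean is (a + k) / (a + b + n) = 17/24

# Gamma-Poisson: Gamma(alpha, rate beta) prior, Poisson counts x
# scipy parameterizes gamma by shape and scale = 1/rate
alpha, beta = 3.0, 1.0
x = np.array([2, 4, 1, 3])
post_rate = stats.gamma(alpha + x.sum(), scale=1.0 / (beta + len(x)))
# posterior mean is (alpha + sum x) / (beta + n) = 13/5
```

Note the scipy convention: `stats.gamma` takes a *scale* parameter, so a rate-$\beta$ Gamma prior becomes `scale=1/beta`.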

The Normal-Normal Update in Detail

The normal-normal case is particularly revealing. The posterior mean is a precision-weighted average of the prior mean and the data mean:

$$\mu_n = \frac{\tau_0}{\tau_0 + n\tau} \mu_0 + \frac{n\tau}{\tau_0 + n\tau} \bar{x}$$

where $\tau_0 = 1/\sigma_0^2$ is the prior precision and $\tau = 1/\sigma^2$ is the data precision per observation. The posterior precision is the sum $\tau_0 + n\tau$. More data means more precision. A tighter prior (large $\tau_0$) means the prior has more influence.

With $n = 0$ (no data), the posterior equals the prior. As $n \to \infty$, the posterior mean approaches $\bar{x}$ and the posterior variance approaches $\sigma^2/n$, recovering the MLE and its sampling distribution.
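The precision-weighted update is a few lines of numpy. A sketch with illustrative prior and data values:

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma0_2 = 0.0, 4.0        # prior mean and variance (illustrative)
sigma2 = 1.0                    # known observation variance
x = rng.normal(3.0, 1.0, size=25)
n, xbar = len(x), x.mean()

tau0, tau = 1.0 / sigma0_2, 1.0 / sigma2   # prior and per-observation precision
post_prec = tau0 + n * tau                 # precisions add
mu_n = (tau0 * mu0 + n * tau * xbar) / post_prec
sigma_n2 = 1.0 / post_prec
```

The posterior mean lands between the prior mean and the sample mean, and the posterior variance is smaller than both the prior variance and $\sigma^2/n$.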

Posterior Consistency: Bernstein-von Mises

Theorem

Bernstein-von Mises Theorem

Statement

Under regularity conditions, as $n \to \infty$, the posterior distribution converges in total variation to a Gaussian centered at the MLE:

$$\left\| p(\theta | \mathbf{x}_n) - \mathcal{N}\!\left(\hat{\theta}_{\text{MLE}}, \frac{1}{n I(\theta_0)}\right) \right\|_{\text{TV}} \xrightarrow{P} 0$$

where $I(\theta_0)$ is the Fisher information at the true parameter. The posterior concentrates around the true parameter at rate $1/\sqrt{n}$ and its width matches the frequentist standard error.

Intuition

With enough data, the likelihood overwhelms the prior and the posterior becomes approximately Gaussian centered at the MLE. The prior does not matter asymptotically (as long as it puts positive mass near the truth). Bayesian and frequentist inference agree in the large-sample limit.

Proof Sketch

Expand the log-posterior around the MLE using a Taylor expansion. The log-likelihood term dominates the log-prior for large $n$. The quadratic approximation to the log-likelihood gives a Gaussian with precision $n I(\theta_0)$. The prior contributes a term of order $O(1)$ that becomes negligible compared to the $O(n)$ likelihood term.
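The convergence is easy to check numerically. A sketch for a Bernoulli model with a Beta prior, comparing the exact posterior density against the Bernstein-von Mises Gaussian (sample size and prior are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta0, n = 0.3, 5000
k = rng.binomial(n, theta0)

post = stats.beta(2 + k, 2 + n - k)        # exact posterior under a Beta(2, 2) prior

# BvM Gaussian: centered at the MLE with variance 1/(n I(theta)),
# where the Bernoulli Fisher information is I(theta) = 1/(theta (1 - theta))
mle = k / n
approx = stats.norm(mle, np.sqrt(mle * (1 - mle) / n))

# Maximum density gap over +-4 standard errors, relative to the peak height
grid = np.linspace(mle - 4 * approx.std(), mle + 4 * approx.std(), 400)
rel_gap = np.max(np.abs(post.pdf(grid) - approx.pdf(grid))) / approx.pdf(mle)
```

At $n = 5000$ the two densities are nearly indistinguishable; rerunning with small $n$ (say 20) shows the Beta posterior's skew, which the Gaussian approximation misses.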

Why It Matters

This theorem bridges Bayesian and frequentist statistics. It justifies using Bayesian credible intervals as approximate confidence intervals in large samples. It also shows that Bayesian inference is consistent: the posterior concentrates on the truth.

Failure Mode

The theorem requires the model to be correctly specified ($\theta_0$ is in the model). If the model is misspecified, the posterior concentrates on the KL-closest parameter to the truth, not the truth itself. The theorem also fails for non-regular models, infinite-dimensional parameters (nonparametric Bayesian models require separate theory), and improper priors in some cases.

Credible Intervals vs Confidence Intervals

A 95% credible interval $[a, b]$ satisfies $P(\theta \in [a,b] \mid \mathbf{x}) = 0.95$. It is a statement about $\theta$ given the data.

A 95% confidence interval $[a, b]$ satisfies: if you repeat the experiment many times, 95% of the intervals will contain the true $\theta$. It is a statement about the procedure, not about $\theta$ for the observed data.

These are structurally different interpretations. However, by Bernstein-von Mises, Bayesian credible intervals and frequentist confidence intervals coincide asymptotically. In finite samples they can differ, especially when the prior is informative.
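A concrete comparison for a binomial proportion, using an equal-tailed Beta credible interval against a Wald confidence interval (counts are illustrative):

```python
import numpy as np
from scipy import stats

n, k = 50, 12                              # illustrative binomial data

# 95% equal-tailed credible interval under a flat Beta(1, 1) prior
post = stats.beta(1 + k, 1 + n - k)
cred = post.ppf([0.025, 0.975])

# 95% Wald confidence interval around the MLE
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
conf = (p_hat - 1.96 * se, p_hat + 1.96 * se)
```

At this sample size the two intervals largely overlap but are not identical; with an informative prior, or very small $n$, they diverge further.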

When Bayesian is Better

Bayesian estimation shines in several settings:

  • Small samples. The prior regularizes estimation when data is scarce. Without a prior, MLE can overfit or be undefined
  • Informative priors. When domain knowledge constrains the parameter (e.g., a probability must be near 0.5, a physical constant is known approximately), the prior encodes this and improves the estimate
  • Uncertainty quantification. The full posterior gives calibrated uncertainty bands, not just a point estimate. This is critical for decision making under uncertainty
  • Hierarchical models. Bayesian methods naturally handle multi-level structure where parameters at one level serve as priors for the next

Common Confusions

Watch Out

The prior is not arbitrary

A common criticism is that the prior is "subjective." But the prior can be chosen systematically: use domain knowledge, previous studies, or weakly-informative priors that regularize without strongly constraining. By Bernstein-von Mises, reasonable priors all lead to the same posterior with enough data. The choice matters most when data is scarce, which is exactly when you should use prior knowledge. The Monty Hall problem is a classic example where ignoring the prior (uniform over doors) and failing to update correctly leads to the wrong answer.

Watch Out

MAP is not full Bayesian inference

MAP gives a point estimate (the posterior mode). Full Bayesian inference uses the entire posterior distribution. MAP ignores posterior uncertainty and can give misleading results when the posterior is skewed or multimodal. Use the posterior mean or full posterior for uncertainty quantification.

Watch Out

Flat priors are not always noninformative

A flat (uniform) prior on $\theta$ is not flat on $\theta^2$ or $\log\theta$. The notion of "noninformative" depends on the parameterization. Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$ is invariant to reparameterization but can be improper (not integrating to 1). Reference priors and weakly-informative priors are more practical alternatives.
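The parameterization dependence is easy to see by simulation: draw from a flat prior on $\theta$ and look at the induced distribution of $\theta^2$ (sample size illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0.0, 1.0, size=100_000)   # flat prior on theta
phi = theta ** 2                              # induced prior on phi = theta^2

# change of variables gives p(phi) = 1 / (2 sqrt(phi)) on (0, 1): far from flat
hist, _ = np.histogram(phi, bins=10, range=(0.0, 1.0), density=True)
```

The histogram is sharply peaked near 0: being "noninformative" about $\theta$ is strongly informative about $\theta^2$.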

Summary

  • Posterior $\propto$ prior $\times$ likelihood. The posterior is a compromise between prior belief and observed data
  • Conjugate priors give closed-form posteriors: beta-binomial, normal-normal, gamma-Poisson
  • MAP estimation adds $\log p(\theta)$ to the log-likelihood, equivalent to regularization
  • Bernstein-von Mises: the posterior converges to $\mathcal{N}(\hat{\theta}_{\text{MLE}}, 1/(nI(\theta_0)))$ as $n \to \infty$
  • Credible intervals are probability statements about $\theta$; confidence intervals are probability statements about the procedure
  • Bayesian methods are best when data is scarce, priors are informative, or you need full uncertainty quantification

Exercises

ExerciseCore

Problem

You have a coin that you believe is roughly fair. You choose a $\text{Beta}(10, 10)$ prior for the probability $\theta$ of heads. You flip the coin 20 times and observe 15 heads. What is the posterior distribution? What is the posterior mean?

ExerciseAdvanced

Problem

Show that MAP estimation with a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$ and Gaussian likelihood $x_i \sim \mathcal{N}(\theta, \sigma^2)$ is equivalent to ridge regression. What is the effective regularization parameter $\lambda$ in terms of $\sigma^2$ and $\tau^2$?

ExerciseResearch

Problem

The Bernstein-von Mises theorem requires the model to be correctly specified. What happens to the posterior when the model is misspecified? Give a concrete example where the posterior concentrates on a parameter value that is not the "true" parameter, and explain what that parameter represents.

References

Canonical:

  • Berger, Statistical Decision Theory and Bayesian Analysis (1985)
  • Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapters 1-3

Current:

  • McElreath, Statistical Rethinking (2nd ed., 2020)
  • van der Vaart, Asymptotic Statistics (1998), Chapter 10 (Bernstein-von Mises)
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6


Last reviewed: April 2026
