Statistical Estimation
Bayesian Estimation
The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
Why This Matters
Maximum likelihood estimation gives you a single point estimate of the parameter. Bayesian estimation gives you a full distribution over the parameter. This distribution tells you not just "what is the best guess?" but "how uncertain am I, and in what directions?"
Bayesian methods dominate when you have small samples, informative prior knowledge, or need to quantify uncertainty. They are the foundation of Gaussian processes, Bayesian neural networks, and modern probabilistic programming. The Bernstein-von Mises theorem provides the bridge back to frequentist theory: as the sample size grows, the posterior converges to a Gaussian centered at the MLE.
Mental Model
You start with a prior belief $\pi(\theta)$ about the parameter $\theta$ before seeing any data. You observe data $x$. Bayes rule updates your belief to the posterior $\pi(\theta \mid x)$. The posterior is a compromise between the prior and the likelihood: with little data, the prior dominates; with lots of data, the likelihood dominates and the prior is washed out.
Think of the prior as your starting position and the data as a force that pulls you toward the truth. The posterior is where you end up after the pull. Ignoring the prior leads to the base rate fallacy, one of the most common reasoning errors in applied probability.
Formal Setup and Notation
Let $\theta$ be a parameter, $\pi(\theta)$ a prior distribution, and $p(x \mid \theta)$ the likelihood of observing data $x$ given $\theta$.
Posterior Distribution
The posterior distribution is given by Bayes rule:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}$$

where $p(x) = \int p(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood (evidence). The posterior is proportional to prior times likelihood:

$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$$
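The proportionality can be made concrete numerically. The sketch below (illustrative values, not from the text) discretizes a coin's heads probability onto a grid, multiplies a uniform prior by the binomial likelihood, and normalizes by the evidence:

```python
# Minimal sketch of "posterior ∝ prior × likelihood" on a grid:
# infer a coin's heads probability theta from 7 heads in 10 flips.
grid = [i / 100 for i in range(1, 100)]      # candidate values of theta
prior = [1.0 / len(grid)] * len(grid)        # uniform prior
like = [t**7 * (1 - t)**3 for t in grid]     # binomial likelihood (up to a constant)

unnorm = [p * l for p, l in zip(prior, like)]
evidence = sum(unnorm)                       # marginal likelihood p(x)
post = [u / evidence for u in unnorm]        # normalized posterior on the grid

theta_map = grid[post.index(max(post))]      # posterior mode, near 7/10
```

Note that the evidence is only needed for normalization; the shape of the posterior is fixed by prior times likelihood.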
MAP Estimation
The maximum a posteriori (MAP) estimator is the mode of the posterior:

$$\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \pi(\theta \mid x) = \arg\max_\theta \left[\log p(x \mid \theta) + \log \pi(\theta)\right]$$

MAP is like MLE with a regularization term $\log \pi(\theta)$. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, MAP is equivalent to $\ell_2$-regularized MLE (ridge regression).
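A one-dimensional sketch of this equivalence, with illustrative numbers: for $x_i \sim \mathcal{N}(\mu, \sigma^2)$ and prior $\mu \sim \mathcal{N}(0, \tau^2)$, the closed-form MAP estimate matches the minimizer of a ridge objective with $\lambda = \sigma^2/\tau^2$:

```python
# Sketch: MAP with a zero-mean Gaussian prior equals ridge-regularized MLE.
# Assumed setup: x_i ~ N(mu, sigma^2), known sigma^2; prior mu ~ N(0, tau^2).
sigma2, tau2 = 1.0, 0.25
data = [2.1, 1.9, 2.3, 2.0]
n, xbar = len(data), sum(data) / len(data)

# Closed-form MAP: precision-weighted shrinkage toward the prior mean 0.
mu_map = (n * xbar / sigma2) / (n / sigma2 + 1 / tau2)

# Ridge view: minimize sum_i (x_i - mu)^2 + lam * mu^2 with lam = sigma^2 / tau^2
# (set the derivative to zero: mu = sum(x_i) / (n + lam)).
lam = sigma2 / tau2
mu_ridge = sum(data) / (n + lam)
```

The effective regularization strength $\lambda = \sigma^2/\tau^2$ makes the tradeoff explicit: a tighter prior (small $\tau^2$) means heavier shrinkage toward zero.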
Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior. Conjugacy makes Bayesian updates analytically tractable.
Conjugate Prior Updates
Statement
The three most important conjugate pairs are:
Beta-Binomial. Prior: $p \sim \mathrm{Beta}(\alpha, \beta)$. Likelihood: $k$ successes in $n$ Bernoulli trials. Posterior: $\mathrm{Beta}(\alpha + k,\ \beta + n - k)$.
Normal-Normal. Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Likelihood: $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Posterior: $\mathcal{N}(\mu_n, \tau_n^2)$ where:

$$\mu_n = \frac{\mu_0/\tau^2 + n\bar{x}/\sigma^2}{1/\tau^2 + n/\sigma^2}, \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2}$$
Gamma-Poisson. Prior: $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$. Likelihood: $x_1, \dots, x_n \sim \mathrm{Poisson}(\lambda)$. Posterior: $\mathrm{Gamma}\left(\alpha + \sum_i x_i,\ \beta + n\right)$.
Intuition
In each case, the prior parameters act as "pseudo-observations." For the beta-binomial, $\alpha$ acts like prior successes and $\beta$ like prior failures. The posterior combines these with the actual observed counts. As $n \to \infty$, the prior contribution vanishes and the posterior concentrates on the MLE.
Why It Matters
Conjugate priors are the workhorses of practical Bayesian analysis. They give closed-form posteriors, making inference fast and interpretable. Even when the true prior is not conjugate, conjugate families serve as tractable approximations.
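The updates above reduce to a few lines of arithmetic. A minimal sketch of two of them, with illustrative function names and numbers:

```python
# Sketch of conjugate updates as pseudo-count arithmetic (names illustrative).
def beta_binomial(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta posterior."""
    return alpha + k, beta + (n - k)

def gamma_poisson(alpha, beta, counts):
    """Gamma(alpha, beta) prior + Poisson observations -> Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)

# Beta(2, 2) prior (2 pseudo-successes, 2 pseudo-failures), then 15 heads in 20 flips.
a_post, b_post = beta_binomial(2, 2, 15, 20)
post_mean = a_post / (a_post + b_post)   # 17 / 24, between prior mean 0.5 and MLE 0.75
```

The posterior mean landing between the prior mean and the MLE is the "compromise" from the mental model made exact.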
The Normal-Normal Update in Detail
The normal-normal case is particularly revealing. The posterior mean is a precision-weighted average of the prior mean and the data mean:

$$\mu_n = \frac{\lambda_0\,\mu_0 + n\lambda\,\bar{x}}{\lambda_0 + n\lambda}$$

where $\lambda_0 = 1/\tau^2$ is the prior precision and $\lambda = 1/\sigma^2$ is the data precision per observation. The posterior precision is the sum $\lambda_0 + n\lambda$. More data means more precision. A tighter prior (large $\lambda_0$) means the prior has more influence.

With $n = 0$ (no data), the posterior equals the prior. As $n \to \infty$, the posterior mean approaches $\bar{x}$ and the posterior variance approaches $\sigma^2/n$, recovering the MLE and its sampling distribution.
Posterior Consistency: Bernstein-von Mises
Bernstein-von Mises Theorem
Statement
Under regularity conditions, as $n \to \infty$, the posterior distribution converges in total variation to a Gaussian centered at the MLE:

$$\pi(\theta \mid x_{1:n}) \approx \mathcal{N}\!\left(\hat\theta_{\mathrm{MLE}},\ \frac{1}{n\,I(\theta_0)}\right)$$

where $I(\theta_0)$ is the Fisher information at the true parameter $\theta_0$. The posterior concentrates around the true parameter at rate $1/\sqrt{n}$ and its width matches the frequentist standard error.
Intuition
With enough data, the likelihood overwhelms the prior and the posterior becomes approximately Gaussian centered at the MLE. The prior does not matter asymptotically (as long as it puts positive mass near the truth). Bayesian and frequentist inference agree in the large-sample limit.
Proof Sketch
Expand the log-posterior around the MLE using a Taylor expansion. The log-likelihood term dominates the log-prior for large $n$. The quadratic approximation to the log-likelihood gives a Gaussian with precision $n\,I(\hat\theta_{\mathrm{MLE}})$. The prior contributes a term of order $O(1)$ that becomes negligible compared to the $O(n)$ likelihood term.
Why It Matters
This theorem bridges Bayesian and frequentist statistics. It justifies using Bayesian credible intervals as approximate confidence intervals in large samples. It also shows that Bayesian inference is consistent: the posterior concentrates on the truth.
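The theorem can be checked numerically in the Bernoulli model, where the posterior is available exactly. A sketch with an assumed flat Beta(1, 1) prior: the exact posterior standard deviation should approach the frequentist standard error $\sqrt{\hat{p}(1-\hat{p})/n} = 1/\sqrt{n\,I(\hat{p})}$ as $n$ grows.

```python
import math, random

# Numerical sketch of Bernstein-von Mises for Bernoulli data with a
# flat Beta(1, 1) prior (illustrative values).
random.seed(0)
p_true, n = 0.3, 5000
k = sum(random.random() < p_true for _ in range(n))   # number of successes

a, b = 1 + k, 1 + n - k                               # exact Beta posterior
post_var = a * b / ((a + b) ** 2 * (a + b + 1))       # Beta variance formula
post_sd = math.sqrt(post_var)

p_hat = k / n
se_mle = math.sqrt(p_hat * (1 - p_hat) / n)           # 1 / sqrt(n * I(p_hat))

rel_gap = abs(post_sd - se_mle) / se_mle              # shrinks like O(1/n)
```

Rerunning with smaller $n$ shows the gap widening, which is exactly where the prior still matters.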
Failure Mode
The theorem requires the model to be correctly specified (the true parameter $\theta_0$ is in the model family). If the model is misspecified, the posterior concentrates on the KL-closest parameter to the truth, not the truth itself. The theorem also fails for non-regular models, infinite-dimensional parameters (nonparametric Bayesian models require separate theory), and improper priors in some cases.
Credible Intervals vs Confidence Intervals
A 95% credible interval $C$ satisfies $P(\theta \in C \mid x) = 0.95$. It is a statement about $\theta$ given the data.
A 95% confidence interval satisfies: if you repeat the experiment many times, 95% of the intervals will contain the true $\theta$. It is a statement about the procedure, not about $\theta$ for the observed data.
These are structurally different interpretations. However, by Bernstein-von Mises, Bayesian credible intervals and frequentist confidence intervals coincide asymptotically. In finite samples they can differ, especially when the prior is informative.
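The asymptotic agreement is easy to see numerically. A sketch (illustrative data; Monte Carlo draws stand in for exact Beta quantiles) comparing a 95% equal-tailed credible interval under a flat prior with the 95% Wald confidence interval for a binomial proportion:

```python
import math, random

# Sketch: for large n, a flat-prior credible interval and a Wald confidence
# interval for a binomial proportion nearly coincide (illustrative data).
random.seed(1)
n, k = 2000, 1200                              # observed: 1200 successes in 2000 trials

# 95% equal-tailed credible interval via draws from the exact Beta posterior.
draws = sorted(random.betavariate(1 + k, 1 + n - k) for _ in range(100_000))
cred = (draws[2_500], draws[97_500])           # empirical 2.5% and 97.5% quantiles

# 95% Wald confidence interval centered at the MLE.
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
conf = (p_hat - 1.96 * se, p_hat + 1.96 * se)

gap = max(abs(cred[0] - conf[0]), abs(cred[1] - conf[1]))   # small for large n
```

With a small $n$ or a strongly informative prior, the same comparison shows the two intervals pulling apart, as the text notes.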
When Bayesian is Better
Bayesian estimation shines in several settings:
- Small samples. The prior regularizes estimation when data is scarce. Without a prior, MLE can overfit or be undefined
- Informative priors. When domain knowledge constrains the parameter (e.g., a probability must be near 0.5, a physical constant is known approximately), the prior encodes this and improves the estimate
- Uncertainty quantification. The full posterior gives calibrated uncertainty bands, not just a point estimate. This is critical for decision making under uncertainty
- Hierarchical models. Bayesian methods naturally handle multi-level structure where parameters at one level serve as priors for the next
Common Confusions
The prior is not arbitrary
A common criticism is that the prior is "subjective." But the prior can be chosen systematically: use domain knowledge, previous studies, or weakly-informative priors that regularize without strongly constraining. By Bernstein-von Mises, reasonable priors all lead to the same posterior with enough data. The choice matters most when data is scarce, which is exactly when you should use prior knowledge. The Monty Hall problem is a classic example where ignoring the prior (uniform over doors) and failing to update correctly leads to the wrong answer.
MAP is not full Bayesian inference
MAP gives a point estimate (the posterior mode). Full Bayesian inference uses the entire posterior distribution. MAP ignores posterior uncertainty and can give misleading results when the posterior is skewed or multimodal. Use the posterior mean or full posterior for uncertainty quantification.
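A skewed posterior makes the gap concrete. Sketch, using an assumed Gamma posterior (closed-form mode and mean):

```python
# Sketch: for a right-skewed Gamma(a, b) posterior, the MAP (mode) and the
# posterior mean disagree, so a MAP point estimate hides the skew.
a, b = 2.0, 1.0
map_est = (a - 1) / b    # mode of Gamma(a, b), valid for a >= 1
mean_est = a / b         # posterior mean
```

Here the mode is half the mean; any decision based on the MAP alone ignores the heavy right tail.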
Flat priors are not always noninformative
A flat (uniform) prior on $\theta$ is not flat on $\log\theta$ or $\theta^2$. The notion of "noninformative" depends on the parameterization. Jeffreys prior is invariant to reparameterization but can be improper (not integrating to 1). Reference priors and weakly-informative priors are more practical alternatives.
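The parameterization-dependence takes three lines to demonstrate by simulation:

```python
import random

# Sketch: a flat prior is not parameterization-invariant. If theta is
# uniform on (0, 1), then phi = theta**2 is NOT uniform: half its mass
# lies below 0.25, because theta < 0.5 half the time.
random.seed(0)
draws = [random.random() ** 2 for _ in range(100_000)]
frac_below = sum(d < 0.25 for d in draws) / len(draws)
# frac_below is near 0.5, not the 0.25 a uniform phi would give.
```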
Summary
- Posterior $\propto$ prior $\times$ likelihood. The posterior is a compromise between prior belief and observed data
- Conjugate priors give closed-form posteriors: beta-binomial, normal-normal, gamma-Poisson
- MAP estimation adds $\log \pi(\theta)$ to the log-likelihood; with a Gaussian prior this is equivalent to $\ell_2$ regularization
- Bernstein-von Mises: the posterior converges to $\mathcal{N}\left(\hat\theta_{\mathrm{MLE}},\ (n\,I(\theta_0))^{-1}\right)$ as $n \to \infty$
- Credible intervals are probability statements about $\theta$; confidence intervals are probability statements about the procedure
- Bayesian methods are best when data is scarce, priors are informative, or you need full uncertainty quantification
Exercises
Problem
You have a coin that you believe is roughly fair. You choose a $\mathrm{Beta}(\alpha, \beta)$ prior for the probability of heads $p$ (symmetric, reflecting rough fairness). You flip the coin 20 times and observe 15 heads. What is the posterior distribution? What is the posterior mean?
Problem
Show that MAP estimation with a Gaussian prior and Gaussian likelihood is equivalent to ridge regression. What is the effective regularization parameter in terms of $\sigma^2$ and $\tau^2$?
Problem
The Bernstein-von Mises theorem requires the model to be correctly specified. What happens to the posterior when the model is misspecified? Give a concrete example where the posterior concentrates on a parameter value that is not the "true" parameter, and explain what that parameter represents.
References
Canonical:
- Berger, Statistical Decision Theory and Bayesian Analysis (1985)
- Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapters 1-3
Current:
- McElreath, Statistical Rethinking (2nd ed., 2020)
- van der Vaart, Asymptotic Statistics (1998), Chapter 10 (Bernstein-von Mises)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
The natural next steps from Bayesian estimation:
- Gaussian processes for ML: nonparametric Bayesian inference over functions
- Variational autoencoders: variational inference when the posterior is intractable
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
Builds on This
- Anthropic Bias and Observation Selection (Layer 3)
- Bayesian Neural Networks (Layer 3)
- Bayesian State Estimation (Layer 2)
- Causal Inference and the Ladder of Causation (Layer 3)
- Decision Theory Foundations (Layer 2)
- Detection Theory (Layer 2)
- Meta-Analysis (Layer 2)
- PAC-Bayes Bounds (Layer 3)
- Small Area Estimation (Layer 3)