Statistical Estimation
Conjugate Priors
When the prior and likelihood are paired so the posterior stays in the same family as the prior. Definition via exponential families, the standard table (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Normal-inverse-gamma, Gamma-Poisson), worked Normal-Normal updates in 1D and the multivariate case, and the pseudo-observation interpretation that makes conjugacy a feature, not a coincidence.
Why This Matters
A prior is conjugate to a likelihood when the posterior stays in the same parametric family as the prior. When that happens, posterior inference reduces to updating a small number of parameters, sequential updates are trivial (yesterday's posterior is today's prior), and predictive distributions have closed forms. Without conjugacy, the posterior is generally intractable and you need MCMC, variational inference, or Laplace approximation.
Conjugacy is not magic; it is the geometric consequence of working inside the exponential family. Once you see why, the table of standard pairs (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Gamma-Poisson, Normal-inverse-gamma) stops looking like a list to memorize and starts looking like a single fact viewed from different angles.
Mental Model
Two equivalent framings:
- Algebraic: the prior density has the same functional form as a normalized power of the likelihood. Multiplying them keeps the form, only the parameters change.
- Pseudo-observation: the prior carries "imaginary data" that contributed sufficient statistics $(\tau, \nu)$ to your knowledge. Real data contributes $\left(\sum_i T(x_i),\ n\right)$. The posterior is the prior with sufficient statistics $\left(\tau + \sum_i T(x_i),\ \nu + n\right)$: same family, updated parameters.
The second framing is the more useful one. It tells you immediately how strong your prior is (how many pseudo-observations it represents), why a flat prior is the limit of zero pseudo-observations, and what happens as the real sample size grows.
Formal Setup
Conjugate Prior
Let $\mathcal{F} = \{\pi_\eta : \eta \in H\}$ be a parametric family of distributions on $\Theta$, and let $p(x \mid \theta)$ be a likelihood. The family $\mathcal{F}$ is conjugate to the likelihood if and only if, for every $\pi_\eta \in \mathcal{F}$ and every dataset $x_{1:n}$, the posterior

$$\pi(\theta \mid x_{1:n}) \propto \pi_\eta(\theta)\prod_{i=1}^n p(x_i \mid \theta)$$

belongs to $\mathcal{F}$ as well; that is, there exists $\eta' \in H$ with $\pi(\theta \mid x_{1:n}) = \pi_{\eta'}(\theta)$.
The "family" matters: if you let be "all distributions on ", every prior is trivially conjugate. Conjugacy is interesting because is finite-dimensional; usually a textbook family parameterized by a few hyperparameters.
Conjugacy from Exponential Families
The clean theoretical fact: every regular exponential-family likelihood has a natural conjugate prior, constructed by reading off the exponential-family form of the likelihood and reusing its sufficient statistic and log-partition as functions of $\theta$.
Conjugate Prior for an Exponential-Family Likelihood
Statement
Suppose $p(x \mid \theta) = h(x)\exp\big(\eta(\theta)^\top T(x) - A(\theta)\big)$ is an exponential-family likelihood with natural parameter $\eta(\theta)$, sufficient statistic $T(x)$, and log-partition $A(\theta)$. Define the conjugate prior family

$$\pi_{\tau, \nu}(\theta) \propto \exp\big(\eta(\theta)^\top \tau - \nu\, A(\theta)\big)$$

for hyperparameters $\tau \in \mathbb{R}^d$ and $\nu > 0$ (where the prior is normalizable). For i.i.d. observations $x_1, \dots, x_n$, the posterior is in the same family, with updated hyperparameters

$$\tau' = \tau + \sum_{i=1}^n T(x_i), \qquad \nu' = \nu + n.$$
Intuition
The prior has the same exponential-family form, but with $\tau$ playing the role of $\sum_i T(x_i)$ and $\nu$ playing the role of $n$; exactly the sufficient statistics that a dataset would contribute. Combining prior and likelihood adds these contributions linearly. So $\tau$ counts "pseudo sufficient statistics" and $\nu$ counts "pseudo sample size."
Proof Sketch
The posterior up to a constant in $\theta$ is

$$\pi(\theta \mid x_{1:n}) \propto \exp\big(\eta(\theta)^\top \tau - \nu\, A(\theta)\big)\prod_{i=1}^n \exp\big(\eta(\theta)^\top T(x_i) - A(\theta)\big) = \exp\Big(\eta(\theta)^\top \big(\tau + \textstyle\sum_i T(x_i)\big) - (\nu + n)\, A(\theta)\Big).$$

This is exactly $\pi_{\tau', \nu'}$ with $\tau' = \tau + \sum_i T(x_i)$ and $\nu' = \nu + n$. The constant of proportionality is the new normalizer of $\pi_{\tau', \nu'}$.
Why It Matters
This theorem reduces every "what is the conjugate prior?" question to one mechanical operation: write the likelihood in exponential-family form, read off $\eta(\theta)$, $T(x)$, $A(\theta)$, and use them. The Beta is conjugate to Bernoulli/Binomial because the Bernoulli has $p(x \mid \theta) = \theta^x(1-\theta)^{1-x}$ and the Beta is $\pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$; exactly the right functional form. The Dirichlet for multinomial, the Gamma for Poisson, and the Normal-inverse-gamma for Normal-with-unknown-variance follow the same pattern.
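To make the recipe concrete, here is a minimal sketch (assuming numpy and scipy are available; all data and hyperparameter values are made up) that applies the update $\tau' = \tau + \sum_i T(x_i)$, $\nu' = \nu + n$ to the Poisson case, where $T(x) = x$ and the conjugate family is Gamma, and checks the closed-form posterior against a brute-force grid posterior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=50)          # data from Poisson(3.5)

# Exponential-family bookkeeping for Poisson: T(x) = x, nu counts observations.
# Conjugate family: Gamma(alpha, rate beta); alpha plays tau, beta plays nu.
alpha0, beta0 = 2.0, 1.0                   # prior pseudo-counts: 2 events in 1 unit of exposure
alpha_n = alpha0 + x.sum()                 # tau' = tau + sum_i T(x_i)
beta_n = beta0 + len(x)                    # nu'  = nu + n

# Brute-force check: grid posterior proportional to prior * likelihood.
lam = np.linspace(1e-3, 10, 2000)
log_post = stats.gamma.logpdf(lam, a=alpha0, scale=1/beta0)
log_post += stats.poisson.logpmf(x[:, None], lam).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (lam[1] - lam[0])     # normalize numerically

closed_form = stats.gamma.pdf(lam, a=alpha_n, scale=1/beta_n)
print("max density gap:", np.abs(post - closed_form).max())  # ~0 up to grid error
```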
Failure Mode
Two caveats. First, the constructed family is normalizable only for a subset of hyperparameter values; the bookkeeping for which $(\tau, \nu)$ admit a proper prior is part of the analysis (e.g. $\alpha, \beta > 0$ for the Beta, and similar constraints in some inverse-Gamma parameterizations). Second, conjugacy requires regularity of the exponential family; non-regular cases (truncated supports, boundary cases) need ad-hoc analysis. Mixture models are not an exponential family, so the standard construction does not yield a tractable conjugate prior; this is why mixture-of-Gaussian inference needs EM or MCMC.
The Standard Table
The conjugate pairs that account for ~90% of practical Bayesian work:
| Likelihood | Conjugate prior | Posterior hyperparameter update |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | $\alpha' = \alpha + \sum_i x_i$, $\beta' = \beta + n - \sum_i x_i$ |
| Binomial($N, \theta$) | Beta($\alpha, \beta$) | $\alpha' = \alpha + k$, $\beta' = \beta + N - k$ |
| Categorical/Multinomial($\theta_{1:K}$) | Dirichlet($\alpha_{1:K}$) | $\alpha_k' = \alpha_k + c_k$, where $c_k$ is the count of category $k$ |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | $\alpha' = \alpha + \sum_i x_i$, $\beta' = \beta + n$ |
| Exponential($\lambda$) | Gamma($\alpha, \beta$) | $\alpha' = \alpha + n$, $\beta' = \beta + \sum_i x_i$ |
| Geometric($\theta$), trials until success | Beta($\alpha, \beta$) | $\alpha' = \alpha + n$, $\beta' = \beta + \sum_i (x_i - 1)$ |
| Normal($\mu, \sigma^2$), known $\sigma^2$ | Normal($\mu_0, \tau_0^2$) | precision-weighted (see below) |
| Normal($\mu, \sigma^2$), unknown $\sigma^2$ | Normal-Inverse-Gamma | NIG update (see below) |
| Multivariate Normal($\mu, \Sigma$), known $\Sigma$ | Normal($\mu_0, \Sigma_0$) | precision-weighted (multivariate, see below) |
| Multivariate Normal, unknown $\Sigma$ | Normal-Inverse-Wishart | NIW update |
| Linear regression (Gaussian noise) | Normal-Inverse-Gamma (see Bayesian linear regression) | completing-the-square |
Two read-offs that pay rent everywhere downstream:
- Pseudo-observation count. $\alpha + \beta$ in the Beta-Bernoulli is your "prior sample size": a Beta(1, 1) is one pseudo-sample of each outcome (one heads, one tails), a Beta(10, 10) is twenty pseudo-samples. Compare to the real $n$ to gauge how informative your prior is.
- Mean and variance shrinkage. The posterior mean is always a convex combination of the prior mean and the MLE, with weights determined by pseudo-sample size vs. real sample size, as the sketch below shows. The posterior variance shrinks at rate $O(1/(\nu + n))$.
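A quick numerical illustration of both read-offs for the Beta-Bernoulli pair (a sketch; the prior and data values are made up):

```python
import numpy as np

alpha0, beta0 = 10.0, 10.0      # prior: twenty pseudo-observations, mean 0.5
n, k = 40, 28                   # real data: 28 successes in 40 trials

prior_mean = alpha0 / (alpha0 + beta0)
mle = k / n
w = (alpha0 + beta0) / (alpha0 + beta0 + n)   # weight on the prior mean

post_mean = (alpha0 + k) / (alpha0 + beta0 + n)
blend = w * prior_mean + (1 - w) * mle        # convex combination, same number
print(post_mean, blend)                        # 0.6333... twice
```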
Worked Normal-Normal Update (1D, known variance)
This is the algebra you will want to replay for the multivariate case. Let

$$x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, n, \qquad \mu \sim \mathcal{N}(\mu_0, \tau_0^2),$$

with $\sigma^2$ known.

Step 1. Write the log-posterior up to a constant in $\mu$:

$$\log \pi(\mu \mid x_{1:n}) = -\frac{(\mu - \mu_0)^2}{2\tau_0^2} - \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} + \text{const}.$$

Step 2. Collect quadratic and linear terms in $\mu$. The squared terms give

$$-\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)\mu^2.$$

The linear-in-$\mu$ terms give

$$\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right)\mu,$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

Step 3. Apply the completing-the-square recipe (see the multivariate normal page). With precision $\lambda = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$ and linear coefficient $b = \frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}$, the posterior is Gaussian with

$$\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad \tau_n^2 = \lambda^{-1}, \quad \mu_n = \lambda^{-1} b.$$

Equivalently, the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$:

$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{x}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}.$$
Three sanity checks:
- $n = 0$: posterior equals prior.
- $\tau_0^2 \to \infty$ (flat prior): $\mu_n = \bar{x}$, $\tau_n^2 = \sigma^2/n$; exactly the MLE and its sampling variance.
- $n \to \infty$: $\mu_n \to \bar{x}$, $\tau_n^2 \to 0$; the posterior concentrates on the truth at rate $1/n$.
This is the pseudo-observation interpretation: the prior contributes $m = \sigma^2/\tau_0^2$ pseudo-samples, all located at $\mu_0$. The posterior mean is the pooled mean of pseudo and real samples, $\mu_n = (m\mu_0 + n\bar{x})/(m + n)$, weighted by precision.
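A sketch of the 1D update in code, checking the precision-weighted closed form against the pooled-sample interpretation (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                    # known observation variance
mu0, tau0_sq = 0.0, 1.0         # prior N(0, 1)
x = rng.normal(2.0, np.sqrt(sigma2), size=25)
n, xbar = len(x), x.mean()

# Precision-weighted update.
prec_n = 1/tau0_sq + n/sigma2
tau_n_sq = 1/prec_n
mu_n = tau_n_sq * (mu0/tau0_sq + n*xbar/sigma2)

# Pseudo-observation view: the prior is m = sigma2/tau0_sq samples located at mu0.
m = sigma2 / tau0_sq
mu_pooled = (m*mu0 + n*xbar) / (m + n)
print(mu_n, mu_pooled)           # identical
print(tau_n_sq, sigma2/(m + n))  # identical
```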
Multivariate Normal-Normal Update
The same derivation in $\mathbb{R}^d$. Let

$$x_i \mid \mu \sim \mathcal{N}(\mu, \Sigma), \quad i = 1, \dots, n, \qquad \mu \sim \mathcal{N}(\mu_0, \Sigma_0),$$

with $\Sigma$ known. The posterior log-density in $\mu$ is

$$\log \pi(\mu \mid x_{1:n}) = -\tfrac{1}{2}(\mu - \mu_0)^\top \Sigma_0^{-1}(\mu - \mu_0) - \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) + \text{const}.$$

Collect quadratic-in-$\mu$ and linear-in-$\mu$ terms:
- Quadratic coefficient (the precision of the posterior): $\Sigma_0^{-1} + n\Sigma^{-1}$.
- Linear coefficient: $\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{x}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

By the completing-the-square recipe,

$$\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \Sigma_n), \qquad \Sigma_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}, \quad \mu_n = \Sigma_n\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{x}\right).$$
Precision-weighted averaging again, now in the matrix sense. The Bayesian linear regression posterior in the next page is a special case of this where the linear-regression design matrix replaces the identity contributions.
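The matrix version is a direct transcription of the formulas above (a sketch with an arbitrary dimension, covariance, and data):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 100
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])       # known observation covariance
mu0 = np.zeros(d)
Sigma0 = np.eye(d)                         # prior N(0, I)
x = rng.multivariate_normal([1.0, -1.0, 0.5], Sigma, size=n)
xbar = x.mean(axis=0)

prec0 = np.linalg.inv(Sigma0)              # prior precision
prec_lik = n * np.linalg.inv(Sigma)        # data precision
Sigma_n = np.linalg.inv(prec0 + prec_lik)  # posterior covariance
mu_n = Sigma_n @ (prec0 @ mu0 + prec_lik @ xbar)
print(mu_n)  # shrunk toward mu0, close to xbar for n = 100
```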
Worked Beta-Bernoulli Update
The Bernoulli has sufficient statistic $T(x) = x$ and the Beta$(\alpha, \beta)$ density is $\pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. The posterior after $n$ Bernoulli observations with $k$ successes:

$$\pi(\theta \mid x_{1:n}) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\cdot \theta^{k}(1-\theta)^{n-k} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1},$$

so $\theta \mid x_{1:n} \sim \text{Beta}(\alpha + k,\ \beta + n - k)$.
Worked example. A coin you suspect is roughly fair: prior Beta(3, 3) (mode at $0.5$, four pseudo-observations). You flip 10 times and get 7 heads. Posterior Beta(10, 6). Posterior mean $10/16 = 0.625$, posterior mode $9/14 \approx 0.643$. Compare the MLE $7/10 = 0.7$: the posterior pulls toward the prior. With Beta(1, 1) (flat) the posterior would be Beta(8, 4) with mean $8/12 \approx 0.667$ and mode $0.7$ (matching the MLE).
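The same worked example in code, including an equal-tailed credible interval (a sketch using scipy.stats):

```python
from scipy import stats

alpha0, beta0 = 3, 3            # prior Beta(3, 3)
n, k = 10, 7                    # 7 heads in 10 flips
post = stats.beta(alpha0 + k, beta0 + n - k)   # posterior Beta(10, 6)

print(post.mean())              # 0.625
print((10 - 1) / (10 + 6 - 2))  # mode (alpha-1)/(alpha+beta-2) ~= 0.643
print(post.ppf([0.025, 0.975])) # 95% equal-tailed credible interval
```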
Normal-Inverse-Gamma: Unknown Mean and Variance
When the variance is unknown, the conjugate prior for $(\mu, \sigma^2)$ is the Normal-Inverse-Gamma, $\text{NIG}(\mu_0, \kappa_0, \alpha_0, \beta_0)$:

$$\mu \mid \sigma^2 \sim \mathcal{N}\!\big(\mu_0,\ \sigma^2/\kappa_0\big), \qquad \sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0).$$

This is one distribution on the joint pair, not two independent priors. The variance prior is $\text{Inv-Gamma}(\alpha_0, \beta_0)$, and given the variance, the mean prior is Gaussian with variance proportional to $\sigma^2$. The dependence is what makes the update closed-form.

The posterior after $n$ observations is again $\text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$ with parameters

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n}, \qquad \kappa_n = \kappa_0 + n, \qquad \alpha_n = \alpha_0 + \frac{n}{2},$$

$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0\, n\, (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}.$$

The three contributions to $\beta_n$ are: the prior $\beta_0$, the residual variance, and a "mismatch between sample mean and prior mean" penalty. The last term grows when the data and prior disagree about where the mean is, inflating posterior uncertainty about $\sigma^2$.

The marginal posterior on $\mu$ (integrating out $\sigma^2$) is Student's $t$ with $2\alpha_n$ degrees of freedom, mean $\mu_n$, and scale $\sqrt{\beta_n/(\alpha_n\kappa_n)}$. The Student-$t$ tails are what unknown-variance Bayesian intervals correctly account for.
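A sketch of the NIG update and the Student-$t$ marginal on $\mu$ (hyperparameter names follow the formulas above; data and prior values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=30)
n, xbar = len(x), x.mean()

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0   # NIG prior hyperparameters

kappa_n = kappa0 + n
mu_n = (kappa0*mu0 + n*xbar) / kappa_n
alpha_n = alpha0 + n/2
beta_n = (beta0 + 0.5*((x - xbar)**2).sum()
          + kappa0*n*(xbar - mu0)**2 / (2*kappa_n))

# Marginal posterior on mu: Student-t with 2*alpha_n degrees of freedom.
mu_marginal = stats.t(df=2*alpha_n, loc=mu_n,
                      scale=np.sqrt(beta_n / (alpha_n*kappa_n)))
print(mu_marginal.interval(0.95))  # 95% credible interval for mu
```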
Common Confusions
Conjugacy is mathematical convenience, not statistical truth
Choosing a conjugate prior because it makes the algebra easy is fine; pretending it is the right prior because it is conjugate is not. A Beta prior on a click-through rate is mathematically convenient, but if your domain knowledge says click-throughs are usually below 0.05, a Beta concentrated near zero (e.g. Beta(1, 40), with mean $\approx 0.024$) is more honest than a flat one. Conjugacy is a property of the prior-likelihood pair, not of the prior alone. Use it when it lines up with your prior beliefs; reach for non-conjugate priors (with MCMC or VI) when it does not.
A flat prior is the limit of a vague conjugate prior, not the absence of a prior
A Beta(0, 0) or a Normal with infinite variance are improper priors; they don't integrate to 1. Posteriors built from improper priors are valid as long as the posterior itself is proper (integrable). With $n$ Bernoulli observations, $k$ of them successes, and a Beta(0, 0) prior, the posterior is Beta($k$, $n - k$), which is proper as long as $0 < k < n$. Improper priors are useful as "least committal" defaults, but they break when the data alone don't determine the parameter.
The MAP is not the posterior mean; they coincide only for symmetric posteriors
Conjugate priors give closed-form posteriors, but the posterior mean and the posterior mode (MAP) differ whenever the posterior is skewed. For a Beta$(\alpha, \beta)$ posterior, the mean is $\alpha/(\alpha + \beta)$ and the mode is $(\alpha - 1)/(\alpha + \beta - 2)$ (for $\alpha, \beta > 1$). Which one to report depends on the loss function: the posterior mean minimizes squared-error loss, the posterior median minimizes absolute-error loss, and the posterior mode minimizes 0-1 loss. For ML purposes, the mean is almost always the right summary; the mode is convenient mainly because MAP estimation corresponds to penalized MLE.
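For a concrete skew, compare the three summaries on a Beta(2, 8) posterior (a sketch; the posterior is made up):

```python
from scipy import stats

a, b = 2, 8
print(a / (a + b))                # posterior mean: 0.2
print((a - 1) / (a + b - 2))      # posterior mode (MAP): 0.125
print(stats.beta(a, b).median())  # posterior median: ~0.18
```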
Conjugacy is not unique
The same likelihood often has several conjugate prior families. For instance, the Beta is conjugate to the Bernoulli, but so are finite mixtures of Betas, and so is the family of degenerate point masses. The "natural" conjugate is the one minimally parameterized to match the exponential-family structure of the likelihood, with one free parameter per sufficient-statistic dimension plus one for the pseudo-sample size $\nu$. This is what the standard table lists.
Why Conjugacy Stops Being Enough
Conjugacy holds when the prior is a single member of the family. Bayesian work in practice often wants hierarchical priors: the hyperparameters themselves get priors, and the resulting full posterior is no longer in a closed-form family even if each conditional posterior is. This is where conjugacy stops being a complete recipe and where Gibbs sampling, variational inference, and Laplace approximation step in. The conditional posteriors are still conjugate at each level, which is what makes Gibbs samplers efficient: each conditional update is a closed-form draw, as in the sketch below. Empirical Bayes is an alternative that estimates hyperparameters by maximum marginal likelihood instead of placing a prior on them.
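As a concrete instance of conditional conjugacy, here is a minimal Gibbs sampler sketch for a Normal model with unknown mean and variance under independent Normal and Inverse-Gamma priors (a semi-conjugate joint prior, unlike the NIG above; all hyperparameter values are illustrative). Each conditional draw is a closed-form conjugate update:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(3.0, 1.5, size=50)
n, xbar = len(x), x.mean()

m0, v0 = 0.0, 100.0           # semi-conjugate prior: mu ~ N(m0, v0)
a0, b0 = 2.0, 2.0             # sigma^2 ~ Inv-Gamma(a0, b0), independent of mu

mu, sigma2 = xbar, x.var()    # initialize at data estimates
samples = []
for _ in range(5000):
    # mu | sigma2, x is Normal (conjugate conditional, precision-weighted).
    prec = 1/v0 + n/sigma2
    mu = rng.normal((m0/v0 + n*xbar/sigma2) / prec, np.sqrt(1/prec))
    # sigma2 | mu, x is Inverse-Gamma (conjugate conditional).
    a_n = a0 + n/2
    b_n = b0 + 0.5 * ((x - mu)**2).sum()
    sigma2 = 1 / rng.gamma(a_n, 1/b_n)   # InvGamma(a, b) draw = 1 / Gamma(a, rate=b)
    samples.append((mu, sigma2))

mu_draws = np.array([s[0] for s in samples[1000:]])  # drop burn-in
print(mu_draws.mean(), np.percentile(mu_draws, [2.5, 97.5]))
```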
Summary
- Conjugate prior families are closed under Bayesian updating: prior and posterior are in the same parametric family.
- For every regular exponential-family likelihood, the conjugate prior comes for free, with hyperparameters $(\tau, \nu)$ updated to $(\tau + \sum_i T(x_i),\ \nu + n)$.
- Standard pairs to know cold: Beta-Bernoulli, Dirichlet-multinomial, Gamma-Poisson, Normal-Normal (known variance), Normal-Inverse-Gamma (unknown variance), Multivariate-Normal-Normal, Normal-Inverse-Wishart.
- The Normal-Normal update gives a precision-weighted average of prior mean and sample mean, with posterior precision equal to the sum of prior precision and data precision.
- The "pseudo-observation" interpretation tells you how strong your prior is: how many imaginary data points it represents.
- Conjugacy is convenience, not correctness. It is the most useful prior structure when it matches your beliefs and a misleading default when it does not.
Exercises
Problem
You observe 8 successes in 12 Bernoulli trials. Compute the posterior under (a) a Beta(1, 1) prior, (b) a Beta(10, 10) prior. Report the posterior mean and 95% equal-tailed credible interval for each.
Problem
Show the precision-weighted form of the Normal-Normal update by completing the square. Start from $\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$ and $x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \dots, n$ with $\sigma^2$ known, and derive the posterior mean and variance.
Problem
Derive the conjugate prior for a Poisson likelihood from the exponential-family recipe. Write Poisson($\lambda$) in exponential-family form, read off $\eta(\lambda)$, $T(x)$, and $A(\lambda)$, and identify the conjugate family.
Problem
For the multivariate Normal-Normal update with known covariance $\Sigma$, prior $\mu \sim \mathcal{N}(\mu_0, \Sigma_0)$, and observations $x_1, \dots, x_n$, derive the posterior predictive distribution of a new observation $x_{n+1}$.
References
Canonical:
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 2 (single-parameter models), Ch. 3 (multi-parameter models, including the full Normal-Inverse-Gamma analysis).
- Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer. Ch. 4 (conjugate priors and natural exponential families).
- Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian Theory. Wiley. Ch. 4–5.
- Robert, C.P. (2007). The Bayesian Choice, 2nd ed. Springer. Ch. 3 (exponential families and conjugate priors).
- Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer. Ch. 4.
Current:
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 4 (conjugate priors), Ch. 9 (Bayesian linear regression as a Normal-Normal-Inverse-Gamma model).
- McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.4.2–2.4.3 (exponential families and conjugate priors).
Next Topics
- Bayesian linear regression: the Normal-Normal-Inverse-Gamma conjugate model applied to the regression coefficient.
- MAP estimation: posterior mode under a conjugate prior; how conjugate priors give clean MAP-as-regularization equivalences.
- Empirical Bayes vs hierarchical Bayes: what to do when you want a prior over hyperparameters and conjugacy is no longer enough.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Maximum A Posteriori (MAP) Estimation (layer 0B · tier 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
- Bayesian Estimation (layer 0B · tier 2)
Derived topics
- Bayesian Linear Regression (layer 2 · tier 1)
- The EM Algorithm (layer 2 · tier 1)
- Empirical Bayes vs Hierarchical Bayes (layer 2 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)