Statistical Estimation
Conjugate Priors
When the prior and likelihood are paired so the posterior stays in the same family as the prior. Definition via exponential families, the standard table (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Normal-inverse-gamma, Gamma-Poisson), worked Normal-Normal updates in 1D and the multivariate case, and the pseudo-observation interpretation that makes conjugacy a feature, not a coincidence.
Why This Matters
A prior is conjugate to a likelihood when the posterior stays in the same parametric family as the prior. When that happens, posterior inference reduces to updating a small number of parameters, sequential updates are trivial (yesterday's posterior is today's prior), and predictive distributions have closed forms. Without conjugacy, the posterior is generally intractable and you need MCMC, variational inference, or Laplace approximation.
Conjugacy is not magic; it is the geometric consequence of working inside the exponential family. Once you see why, the table of standard pairs (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Gamma-Poisson, Normal-inverse-gamma) stops looking like a list to memorize and starts looking like a single fact viewed from different angles.
Mental Model
Two equivalent framings:
- Algebraic: the prior density has the same functional form as a normalized power of the likelihood. Multiplying them keeps the form, only the parameters change.
- Pseudo-observation: the prior carries "imaginary data" that contributed sufficient statistics $(\tau, \nu)$ to your knowledge. Real data contributes $\left(\sum_i T(x_i),\ n\right)$. The posterior is the prior with sufficient statistics $\left(\tau + \sum_i T(x_i),\ \nu + n\right)$: same family, updated parameters.
The second framing is the more useful one. It tells you immediately how strong your prior is (how many pseudo-observations it represents), why a flat prior is the limit of zero pseudo-observations, and what happens as the real sample size grows.
Formal Setup
Conjugate Prior
Let $\mathcal{F} = \{\pi_\eta : \eta \in H\}$ be a parametric family of distributions on $\Theta$, and let $p(x \mid \theta)$ be a likelihood. The family $\mathcal{F}$ is conjugate to the likelihood if and only if, for every $\pi_\eta \in \mathcal{F}$ and every dataset $x_{1:n}$, the posterior

$$\pi(\theta \mid x_{1:n}) \propto \pi_\eta(\theta)\prod_{i=1}^n p(x_i \mid \theta)$$

belongs to $\mathcal{F}$ as well; that is, there exists $\eta' \in H$ with $\pi(\theta \mid x_{1:n}) = \pi_{\eta'}(\theta)$.
The "family" matters: if you let be "all distributions on ", every prior is trivially conjugate. Conjugacy is interesting because is finite-dimensional; usually a textbook family parameterized by a few hyperparameters.
Conjugacy from Exponential Families
The clean theoretical fact: every regular exponential-family likelihood has a natural conjugate prior, constructed by reading off the exponential-family form of the likelihood and reusing its sufficient statistic and log-partition as functions of $\theta$.
Conjugate Prior for an Exponential-Family Likelihood
Statement
Suppose $p(x \mid \theta) = h(x)\exp\big(\eta(\theta)^\top T(x) - A(\theta)\big)$ is an exponential-family likelihood with natural parameter $\eta(\theta)$, sufficient statistic $T(x)$, and log-partition $A(\theta)$. Define the conjugate prior family

$$\pi_{\tau, \nu}(\theta) \propto \exp\big(\eta(\theta)^\top \tau - \nu\, A(\theta)\big)$$

for hyperparameters $\tau \in \mathbb{R}^d$ and $\nu > 0$ (where the prior is normalizable). For i.i.d. observations $x_1, \dots, x_n$, the posterior is in the same family, with updated hyperparameters

$$\tau' = \tau + \sum_{i=1}^n T(x_i), \qquad \nu' = \nu + n.$$
Intuition
The prior has the same exponential-family form, but with $\tau$ playing the role of $\sum_i T(x_i)$ and $\nu$ playing the role of $n$; exactly the sufficient statistics that a dataset would contribute. Combining prior and likelihood adds these contributions linearly. So $\tau$ counts "pseudo sufficient statistics" and $\nu$ counts "pseudo sample size."
Proof Sketch
The posterior up to a constant in $\theta$ is

$$\pi(\theta \mid x_{1:n}) \propto \exp\big(\eta(\theta)^\top \tau - \nu\, A(\theta)\big)\prod_{i=1}^n \exp\big(\eta(\theta)^\top T(x_i) - A(\theta)\big) = \exp\Big(\eta(\theta)^\top \big(\tau + \textstyle\sum_i T(x_i)\big) - (\nu + n)\, A(\theta)\Big).$$

This is exactly $\pi_{\tau', \nu'}$ with $\tau' = \tau + \sum_i T(x_i)$ and $\nu' = \nu + n$. The constant of proportionality is the new normalizer of $\pi_{\tau', \nu'}$.
Why It Matters
This theorem reduces every "what is the conjugate prior?" question to one mechanical operation: write the likelihood in exponential-family form, read off $\eta(\theta)$, $T(x)$, $A(\theta)$, and use them. The Beta is conjugate to Bernoulli/Binomial because the Bernoulli has $p(x \mid \theta) = \theta^x(1-\theta)^{1-x}$ and the Beta is $\pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$; exactly the right functional form. The Dirichlet for multinomial, the Gamma for Poisson, and the Normal-inverse-gamma for Normal-with-unknown-variance follow the same pattern.
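To make the recipe concrete, here is a minimal sketch (assuming numpy and scipy are available; all data and hyperparameter values are made up) that applies the update $\tau' = \tau + \sum_i T(x_i)$, $\nu' = \nu + n$ to the Poisson case, where $T(x) = x$ and the conjugate family is Gamma, and checks the closed-form posterior against a brute-force grid posterior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=50)          # data from Poisson(3.5)

# Exponential-family bookkeeping for Poisson: T(x) = x, nu counts observations.
# Conjugate family: Gamma(alpha, rate beta); alpha plays tau, beta plays nu.
alpha0, beta0 = 2.0, 1.0                   # prior pseudo-counts: 2 events in 1 unit of exposure
alpha_n = alpha0 + x.sum()                 # tau' = tau + sum_i T(x_i)
beta_n = beta0 + len(x)                    # nu'  = nu + n

# Brute-force check: grid posterior proportional to prior * likelihood.
lam = np.linspace(1e-3, 10, 2000)
log_post = stats.gamma.logpdf(lam, a=alpha0, scale=1/beta0)
log_post += stats.poisson.logpmf(x[:, None], lam).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (lam[1] - lam[0])     # normalize numerically

closed_form = stats.gamma.pdf(lam, a=alpha_n, scale=1/beta_n)
print("max density gap:", np.abs(post - closed_form).max())  # ~0 up to grid error
```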
Failure Mode
Two caveats. First, the constructed family is normalizable only for a subset of hyperparameter values; the bookkeeping for which $(\tau, \nu)$ admit a proper prior is part of the analysis (e.g. $\alpha, \beta > 0$ for the Beta, and similar constraints in some inverse-Gamma parameterizations). Second, conjugacy requires regularity of the exponential family; non-regular cases (truncated supports, boundary cases) need ad-hoc analysis. Mixture models are not an exponential family, so the standard construction does not yield a tractable conjugate prior; this is why mixture-of-Gaussian inference needs EM or MCMC.
The Standard Table
The conjugate pairs that account for ~90% of practical Bayesian work:
| Likelihood | Conjugate prior | Posterior hyperparameter update |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | $\alpha' = \alpha + \sum_i x_i$, $\beta' = \beta + n - \sum_i x_i$ |
| Binomial($N, \theta$) | Beta($\alpha, \beta$) | $\alpha' = \alpha + k$, $\beta' = \beta + N - k$ |
| Categorical/Multinomial($\theta_{1:K}$) | Dirichlet($\alpha_{1:K}$) | $\alpha_k' = \alpha_k + c_k$, where $c_k$ is the count of category $k$ |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | $\alpha' = \alpha + \sum_i x_i$, $\beta' = \beta + n$ |
| Exponential($\lambda$) | Gamma($\alpha, \beta$) | $\alpha' = \alpha + n$, $\beta' = \beta + \sum_i x_i$ |
| Geometric($\theta$), trials until success | Beta($\alpha, \beta$) | $\alpha' = \alpha + n$, $\beta' = \beta + \sum_i (x_i - 1)$ |
| Normal($\mu, \sigma^2$), known $\sigma^2$ | Normal($\mu_0, \tau_0^2$) | precision-weighted (see below) |
| Normal($\mu, \sigma^2$), unknown $\sigma^2$ | Normal-Inverse-Gamma | NIG update (see below) |
| Multivariate Normal($\mu, \Sigma$), known $\Sigma$ | Normal($\mu_0, \Sigma_0$) | precision-weighted (multivariate, see below) |
| Multivariate Normal, unknown $\Sigma$ | Normal-Inverse-Wishart | NIW update |
| Linear regression (Gaussian noise) | Normal-Inverse-Gamma (see Bayesian linear regression) | completing-the-square |
Two read-offs that pay rent everywhere downstream:
- Pseudo-observation count. $\alpha + \beta$ in the Beta-Bernoulli is your "prior sample size": a Beta(1, 1) is one pseudo-sample of each outcome (one heads, one tails), a Beta(10, 10) is twenty pseudo-samples. Compare to the real $n$ to gauge how informative your prior is.
- Mean and variance shrinkage. The posterior mean is always a convex combination of the prior mean and the MLE, with weights determined by pseudo-sample size vs. real sample size, as the sketch below shows. The posterior variance shrinks at rate $O(1/(\nu + n))$.
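A quick numerical illustration of both read-offs for the Beta-Bernoulli pair (a sketch; the prior and data values are made up):

```python
import numpy as np

alpha0, beta0 = 10.0, 10.0      # prior: twenty pseudo-observations, mean 0.5
n, k = 40, 28                   # real data: 28 successes in 40 trials

prior_mean = alpha0 / (alpha0 + beta0)
mle = k / n
w = (alpha0 + beta0) / (alpha0 + beta0 + n)   # weight on the prior mean

post_mean = (alpha0 + k) / (alpha0 + beta0 + n)
blend = w * prior_mean + (1 - w) * mle        # convex combination, same number
print(post_mean, blend)                        # 0.6333... twice
```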
Worked Normal-Normal Update (1D, known variance)
This is the algebra you will want to replay for the multivariate case. Let

$$x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, n, \qquad \mu \sim \mathcal{N}(\mu_0, \tau_0^2),$$

with $\sigma^2$ known.

Step 1. Write the log-posterior up to a constant in $\mu$:

$$\log \pi(\mu \mid x_{1:n}) = -\frac{(\mu - \mu_0)^2}{2\tau_0^2} - \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} + \text{const}.$$

Step 2. Collect quadratic and linear terms in $\mu$. The squared terms give

$$-\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)\mu^2.$$

The linear-in-$\mu$ terms give

$$\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right)\mu,$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

Step 3. Apply the completing-the-square recipe (see the multivariate normal page). With precision $\lambda = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$ and linear coefficient $b = \frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}$, the posterior is Gaussian with

$$\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad \tau_n^2 = \lambda^{-1}, \quad \mu_n = \lambda^{-1} b.$$

Equivalently, the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$:

$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{x}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}.$$
Three sanity checks:
- $n = 0$: posterior equals prior.
- $\tau_0^2 \to \infty$ (flat prior): $\mu_n = \bar{x}$, $\tau_n^2 = \sigma^2/n$; exactly the MLE and its sampling variance.
- $n \to \infty$: $\mu_n \to \bar{x}$, $\tau_n^2 \to 0$; the posterior concentrates on the truth at rate $1/n$.
This is the pseudo-observation interpretation: the prior contributes $m = \sigma^2/\tau_0^2$ pseudo-samples, all located at $\mu_0$. The posterior mean is the pooled mean of pseudo and real samples, $\mu_n = (m\mu_0 + n\bar{x})/(m + n)$, weighted by precision.
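A sketch of the 1D update in code, checking the precision-weighted closed form against the pooled-sample interpretation (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                    # known observation variance
mu0, tau0_sq = 0.0, 1.0         # prior N(0, 1)
x = rng.normal(2.0, np.sqrt(sigma2), size=25)
n, xbar = len(x), x.mean()

# Precision-weighted update.
prec_n = 1/tau0_sq + n/sigma2
tau_n_sq = 1/prec_n
mu_n = tau_n_sq * (mu0/tau0_sq + n*xbar/sigma2)

# Pseudo-observation view: the prior is m = sigma2/tau0_sq samples located at mu0.
m = sigma2 / tau0_sq
mu_pooled = (m*mu0 + n*xbar) / (m + n)
print(mu_n, mu_pooled)           # identical
print(tau_n_sq, sigma2/(m + n))  # identical
```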
Multivariate Normal-Normal Update
The same derivation in $\mathbb{R}^d$. Let

$$x_i \mid \mu \sim \mathcal{N}(\mu, \Sigma), \quad i = 1, \dots, n, \qquad \mu \sim \mathcal{N}(\mu_0, \Sigma_0),$$

with $\Sigma$ known. The posterior log-density in $\mu$ is

$$\log \pi(\mu \mid x_{1:n}) = -\tfrac{1}{2}(\mu - \mu_0)^\top \Sigma_0^{-1}(\mu - \mu_0) - \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) + \text{const}.$$

Collect quadratic-in-$\mu$ and linear-in-$\mu$ terms:
- Quadratic coefficient (the precision of the posterior): $\Sigma_0^{-1} + n\Sigma^{-1}$.
- Linear coefficient: $\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{x}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

By the completing-the-square recipe,

$$\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \Sigma_n), \qquad \Sigma_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}, \quad \mu_n = \Sigma_n\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{x}\right).$$
Precision-weighted averaging again, now in the matrix sense. The Bayesian linear regression posterior in the next page is a special case of this where the linear-regression design matrix replaces the identity contributions.
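The matrix version is a direct transcription of the formulas above (a sketch with an arbitrary dimension, covariance, and data):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 100
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])       # known observation covariance
mu0 = np.zeros(d)
Sigma0 = np.eye(d)                         # prior N(0, I)
x = rng.multivariate_normal([1.0, -1.0, 0.5], Sigma, size=n)
xbar = x.mean(axis=0)

prec0 = np.linalg.inv(Sigma0)              # prior precision
prec_lik = n * np.linalg.inv(Sigma)        # data precision
Sigma_n = np.linalg.inv(prec0 + prec_lik)  # posterior covariance
mu_n = Sigma_n @ (prec0 @ mu0 + prec_lik @ xbar)
print(mu_n)  # shrunk toward mu0, close to xbar for n = 100
```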
Worked Beta-Bernoulli Update
The Bernoulli has sufficient statistic $T(x) = x$ and the Beta$(\alpha, \beta)$ density is $\pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. The posterior after $n$ Bernoulli observations with $k$ successes:

$$\pi(\theta \mid x_{1:n}) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\cdot \theta^{k}(1-\theta)^{n-k} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1},$$

so $\theta \mid x_{1:n} \sim \text{Beta}(\alpha + k,\ \beta + n - k)$.
Worked example. A coin you suspect is roughly fair: prior Beta(3, 3) (mode at $0.5$, four pseudo-observations). You flip 10 times and get 7 heads. Posterior Beta(10, 6). Posterior mean $10/16 = 0.625$, posterior mode $9/14 \approx 0.643$. Compare the MLE $7/10 = 0.7$: the posterior pulls toward the prior. With Beta(1, 1) (flat) the posterior would be Beta(8, 4) with mean $8/12 \approx 0.667$ and mode $0.7$ (matching the MLE).
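The same worked example in code, including an equal-tailed credible interval (a sketch using scipy.stats):

```python
from scipy import stats

alpha0, beta0 = 3, 3            # prior Beta(3, 3)
n, k = 10, 7                    # 7 heads in 10 flips
post = stats.beta(alpha0 + k, beta0 + n - k)   # posterior Beta(10, 6)

print(post.mean())              # 0.625
print((10 - 1) / (10 + 6 - 2))  # mode (alpha-1)/(alpha+beta-2) ~= 0.643
print(post.ppf([0.025, 0.975])) # 95% equal-tailed credible interval
```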
Normal-Inverse-Gamma: Unknown Mean and Variance
When the variance is unknown, the conjugate prior for $(\mu, \sigma^2)$ is the Normal-Inverse-Gamma, $\text{NIG}(\mu_0, \kappa_0, \alpha_0, \beta_0)$:

$$\mu \mid \sigma^2 \sim \mathcal{N}\!\big(\mu_0,\ \sigma^2/\kappa_0\big), \qquad \sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0).$$

This is one distribution on the joint pair, not two independent priors. The variance prior is $\text{Inv-Gamma}(\alpha_0, \beta_0)$, and given the variance, the mean prior is Gaussian with variance proportional to $\sigma^2$. The dependence is what makes the update closed-form.

The posterior after $n$ observations is again $\text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$ with parameters

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n}, \qquad \kappa_n = \kappa_0 + n, \qquad \alpha_n = \alpha_0 + \frac{n}{2},$$

$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0\, n\, (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}.$$

The three contributions to $\beta_n$ are: the prior $\beta_0$, the residual variance, and a "mismatch between sample mean and prior mean" penalty. The last term grows when the data and prior disagree about where the mean is, inflating posterior uncertainty about $\sigma^2$.

The marginal posterior on $\mu$ (integrating out $\sigma^2$) is Student's $t$ with $2\alpha_n$ degrees of freedom, mean $\mu_n$, and scale $\sqrt{\beta_n/(\alpha_n\kappa_n)}$. The Student-$t$ tails are what unknown-variance Bayesian intervals correctly account for.
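A sketch of the NIG update and the Student-$t$ marginal on $\mu$ (hyperparameter names follow the formulas above; data and prior values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=30)
n, xbar = len(x), x.mean()

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0   # NIG prior hyperparameters

kappa_n = kappa0 + n
mu_n = (kappa0*mu0 + n*xbar) / kappa_n
alpha_n = alpha0 + n/2
beta_n = (beta0 + 0.5*((x - xbar)**2).sum()
          + kappa0*n*(xbar - mu0)**2 / (2*kappa_n))

# Marginal posterior on mu: Student-t with 2*alpha_n degrees of freedom.
mu_marginal = stats.t(df=2*alpha_n, loc=mu_n,
                      scale=np.sqrt(beta_n / (alpha_n*kappa_n)))
print(mu_marginal.interval(0.95))  # 95% credible interval for mu
```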
Common Confusions
Conjugacy is mathematical convenience, not statistical truth
Choosing a conjugate prior because it makes the algebra easy is fine; pretending it is the right prior because it is conjugate is not. A Beta prior on a click-through rate is mathematically convenient, but if your domain knowledge says click-throughs are usually below 0.05, a Beta concentrated near zero (e.g. Beta(1, 40), with mean $\approx 0.024$) is more honest than a flat one. Conjugacy is a property of the prior-likelihood pair, not of the prior alone. Use it when it lines up with your prior beliefs; reach for non-conjugate priors (with MCMC or VI) when it does not.
A flat prior is the limit of a vague conjugate prior, not the absence of a prior
A Beta(0, 0) or a Normal with infinite variance are improper priors; they don't integrate to 1. Posteriors built from improper priors are valid as long as the posterior itself is proper (integrable). With $n$ Bernoulli observations, $k$ of them successes, and a Beta(0, 0) prior, the posterior is Beta($k$, $n - k$), which is proper as long as $0 < k < n$. Improper priors are useful as "least committal" defaults, but they break when the data alone don't determine the parameter.
The MAP is not the posterior mean; they coincide only for symmetric posteriors
Conjugate priors give closed-form posteriors, but the posterior mean and the posterior mode (MAP) differ whenever the posterior is skewed. For a Beta$(\alpha, \beta)$ posterior, the mean is $\alpha/(\alpha + \beta)$ and the mode is $(\alpha - 1)/(\alpha + \beta - 2)$ (for $\alpha, \beta > 1$). Which one to report depends on the loss function: the posterior mean minimizes squared-error loss, the posterior median minimizes absolute-error loss, and the posterior mode minimizes 0-1 loss. For ML purposes, the mean is almost always the right summary; the mode is convenient mainly because MAP estimation corresponds to penalized MLE.
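For a concrete skew, compare the three summaries on a Beta(2, 8) posterior (a sketch; the posterior is made up):

```python
from scipy import stats

a, b = 2, 8
print(a / (a + b))                # posterior mean: 0.2
print((a - 1) / (a + b - 2))      # posterior mode (MAP): 0.125
print(stats.beta(a, b).median())  # posterior median: ~0.18
```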
Conjugacy is not unique
The same likelihood often has several conjugate prior families. For instance, the Beta is conjugate to the Bernoulli, but so are finite mixtures of Betas, and so is the family of degenerate point masses. The "natural" conjugate is the one minimally parameterized to match the exponential-family structure of the likelihood, with one free parameter per sufficient-statistic dimension plus one for the pseudo-sample size $\nu$. This is what the standard table lists.
Why Conjugacy Stops Being Enough
Conjugacy holds when the prior is a single member of the family. Bayesian work in practice often wants hierarchical priors: the hyperparameters themselves get priors, and the resulting full posterior is no longer in a closed-form family even if each conditional posterior is. This is where conjugacy stops being a complete recipe and where Gibbs sampling, variational inference, and Laplace approximation step in. The conditional posteriors are still conjugate at each level, which is what makes Gibbs samplers efficient: each conditional update is a closed-form draw, as in the sketch below. Empirical Bayes is an alternative that estimates hyperparameters by maximum marginal likelihood instead of placing a prior on them.
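As a concrete instance of conditional conjugacy, here is a minimal Gibbs sampler sketch for a Normal model with unknown mean and variance under independent Normal and Inverse-Gamma priors (a semi-conjugate joint prior, unlike the NIG above; all hyperparameter values are illustrative). Each conditional draw is a closed-form conjugate update:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(3.0, 1.5, size=50)
n, xbar = len(x), x.mean()

m0, v0 = 0.0, 100.0           # semi-conjugate prior: mu ~ N(m0, v0)
a0, b0 = 2.0, 2.0             # sigma^2 ~ Inv-Gamma(a0, b0), independent of mu

mu, sigma2 = xbar, x.var()    # initialize at data estimates
samples = []
for _ in range(5000):
    # mu | sigma2, x is Normal (conjugate conditional, precision-weighted).
    prec = 1/v0 + n/sigma2
    mu = rng.normal((m0/v0 + n*xbar/sigma2) / prec, np.sqrt(1/prec))
    # sigma2 | mu, x is Inverse-Gamma (conjugate conditional).
    a_n = a0 + n/2
    b_n = b0 + 0.5 * ((x - mu)**2).sum()
    sigma2 = 1 / rng.gamma(a_n, 1/b_n)   # InvGamma(a, b) draw = 1 / Gamma(a, rate=b)
    samples.append((mu, sigma2))

mu_draws = np.array([s[0] for s in samples[1000:]])  # drop burn-in
print(mu_draws.mean(), np.percentile(mu_draws, [2.5, 97.5]))
```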
Summary
- Conjugate prior families are closed under Bayesian updating: prior and posterior are in the same parametric family.
- For every regular exponential-family likelihood, the conjugate prior comes for free, with hyperparameters $(\tau, \nu)$ updated to $(\tau + \sum_i T(x_i),\ \nu + n)$.
- Standard pairs to know cold: Beta-Bernoulli, Dirichlet-multinomial, Gamma-Poisson, Normal-Normal (known variance), Normal-Inverse-Gamma (unknown variance), Multivariate-Normal-Normal, Normal-Inverse-Wishart.
- The Normal-Normal update gives a precision-weighted average of prior mean and sample mean, with posterior precision equal to the sum of prior precision and data precision.
- The "pseudo-observation" interpretation tells you how strong your prior is: how many imaginary data points it represents.
- Conjugacy is convenience, not correctness. It is the most useful prior structure when it matches your beliefs and a misleading default when it does not.
Exercises
Problem
You observe 8 successes in 12 Bernoulli trials. Compute the posterior under (a) a Beta(1, 1) prior, (b) a Beta(10, 10) prior. Report the posterior mean and 95% equal-tailed credible interval for each.
Problem
Show the precision-weighted form of the Normal-Normal update by completing the square. Start from $\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$ and $x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \dots, n$ with $\sigma^2$ known, and derive the posterior mean and variance.
Problem
Derive the conjugate prior for a Poisson likelihood from the exponential-family recipe. Write Poisson($\lambda$) in exponential-family form, read off $\eta(\lambda)$, $T(x)$, and $A(\lambda)$, and identify the conjugate family.
Problem
For the multivariate Normal-Normal update with known covariance $\Sigma$, prior $\mu \sim \mathcal{N}(\mu_0, \Sigma_0)$, and observations $x_1, \dots, x_n$, derive the posterior predictive distribution of a new observation $x_{n+1}$.
References
Canonical:
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 2 (single-parameter models), Ch. 3 (multi-parameter models, including the full Normal-Inverse-Gamma analysis).
- Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer. Ch. 4 (conjugate priors and natural exponential families).
- Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian Theory. Wiley. Ch. 4–5.
- Robert, C.P. (2007). The Bayesian Choice, 2nd ed. Springer. Ch. 3 (exponential families and conjugate priors).
- Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer. Ch. 4.
Current:
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 4 (conjugate priors), Ch. 9 (Bayesian linear regression as a Normal-Normal-Inverse-Gamma model).
- McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.4.2–2.4.3 (exponential families and conjugate priors).
Next Topics
- Bayesian linear regression: the Normal-Normal-Inverse-Gamma conjugate model applied to the regression coefficient.
- MAP estimation: posterior mode under a conjugate prior; how conjugate priors give clean MAP-as-regularization equivalences.
- Empirical Bayes vs hierarchical Bayes: what to do when you want a prior over hyperparameters and conjugacy is no longer enough.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Maximum A Posteriori (MAP) Estimation (layer 0B · tier 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
- Bayesian Estimation (layer 0B · tier 2)
Derived topics
- Bayesian Linear Regression (layer 2 · tier 1)
- The EM Algorithm (layer 2 · tier 1)
- Empirical Bayes vs Hierarchical Bayes (layer 2 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)