
Statistical Estimation

Conjugate Priors

When the prior and likelihood are paired so that the posterior stays in the same family as the prior. Covers the definition via exponential families, the standard table (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Normal-inverse-gamma, Gamma-Poisson), worked Normal-Normal updates in 1D and the multivariate case, and the pseudo-observation interpretation that makes conjugacy a feature, not a coincidence.

Advanced · Tier 1 · Stable · Core spine · ~80 min

Why This Matters

A prior $\pi(\theta)$ is conjugate to a likelihood $p(x \mid \theta)$ when the posterior $\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$ stays in the same parametric family as the prior. When that happens, posterior inference reduces to updating a small number of parameters, sequential updates are trivial (yesterday's posterior is today's prior), and predictive distributions have closed forms. Without conjugacy, the posterior is generally intractable and you need MCMC, variational inference, or Laplace approximation.

Conjugacy is not magic; it is the geometric consequence of working inside the exponential family. Once you see why, the table of standard pairs (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Gamma-Poisson, Normal-inverse-gamma) stops looking like a list to memorize and starts looking like a single fact viewed from different angles.

Mental Model

Two equivalent framings:

  • Algebraic: the prior density has the same functional form as a normalized power of the likelihood. Multiplying them keeps the form; only the parameters change.
  • Pseudo-observation: the prior carries "imaginary data" that contributed sufficient statistics $T_0$ to your knowledge. Real data contribute $\sum_i T(x_i)$. The posterior is the prior with sufficient statistics $T_0 + \sum_i T(x_i)$: same family, updated parameters.

The second framing is the more useful one. It tells you immediately how strong your prior is (how many pseudo-observations it represents), why a flat prior is the limit of zero pseudo-observations, and what happens as the real sample size grows.

Formal Setup

Definition

Conjugate Prior

Let $\mathcal F = \{\pi(\theta; \tau) : \tau \in T\}$ be a parametric family of distributions on $\Theta$, and let $p(x \mid \theta)$ be a likelihood. The family $\mathcal F$ is conjugate to the likelihood if and only if, for every $\pi(\theta; \tau) \in \mathcal F$ and every dataset $x$, the posterior

$$\pi(\theta \mid x; \tau) \;\propto\; p(x \mid \theta)\, \pi(\theta; \tau)$$

belongs to $\mathcal F$ as well; that is, there exists $\tau' = \tau'(\tau, x) \in T$ with $\pi(\theta \mid x; \tau) = \pi(\theta; \tau')$.

The "family" matters: if you let F\mathcal F be "all distributions on Θ\Theta", every prior is trivially conjugate. Conjugacy is interesting because F\mathcal F is finite-dimensional; usually a textbook family parameterized by a few hyperparameters.

Conjugacy from Exponential Families

The clean theoretical fact: every regular exponential-family likelihood has a natural conjugate prior, constructed by reading off the exponential-family form of the likelihood and treating $(\eta(\theta), -A(\theta))$ as the sufficient statistic of a density over $\theta$.

Theorem

Conjugate Prior for an Exponential-Family Likelihood

Statement

Suppose $p(x \mid \theta) = h(x) \exp\bigl(\eta(\theta)^\top T(x) - A(\theta)\bigr)$ is an exponential-family likelihood with natural parameter $\eta(\theta)$, sufficient statistic $T(x)$, and log-partition $A(\theta)$. Define the conjugate prior family

$$\pi(\theta; \alpha, \nu) \;=\; c(\alpha, \nu) \exp\!\bigl( \eta(\theta)^\top \alpha - \nu A(\theta) \bigr)$$

for hyperparameters $\alpha \in \mathbb R^k$ and $\nu > 0$ (where the prior is normalizable). For i.i.d. observations $x_1, \dots, x_n$, the posterior is in the same family, with updated hyperparameters

$$\alpha' = \alpha + \sum_{i=1}^n T(x_i), \qquad \nu' = \nu + n.$$

Intuition

The prior has the same exponential-family form, but with $\alpha$ playing the role of $\sum_i T(x_i)$ and $\nu$ playing the role of $n$: exactly the sufficient statistics that a dataset would contribute. Combining prior and likelihood adds these contributions linearly. So $\alpha$ counts "pseudo sufficient statistics" and $\nu$ counts "pseudo sample size."

Proof Sketch

The posterior, up to a constant in $\theta$, is

$$\pi(\theta \mid x) \;\propto\; \prod_{i=1}^n p(x_i \mid \theta) \cdot \pi(\theta; \alpha, \nu) \;\propto\; \exp\!\Bigl( \eta(\theta)^\top \Bigl[ \alpha + \sum_{i=1}^n T(x_i) \Bigr] - (\nu + n) A(\theta) \Bigr).$$

This is exactly $\pi(\theta; \alpha', \nu')$ with $\alpha' = \alpha + \sum_i T(x_i)$ and $\nu' = \nu + n$. The constant of proportionality is the new normalizer $c(\alpha', \nu')$.

Why It Matters

This theorem reduces every "what is the conjugate prior?" question to one mechanical operation: write the likelihood in exponential-family form, read off $\eta$, $T$, $A$, and use them. The Beta is conjugate to the Bernoulli/Binomial because the Bernoulli has $T(x) = x$ with $\eta(\theta) = \log\tfrac{\theta}{1-\theta}$, and the Beta density $\propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}$ is exactly the right functional form. The Dirichlet for the multinomial, the Gamma for the Poisson, and the Normal-inverse-gamma for the Normal with unknown variance follow the same pattern.
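A minimal sketch of that mechanical operation, assuming scalar sufficient statistics: the hyperparameters absorb $\sum_i T(x_i)$ and $n$. The Gamma-Poisson pair is used as the concrete instance; reading $(\alpha, \nu)$ as the Gamma shape and rate follows the standard table's $(\alpha + \sum x_i, \beta + n)$ convention, and the data values are illustrative.

```python
def conjugate_update(alpha, nu, T, data):
    """Generic exponential-family conjugate update:
    alpha' = alpha + sum_i T(x_i),  nu' = nu + n."""
    return alpha + sum(T(x) for x in data), nu + len(data)

# Gamma-Poisson instance: T(x) = x, so alpha accumulates event counts and
# nu accumulates the number of observations. Identifying (alpha, nu) with
# the Gamma (shape, rate) reproduces the table's update rule.
data = [3, 1, 4, 1, 5]                       # Poisson counts: sum = 14, n = 5
alpha_n, nu_n = conjugate_update(2.0, 1.0, T=lambda x: x, data=data)
print(alpha_n, nu_n)                         # 16.0, 6.0 -> posterior Gamma(16, 6)
```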

Failure Mode

Two caveats. First, the constructed family is normalizable only for a subset of hyperparameter values; the bookkeeping for which $(\alpha, \nu)$ admit a proper prior is part of the analysis (e.g. $\alpha > 0$ for the Beta, $\nu > -1$ in some inverse-Gamma parameterizations). Second, conjugacy requires regularity of the exponential family; non-regular cases (truncated supports, boundary cases) need ad-hoc analysis. Mixture models are not in an exponential family, so the standard construction does not yield a tractable conjugate prior; this is why mixture-of-Gaussian inference needs EM or MCMC.

The Standard Table

The conjugate pairs that account for ~90% of practical Bayesian work:

| Likelihood $p(x \mid \theta)$ | Conjugate prior $\pi(\theta)$ | Posterior hyperparameter update |
| --- | --- | --- |
| Bernoulli$(\theta)$ | Beta$(\alpha, \beta)$ | $\alpha + \sum x_i,\;\; \beta + n - \sum x_i$ |
| Binomial$(m, \theta)$ | Beta$(\alpha, \beta)$ | $\alpha + \sum x_i,\;\; \beta + mn - \sum x_i$ |
| Categorical/Multinomial$(p)$ | Dirichlet$(\alpha_1,\dots,\alpha_K)$ | $\alpha_k + n_k$ where $n_k = \sum_i \mathbb 1[x_i = k]$ |
| Poisson$(\lambda)$ | Gamma$(\alpha, \beta)$ | $\alpha + \sum x_i,\;\; \beta + n$ |
| Exponential$(\lambda)$ | Gamma$(\alpha, \beta)$ | $\alpha + n,\;\; \beta + \sum x_i$ |
| Geometric$(\theta)$ | Beta$(\alpha, \beta)$ | $\alpha + n,\;\; \beta + \sum x_i - n$ |
| Normal$(\mu, \sigma^2)$, known $\sigma^2$ | Normal$(\mu_0, \tau_0^2)$ | precision-weighted (see below) |
| Normal$(\mu, \sigma^2)$, unknown $\mu, \sigma^2$ | Normal-Inverse-Gamma$(\mu_0, \kappa_0, \alpha_0, \beta_0)$ | NIG update (see below) |
| Multivariate Normal$(\mu, \Sigma)$, known $\Sigma$ | Normal$(\mu_0, \Sigma_0)$ | precision-weighted (multivariate, see below) |
| Multivariate Normal$(\mu, \Sigma)$, unknown $\mu, \Sigma$ | Normal-Inverse-Wishart$(\mu_0, \kappa_0, \Psi, \nu)$ | NIW update |
| Linear regression $y \mid X, \beta \sim \mathcal N(X\beta, \sigma^2 I)$ | $\beta \sim \mathcal N(\mu_0, \Sigma_0)$ (see Bayesian linear regression) | completing the square |

Two read-offs that pay rent everywhere downstream:

  • Pseudo-observation count. $\alpha + \beta$ in the Beta-Bernoulli is your "prior sample size": a Beta$(1,1)$ is one pseudo-sample of each outcome (one heads, one tails), a Beta$(10,10)$ is twenty pseudo-samples. Compare to the real $n$ to gauge how informative your prior is.
  • Mean and variance shrinkage. The posterior mean is always a convex combination of the prior mean and the MLE, with weights determined by pseudo sample size vs. real sample size. The posterior variance shrinks at rate $1/(n + \text{pseudo-}n)$.
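Both read-offs are visible in a few lines of code. A sketch under the obvious assumptions (scipy available, Bernoulli data as 0/1 lists); the batch values are illustrative.

```python
from scipy import stats

def beta_bernoulli_update(alpha, beta, data):
    """Table row 1: alpha' = alpha + #successes, beta' = beta + #failures."""
    k = sum(data)
    return alpha + k, beta + len(data) - k

# Sequential updating: yesterday's posterior is today's prior.
alpha, beta = 10.0, 10.0               # 20 pseudo-observations, prior mean 0.5
for batch in ([1, 0, 1], [1, 1, 1, 0], [0, 1]):
    alpha, beta = beta_bernoulli_update(alpha, beta, batch)

posterior = stats.beta(alpha, beta)
print(alpha + beta)                    # prior pseudo-n (20) + real n (9) = 29
print(posterior.mean())                # ~0.552: shrunk toward 0.5 from the MLE 6/9
```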

Worked Normal-Normal Update (1D, known variance)

This is the algebra to replay for the multivariate case. Let

$$x_i \mid \mu \stackrel{\mathrm{iid}}{\sim} \mathcal N(\mu, \sigma^2), \qquad \mu \sim \mathcal N(\mu_0, \tau_0^2),$$

with $\sigma^2$ known.

Step 1. Write the posterior up to a constant in $\mu$.

$$\log \pi(\mu \mid x) = \sum_{i=1}^n -\tfrac1{2\sigma^2}(x_i - \mu)^2 \;-\; \tfrac1{2\tau_0^2}(\mu - \mu_0)^2 + \text{const}.$$

Step 2. Collect quadratic and linear terms in $\mu$. The squared terms give

$$-\tfrac{n}{2\sigma^2} \mu^2 - \tfrac1{2\tau_0^2} \mu^2 = -\tfrac12 \mu^2 \left( \tfrac{n}{\sigma^2} + \tfrac1{\tau_0^2} \right).$$

The linear-in-$\mu$ terms give

$$\tfrac1{\sigma^2} \mu \sum_i x_i + \tfrac{\mu_0}{\tau_0^2} \mu = \mu \left( \tfrac{n \bar x}{\sigma^2} + \tfrac{\mu_0}{\tau_0^2} \right),$$

where $\bar x = \tfrac1n \sum_i x_i$.

Step 3. Apply the completing-the-square recipe (see the multivariate normal page). With precision $P = \tfrac{n}{\sigma^2} + \tfrac1{\tau_0^2}$ and linear coefficient $b = \tfrac{n \bar x}{\sigma^2} + \tfrac{\mu_0}{\tau_0^2}$, the posterior is Gaussian with mean $b/P$ and variance $1/P$:

$$\boxed{\; \mu \mid x \;\sim\; \mathcal N(\mu_n, \tau_n^2), \quad \tau_n^{-2} = \tau_0^{-2} + n \sigma^{-2}, \quad \mu_n = \tau_n^2 \left( \tfrac{\mu_0}{\tau_0^2} + \tfrac{n \bar x}{\sigma^2} \right). \;}$$

Equivalently, the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar x$:

$$\mu_n = \frac{\tau_0^{-2}}{\tau_0^{-2} + n \sigma^{-2}} \mu_0 \;+\; \frac{n \sigma^{-2}}{\tau_0^{-2} + n \sigma^{-2}} \bar x.$$

Three sanity checks:

  • $n = 0$: posterior equals prior. $\checkmark$
  • $\tau_0 \to \infty$ (flat prior): $\mu_n \to \bar x$, $\tau_n^2 \to \sigma^2 / n$; exactly the MLE and its sampling variance. $\checkmark$
  • $n \to \infty$: $\mu_n \to \bar x$, $\tau_n^2 \to 0$; the posterior concentrates on the truth at rate $1/\sqrt n$. $\checkmark$

This is the pseudo-observation interpretation: the prior contributes $\tau_0^{-2} / \sigma^{-2} = \sigma^2/\tau_0^2$ pseudo-samples, located at $\mu_0$. The posterior is the pooled mean of pseudo and real samples, weighted by precision.
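The boxed update is two lines of arithmetic. A minimal sketch, assuming numpy; the final assertions replay the sanity checks above, with a prior variance of $10^{12}$ standing in for $\tau_0 \to \infty$.

```python
import numpy as np

def normal_normal_posterior(x, sigma2, mu0, tau0_sq):
    """Boxed update: posterior precision = prior precision + n / sigma^2;
    posterior mean = precision-weighted average of mu0 and xbar."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    prec = 1.0 / tau0_sq + n / sigma2
    tau_n_sq = 1.0 / prec
    mu_n = tau_n_sq * (mu0 / tau0_sq + x.sum() / sigma2)
    return mu_n, tau_n_sq

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
mu_n, tau_n_sq = normal_normal_posterior(x, sigma2=1.0, mu0=0.0, tau0_sq=1.0)
print(mu_n, tau_n_sq)                    # mean near xbar, variance near 1/(n + 1)

# Sanity checks from the text:
assert normal_normal_posterior([], 1.0, 0.0, 1.0) == (0.0, 1.0)   # n = 0 -> prior
flat = normal_normal_posterior(x, 1.0, 0.0, 1e12)                 # flat prior -> MLE
assert abs(flat[0] - x.mean()) < 1e-6 and abs(flat[1] - 1.0 / len(x)) < 1e-6
```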

Multivariate Normal-Normal Update

The same derivation in $\mathbb R^d$. Let

$$X_i \mid \mu \stackrel{\mathrm{iid}}{\sim} \mathcal N(\mu, \Sigma), \qquad \mu \sim \mathcal N(\mu_0, \Sigma_0),$$

with $\Sigma$ known. The posterior log-density in $\mu$ is

$$-\tfrac12 \sum_i (X_i - \mu)^\top \Sigma^{-1}(X_i - \mu) \;-\; \tfrac12 (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) + \text{const}.$$

Collect the quadratic-in-$\mu$ and linear-in-$\mu$ terms:

  • Quadratic coefficient (the posterior precision): $n \Sigma^{-1} + \Sigma_0^{-1}$.
  • Linear coefficient: $\Sigma^{-1} \sum_i X_i + \Sigma_0^{-1} \mu_0 = n \Sigma^{-1} \bar X + \Sigma_0^{-1} \mu_0$.

By the completing-the-square recipe,

$$\boxed{\; \mu \mid X_{1:n} \;\sim\; \mathcal N(\mu_n, \Sigma_n), \quad \Sigma_n^{-1} = \Sigma_0^{-1} + n \Sigma^{-1}, \quad \mu_n = \Sigma_n\!\left( \Sigma_0^{-1} \mu_0 + n \Sigma^{-1} \bar X \right). \;}$$

Precision-weighted averaging again, now in the matrix sense. The Bayesian linear regression posterior on the next page is a special case in which the design matrix $X$ replaces the $n$ identity contributions.
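The same update in matrix form. A sketch assuming numpy; the explicit inverses mirror the formulas (in production you would prefer np.linalg.solve), and the data-generating parameters are illustrative.

```python
import numpy as np

def mvn_posterior(X, Sigma, mu0, Sigma0):
    """Boxed multivariate update: Sigma_n^{-1} = Sigma_0^{-1} + n Sigma^{-1}."""
    X = np.atleast_2d(X)
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)        # explicit inverses for clarity only
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(Sigma0_inv + n * Sigma_inv)
    mu_n = Sigma_n @ (Sigma0_inv @ mu0 + n * Sigma_inv @ X.mean(axis=0))
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=100)
mu_n, Sigma_n = mvn_posterior(X, Sigma, mu0=np.zeros(2), Sigma0=np.eye(2))
print(mu_n)       # close to the sample mean for n = 100
print(Sigma_n)    # roughly Sigma / n once the data dominate the prior
```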

Worked Beta-Bernoulli Update

Bernoulli$(\theta)$ has sufficient statistic $T(x) = x$ and the Beta$(\alpha, \beta)$ density is $\theta^{\alpha-1}(1-\theta)^{\beta-1} / B(\alpha, \beta)$. The posterior after $n$ Bernoulli observations with $k = \sum x_i$ successes:

$$\pi(\theta \mid x) \;\propto\; \theta^k (1 - \theta)^{n-k} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha + k - 1}(1-\theta)^{\beta + n - k - 1},$$

so $\theta \mid x \sim \mathrm{Beta}(\alpha + k, \beta + n - k)$.

Worked example. A coin you suspect is roughly fair: prior $\mathrm{Beta}(2, 2)$ (mode at $1/2$, four pseudo-observations). You flip 10 times and get 7 heads. Posterior $\mathrm{Beta}(2 + 7, 2 + 3) = \mathrm{Beta}(9, 5)$. Posterior mean $9/14 \approx 0.643$, posterior mode $(9-1)/(9+5-2) = 8/12 \approx 0.667$. Compare the MLE $7/10 = 0.7$: the posterior pulls toward the prior. With a flat $\mathrm{Beta}(1, 1)$ prior the posterior would be $\mathrm{Beta}(8, 4)$ with mean $8/12 \approx 0.667$ and mode $0.7$ (matching the MLE).
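The worked numbers can be checked directly. A quick sketch assuming scipy, which also yields the equal-tailed credible interval the closed form doesn't display.

```python
from scipy import stats

# The coin example: Beta(2, 2) prior, 7 heads in 10 flips -> Beta(9, 5).
post = stats.beta(2 + 7, 2 + 3)
print(post.mean())              # 9/14 ~= 0.643
print((9 - 1) / (9 + 5 - 2))    # mode 8/12 ~= 0.667
print(post.interval(0.95))      # 95% equal-tailed credible interval for theta
```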

Normal-Inverse-Gamma: Unknown Mean and Variance

When the variance is unknown, the conjugate prior for $(\mu, \sigma^2)$ is the Normal-Inverse-Gamma ($\mathrm{NIG}$):

$$\sigma^2 \sim \mathrm{IG}(\alpha_0, \beta_0), \qquad \mu \mid \sigma^2 \sim \mathcal N(\mu_0, \sigma^2 / \kappa_0).$$

This is one distribution on the joint pair, not two independent priors. The variance prior is $\mathrm{IG}(\alpha_0, \beta_0)$, and given the variance, the mean prior is Gaussian with variance proportional to $\sigma^2$. The dependence is what makes the update closed-form.

The posterior after $n$ observations is again $\mathrm{NIG}$, with parameters

$$\mu_n = \frac{\kappa_0 \mu_0 + n \bar x}{\kappa_0 + n}, \quad \kappa_n = \kappa_0 + n, \quad \alpha_n = \alpha_0 + \tfrac n2,$$

$$\beta_n = \beta_0 + \tfrac12 \sum_{i=1}^n (x_i - \bar x)^2 + \frac{\kappa_0 n (\bar x - \mu_0)^2}{2(\kappa_0 + n)}.$$

The three contributions to $\beta_n$ are: prior, residual variance, and a "mismatch between sample mean and prior mean" penalty. The last term grows when the data and prior disagree about where the mean is, inflating posterior uncertainty about $\sigma^2$.

The marginal posterior on $\mu$ (integrating out $\sigma^2$) is Student's $t$ with $2\alpha_n$ degrees of freedom, mean $\mu_n$, and scale $\sqrt{\beta_n / (\alpha_n \kappa_n)}$. The Student-$t$ tails are what unknown-variance Bayesian intervals correctly account for.
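A sketch of the full NIG update and the Student-$t$ marginal, assuming numpy and scipy; the hyperparameter values are illustrative.

```python
import numpy as np
from scipy import stats

def nig_update(x, mu0, kappa0, alpha0, beta0):
    """Normal-Inverse-Gamma posterior parameters for unknown mean and variance."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * ((x - xbar) ** 2).sum()          # residual variance
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))  # mismatch penalty
    return mu_n, kappa_n, alpha_n, beta_n

x = np.random.default_rng(2).normal(loc=3.0, scale=2.0, size=30)
mu_n, kappa_n, alpha_n, beta_n = nig_update(x, mu0=0.0, kappa0=1.0,
                                            alpha0=2.0, beta0=2.0)

# Marginal posterior on mu: Student-t with 2*alpha_n degrees of freedom.
marginal = stats.t(df=2 * alpha_n, loc=mu_n,
                   scale=np.sqrt(beta_n / (alpha_n * kappa_n)))
print(marginal.interval(0.95))   # credible interval with the correct heavy tails
```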

Common Confusions

Watch Out

Conjugacy is mathematical convenience, not statistical truth

Choosing a conjugate prior because it makes the algebra easy is fine; pretending it is the right prior because it is conjugate is not. A Beta$(1, 1)$ prior on a click-through rate is mathematically convenient, but if your domain knowledge says click-throughs are usually below 0.05, a Beta$(1, 19)$ is more honest. Conjugacy is a property of the prior-likelihood pair, not of the prior alone. Use it when it lines up with your prior beliefs; reach for non-conjugate priors (with MCMC or VI) when it does not.

Watch Out

A flat prior is the limit of a vague conjugate prior, not the absence of a prior

A Beta$(0, 0)$ or a Normal with infinite variance is an improper prior; it doesn't integrate to 1. Posteriors built from improper priors are valid as long as the posterior itself is proper (integrable). With $n$ Bernoulli observations and a Beta$(0, 0)$ prior, the posterior is $\mathrm{Beta}(k, n - k)$, which is proper as long as $0 < k < n$. Improper priors are useful as "least committal" defaults, but they break when the data alone don't determine the parameter.
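A small illustration of the propriety condition, with the guard made explicit; the helper name is mine, not a standard API, and the example data are illustrative.

```python
from scipy import stats

def haldane_posterior(k, n):
    """Posterior under the improper Beta(0, 0) prior: Beta(k, n - k).
    Proper only when 0 < k < n."""
    if not 0 < k < n:
        raise ValueError("improper posterior: need at least one success and one failure")
    return stats.beta(k, n - k)

print(haldane_posterior(7, 10).mean())    # 0.7: posterior mean equals the MLE k/n

try:
    haldane_posterior(10, 10)             # all successes: data can't pin down theta
except ValueError as e:
    print(e)
```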

Watch Out

The MAP is not the posterior mean; they coincide only for symmetric posteriors

Conjugate priors give closed-form posteriors, but the posterior mean and the posterior mode (MAP) differ whenever the posterior is skewed. For $\mathrm{Beta}(2, 5)$, the posterior mean is $2/7 \approx 0.286$ and the mode is $(2-1)/(2+5-2) = 1/5 = 0.2$. Which one to report depends on the loss function: the posterior mean minimizes squared-error loss, the posterior median minimizes absolute-error loss, and the posterior mode minimizes 0-1 loss. For ML purposes, the mean is almost always the right summary; the mode is convenient mainly because MAP estimation corresponds to penalized MLE.
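The three summaries diverge visibly even for this small example; a quick check assuming scipy.

```python
from scipy import stats

post = stats.beta(2, 5)
print(post.mean())             # 2/7 ~= 0.286, minimizes squared-error loss
print(post.median())           # ~0.264, minimizes absolute-error loss
print((2 - 1) / (2 + 5 - 2))   # mode 0.2, the MAP, minimizes 0-1 loss
```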

Watch Out

Conjugacy is not unique

The same likelihood often has several conjugate prior families. For instance, the Beta is conjugate to the Bernoulli, but so are finite mixtures of Betas, and so are degenerate point masses. The "natural" conjugate is the minimally parameterized family matching the exponential-family structure of the likelihood, with one free hyperparameter per sufficient-statistic dimension plus one for $\nu$. This is what the standard table lists.

Why Conjugacy Stops Being Enough

Conjugate updating works when the prior is a single member of the family. Practical Bayesian work often wants hierarchical priors: the hyperparameters themselves get priors, and the resulting full posterior is no longer in a closed-form family even if the conditional posterior $\pi(\theta \mid \tau, x)$ is. This is where conjugacy stops being a complete recipe and where Gibbs sampling, variational inference, and Laplace approximation step in. The conditional posteriors $\pi(\theta \mid \tau, x)$ are still conjugate at each level, which is what makes Gibbs samplers efficient: each conditional update is a closed-form draw. Empirical Bayes is an alternative that estimates hyperparameters by maximum marginal likelihood instead of placing a prior on them.
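A sketch of the Gibbs idea in the simplest case where full conjugacy fails: a Normal likelihood with independent ("semi-conjugate") priors $\mu \sim \mathcal N(m, v)$ and $\sigma^2 \sim \mathrm{IG}(a, b)$. The joint posterior has no closed form, but each full conditional is one of the conjugate updates above, so every Gibbs step is a closed-form draw. All hyperparameter values are illustrative.

```python
import numpy as np

def gibbs_normal(x, m=0.0, v=10.0, a=2.0, b=2.0, iters=2000, seed=0):
    """Gibbs sampling for N(mu, sigma2) with mu ~ N(m, v), sigma2 ~ IG(a, b).
    Each conditional is conjugate, so every step is a closed-form draw."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma2 = x.mean(), x.var()         # initialize at data summaries
    samples = []
    for _ in range(iters):
        # mu | sigma2, x: the Normal-Normal update with "known" variance
        prec = 1.0 / v + n / sigma2
        mu = rng.normal((m / v + x.sum() / sigma2) / prec, np.sqrt(1.0 / prec))
        # sigma2 | mu, x: inverse-gamma update (draw a gamma rate, invert)
        shape = a + n / 2.0
        rate = b + 0.5 * ((x - mu) ** 2).sum()
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        samples.append((mu, sigma2))
    return np.array(samples)

x = np.random.default_rng(3).normal(loc=1.0, scale=2.0, size=100)
draws = gibbs_normal(x)
print(draws[500:].mean(axis=0))   # posterior means of (mu, sigma2) after burn-in
```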

Summary

  • Conjugate prior families are closed under Bayesian updating: prior and posterior are in the same parametric family.
  • For every regular exponential-family likelihood, the conjugate prior comes for free, with hyperparameters $(\alpha, \nu)$ updated to $(\alpha + \sum_i T(x_i), \nu + n)$.
  • Standard pairs to know cold: Beta-Bernoulli, Dirichlet-multinomial, Gamma-Poisson, Normal-Normal (known variance), Normal-Inverse-Gamma (unknown variance), Multivariate-Normal-Normal, Normal-Inverse-Wishart.
  • The Normal-Normal update gives a precision-weighted average of prior mean and sample mean, with posterior precision equal to the sum of prior precision and data precision.
  • The "pseudo-observation" interpretation tells you how strong your prior is: how many imaginary data points it represents.
  • Conjugacy is convenience, not correctness. It is the most useful prior structure when it matches your beliefs and a misleading default when it does not.

Exercises

Exercise (Core)

Problem

You observe 8 successes in 12 Bernoulli trials. Compute the posterior under (a) a $\mathrm{Beta}(1, 1)$ prior, (b) a $\mathrm{Beta}(5, 5)$ prior. Report the posterior mean and 95% equal-tailed credible interval for each.

Exercise (Core)

Problem

Show the precision-weighted form of the Normal-Normal update by completing the square. Start from $\mu \sim \mathcal N(\mu_0, \tau_0^2)$ and $x_i \mid \mu \sim \mathcal N(\mu, \sigma^2)$ for $i = 1, \dots, n$ with $\sigma^2$ known, and derive the posterior mean and variance.

Exercise (Advanced)

Problem

Derive the conjugate prior for a Poisson likelihood from the exponential-family recipe. Write Poisson$(\lambda)$ in exponential-family form, read off $T(x)$, $\eta(\lambda)$, and $A(\lambda)$, and identify the conjugate family.

Exercise (Research)

Problem

For the multivariate Normal-Normal update with known covariance $\Sigma$, prior $\mu \sim \mathcal N(\mu_0, \Sigma_0)$, and observations $X_1, \dots, X_n \sim \mathcal N(\mu, \Sigma)$, derive the posterior predictive distribution of a new observation $X_{n+1}$.

References

Canonical:

  • Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 2 (single-parameter models), Ch. 3 (multi-parameter models, including the full Normal-Inverse-Gamma analysis).
  • Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer. Ch. 4 (conjugate priors and natural exponential families).
  • Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian Theory. Wiley. Ch. 4–5.
  • Robert, C.P. (2007). The Bayesian Choice, 2nd ed. Springer. Ch. 3 (exponential families and conjugate priors).
  • Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer. Ch. 4.

Current:

  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 4 (conjugate priors), Ch. 9 (Bayesian linear regression as a Normal-Normal-Inverse-Gamma model).
  • McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press.
  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.4.2–2.4.3 (exponential families and conjugate priors).


Last reviewed: May 10, 2026
