

Multivariate Distributions Atlas

A navigational index of the multivariate distributions used in machine learning and statistics beyond the multivariate Gaussian: Multinomial, Multivariate-t, Dirichlet, Wishart and inverse Wishart, and the copula construction for arbitrary marginals. Each entry gives the definition, parameterization, key facts, and where the dedicated page lives.


Why This Matters

The multivariate Gaussian is the only multivariate distribution most ML curricula treat in depth. That leaves five other distributions without canonical landing pages on the site: the Multinomial (the vector generalization of the Binomial), the Multivariate-t (the heavy-tailed counterpart of the multivariate Gaussian), the Dirichlet (the canonical distribution over the probability simplex, and the conjugate prior to the Multinomial), the Wishart and inverse Wishart (matrix-valued, the conjugate priors for Gaussian precision and covariance), and the copula construction (a recipe for building joint distributions with arbitrary marginals and a chosen dependence structure).

This page is an atlas, not a full treatment. Each entry gives the definition, the parameterization that matters in practice, the key result that makes it useful, and a pointer to the dedicated page where one exists.

Quick Index

| Distribution | Support | Parameters | Role in ML |
| --- | --- | --- | --- |
| Multinomial$(n, p_1, \ldots, p_K)$ | $\{(n_1, \ldots, n_K) : n_k \geq 0,\ \sum_k n_k = n\}$ | Trial count $n$, probability vector $p$ on $\Delta^{K-1}$ | Categorical predictions, $n$-gram counts, classification loss |
| Multivariate-t $t_\nu(\mu, \Sigma)$ | $\mathbb{R}^d$ | Location $\mu$, scale matrix $\Sigma$, degrees of freedom $\nu$ | Heavy-tailed counterpart to the MVN; Bayesian posterior over an MVN mean with unknown variance |
| Dirichlet$(\alpha_1, \ldots, \alpha_K)$ | Probability simplex $\Delta^{K-1}$ | Concentration vector $\alpha > 0$ | Conjugate prior to the Multinomial; topic-model priors; mixture weights |
| Wishart$(V, \nu)$ | Positive-definite $d \times d$ matrices | Scale matrix $V$, degrees of freedom $\nu \geq d$ | Sampling distribution of $\sum_i (X_i - \bar X)(X_i - \bar X)^\top$ for MVN samples |
| Inverse Wishart$(V, \nu)$ | Positive-definite $d \times d$ matrices | Scale matrix $V$, degrees of freedom $\nu$ | Conjugate prior on the MVN covariance |
| Copula | $[0, 1]^d$ | A CDF with uniform marginals | Builds joint laws from arbitrary marginals and a chosen dependence structure |

Multinomial

Definition

Multinomial Distribution

Let $p = (p_1, \ldots, p_K)$ with $p_k \geq 0$ and $\sum_k p_k = 1$. The Multinomial distribution with $n$ trials and category probabilities $p$ assigns to the count vector $(n_1, \ldots, n_K)$ with $\sum_k n_k = n$ the probability $$P(N_1 = n_1, \ldots, N_K = n_K) = \frac{n!}{n_1! \cdots n_K!} \prod_{k=1}^K p_k^{n_k}.$$ The marginal of any single $N_k$ is $\mathrm{Binomial}(n, p_k)$.

Key facts:

  • Mean: $\mathbb{E}[N_k] = n p_k$; variance: $\mathrm{Var}(N_k) = n p_k (1 - p_k)$; covariance: $\mathrm{Cov}(N_j, N_k) = -n p_j p_k$ for $j \neq k$.
  • Covariances are negative because counts add to $n$: more in category $j$ forces fewer elsewhere. The covariance matrix is therefore singular, with a one-dimensional null space spanned by $(1, 1, \ldots, 1)$.
  • Asymptotics: for fixed $p$, $(N - np)/\sqrt{n} \to \mathcal N(0, \Sigma)$ as $n \to \infty$, where $\Sigma_{jk} = p_j \delta_{jk} - p_j p_k$. This is the multivariate version of the de Moivre-Laplace theorem.
  • Special case $K = 2$: the Binomial. The Multinomial is what categorical classification produces under any softmax-style model; cross-entropy loss is its negative log-likelihood.
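
A quick numerical check of these facts — a minimal sketch using NumPy; the seed, sample size, and parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, np.array([0.5, 0.3, 0.2])

# Theoretical covariance: Sigma_jk = n p_k delta_jk - n p_j p_k
Sigma = n * (np.diag(p) - np.outer(p, p))

# Empirical covariance from simulated count vectors
samples = rng.multinomial(n, p, size=200_000)
Sigma_hat = np.cov(samples, rowvar=False)

print(np.round(Sigma, 3))            # off-diagonal entries are negative
print(np.round(Sigma_hat, 3))        # close to the theoretical matrix
print(np.linalg.matrix_rank(Sigma))  # K - 1 = 2: singular, null space (1,1,1)
```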

Multivariate-t

Definition

Multivariate-t Distribution

A random vector $X \in \mathbb{R}^d$ is multivariate-t with degrees of freedom $\nu > 0$, location $\mu \in \mathbb{R}^d$, and positive-definite scale matrix $\Sigma$ iff its density is $$f(x) = \frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |\Sigma|^{1/2}} \left[1 + \frac{1}{\nu}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right]^{-(\nu + d)/2}.$$

Equivalently: $X = \mu + Z / \sqrt{W/\nu}$, where $Z \sim \mathcal N(0, \Sigma)$ and $W \sim \chi^2_\nu$ are independent.

Key facts:

  • Mean exists iff $\nu > 1$ and equals $\mu$. Covariance exists iff $\nu > 2$ and equals $\frac{\nu}{\nu - 2} \Sigma$. So $\Sigma$ is the scale matrix, not the covariance.
  • Heavy tails: the $k$-th absolute moment exists iff $\nu > k$. Each coordinate is a univariate Student-$t_\nu$.
  • Limit cases: as $\nu \to \infty$, the multivariate-t converges to $\mathcal N(\mu, \Sigma)$. At $\nu = 1$ it is the multivariate Cauchy.
  • Bayesian role: the marginal posterior over the mean of a multivariate Gaussian with unknown covariance and an inverse-Wishart prior is exactly multivariate-t. This is the multivariate version of the t-test.
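
The scale-mixture representation above doubles as a sampler. The sketch below (assuming NumPy; the parameter values are illustrative) draws from $t_\nu(\mu, \Sigma)$ and confirms that the empirical covariance is $\frac{\nu}{\nu - 2} \Sigma$, not $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
nu, mu = 5.0, np.zeros(2)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

m = 500_000
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=m)  # Z ~ N(0, Sigma)
W = rng.chisquare(nu, size=m)                            # W ~ chi^2_nu
X = mu + Z / np.sqrt(W / nu)[:, None]                    # X ~ t_nu(mu, Sigma)

print(np.round(np.cov(X, rowvar=False), 2))  # approx nu/(nu-2) * Sigma
print(np.round(nu / (nu - 2) * Sigma, 2))    # = 1.667 * Sigma, not Sigma
```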

Dirichlet

Definition

Dirichlet Distribution

Let $\alpha = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_k > 0$ and $\alpha_0 = \sum_k \alpha_k$. The Dirichlet distribution on the simplex $\Delta^{K-1} = \{p : p_k \geq 0, \sum_k p_k = 1\}$ has density $$f(p) = \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1}.$$ The marginal of any $p_k$ is $\mathrm{Beta}(\alpha_k, \alpha_0 - \alpha_k)$.

Key facts:

  • Mean: $\mathbb{E}[p_k] = \alpha_k / \alpha_0$. The vector $\alpha$ encodes both the mean direction and a concentration $\alpha_0$: larger $\alpha_0$ makes the distribution tighter around the mean.
  • Variance: $\mathrm{Var}(p_k) = \alpha_k (\alpha_0 - \alpha_k) / (\alpha_0^2 (\alpha_0 + 1))$. The denominator $\alpha_0 + 1$ is what makes large $\alpha_0$ correspond to concentration.
  • Special case $K = 2$: the Beta. The Dirichlet is the natural multivariate generalization of the Beta.
  • Concentration regimes: $\alpha_k < 1$ for all $k$ pushes mass to the corners of the simplex (sparse, mode-seeking); $\alpha_k = 1$ for all $k$ gives the uniform distribution on the simplex; $\alpha_k > 1$ pushes mass toward the interior (smooth, averaging).
  • Topic models and mixtures: Dirichlet priors are the standard choice for mixture weights in latent Dirichlet allocation (LDA) and most Bayesian mixture models. An asymmetric Dirichlet allows different prior weights per topic.
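
The concentration regimes are easy to see by simulation. A minimal sketch (NumPy; the 0.9 dominance threshold and the $\alpha$ values are illustrative) counts how often one coordinate dominates a draw:

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 3, 100_000

for a in (0.1, 1.0, 10.0):
    p = rng.dirichlet(np.full(K, a), size=m)
    # Fraction of draws where a single coordinate dominates (> 0.9):
    sparse = (p.max(axis=1) > 0.9).mean()
    print(f"alpha_k = {a:>4}: max coordinate > 0.9 in {sparse:.1%} of draws")
```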
Theorem

Dirichlet-Multinomial Conjugacy

Statement

Let the prior be $p \sim \mathrm{Dir}(\alpha)$ and let $N \mid p \sim \mathrm{Mult}(n, p)$ be the observed counts. Then the posterior is $$p \mid N \sim \mathrm{Dir}(\alpha + N).$$ The Dirichlet is the conjugate prior to the Multinomial.

Intuition

The Dirichlet density is proportional to $\prod_k p_k^{\alpha_k - 1}$ and the Multinomial likelihood to $\prod_k p_k^{n_k}$ (dropping the multinomial coefficient, which does not depend on $p$). Multiplying gives $\prod_k p_k^{\alpha_k + n_k - 1}$, which is a Dirichlet kernel with concentration vector $\alpha + N$. The math turns prior counts into observed counts by addition.

Proof Sketch

Bayes' rule: $f(p \mid N) \propto f(N \mid p)\, f(p) \propto \prod_k p_k^{n_k} \cdot \prod_k p_k^{\alpha_k - 1} = \prod_k p_k^{\alpha_k + n_k - 1}$. This is the Dirichlet kernel with concentration $\alpha + N$. The normalizing constant adjusts automatically.

Why It Matters

Conjugacy gives closed-form posterior updates with no MCMC: prior $\alpha$ plus data $N$ produces a new Dirichlet. The interpretation of $\alpha$ as "pseudo-counts" makes the prior easy to elicit: $\alpha_k = 1$ for all $k$ is one pseudo-observation per category. This is what makes Dirichlet-Multinomial the default in topic models, $n$-gram smoothing, and Bayesian text categorization.
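
The update itself is one line of arithmetic. A minimal sketch (NumPy; the counts are illustrative):

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])    # one pseudo-observation per category
counts      = np.array([7, 2, 1])          # observed Multinomial counts, n = 10

alpha_post = alpha_prior + counts          # posterior is Dir(alpha + N)
post_mean = alpha_post / alpha_post.sum()  # E[p_k | N] = (alpha_k + n_k) / (alpha_0 + n)

print(alpha_post)               # [8. 3. 2.]
print(np.round(post_mean, 3))   # [0.615 0.231 0.154]
```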

Failure Mode

Conjugacy is a property of the family, not of arbitrary priors over the simplex. A prior that is uniform on a subset of the simplex (a truncated Dirichlet) loses conjugacy: the posterior is no longer Dirichlet.

Wishart and Inverse Wishart

Definition

Wishart Distribution

The Wishart distribution $W_d(V, \nu)$ is the distribution of $W = \sum_{i=1}^\nu X_i X_i^\top$, where $X_1, \ldots, X_\nu$ are i.i.d. $\mathcal N(0, V)$, $V$ is a positive-definite $d \times d$ matrix, and $\nu \geq d$. The density is $$f(W) = \frac{|W|^{(\nu - d - 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{tr}(V^{-1} W)\right)}{2^{\nu d / 2}\, |V|^{\nu/2}\, \Gamma_d(\nu/2)}$$ on the cone of positive-definite matrices, where $\Gamma_d$ is the multivariate gamma function.

Definition

Inverse Wishart Distribution

A matrix $\Sigma$ has the inverse Wishart distribution with scale $V$ and degrees of freedom $\nu$, written $\Sigma \sim W_d^{-1}(V, \nu)$, if $\Sigma^{-1} \sim W_d(V^{-1}, \nu)$. Equivalently, the inverse Wishart density is $$f(\Sigma) = \frac{|V|^{\nu/2}\, |\Sigma|^{-(\nu + d + 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{tr}(V \Sigma^{-1})\right)}{2^{\nu d / 2}\, \Gamma_d(\nu/2)}$$ on the positive-definite cone.

Key facts:

  • Sampling distribution: if $X_1, \ldots, X_n$ are i.i.d. $\mathcal N(\mu, \Sigma)$ and $S = \sum_i (X_i - \bar X)(X_i - \bar X)^\top$, then $S \sim W_d(\Sigma, n - 1)$. The Wishart is the multivariate version of the $\chi^2$ that appears in univariate sample-variance theory.
  • Conjugate prior: if the prior on the covariance is $\Sigma \sim W_d^{-1}(V_0, \nu_0)$, the data are $X_i \sim \mathcal N(\mu, \Sigma)$, and $\mu$ is known, the posterior is $\Sigma \mid X \sim W_d^{-1}\left(V_0 + S + n(\bar X - \mu)(\bar X - \mu)^\top,\ \nu_0 + n\right)$; note that $S + n(\bar X - \mu)(\bar X - \mu)^\top = \sum_i (X_i - \mu)(X_i - \mu)^\top$.
  • Mean: $\mathbb{E}[W] = \nu V$ for the Wishart; $\mathbb{E}[\Sigma] = V / (\nu - d - 1)$ for the inverse Wishart (when $\nu > d + 1$). The inverse Wishart has fat tails by design.
  • Univariate case $d = 1$: Wishart$(V, \nu)$ is $V \cdot \chi^2_\nu$; the inverse Wishart is the scaled inverse chi-squared.
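
A numerical check of the outer-product construction and the mean formula — a sketch assuming NumPy and SciPy are available; the dimension, degrees of freedom, and scale matrix are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, nu = 2, 10
V = np.array([[1.0, 0.3],
              [0.3, 0.5]])

# Wishart draws built directly as sums of outer products of N(0, V) vectors
m = 50_000
X = rng.multivariate_normal(np.zeros(d), V, size=(m, nu))  # shape (m, nu, d)
W = np.einsum('mni,mnj->mij', X, X)                        # W = sum_i x_i x_i^T

# The same distribution via SciPy's sampler; both should average to nu * V
W_scipy = stats.wishart.rvs(df=nu, scale=V, size=m, random_state=rng)

print(np.round(W.mean(axis=0), 2))        # approx nu * V
print(np.round(W_scipy.mean(axis=0), 2))  # approx nu * V
print(nu * V)                             # E[W] = nu V exactly
```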

The Wishart is mathematically clean but practically constraining for Bayesian work because ν\nu and VV are not separable: increasing the prior strength (degrees of freedom) also stretches the mean. The LKJ distribution is a modern alternative for correlation-matrix priors that decouples these concerns; it does not have a canonical page here yet.

Copulas

The other distributions on this page generalize specific univariate laws (Binomial → Multinomial, Normal → MVN). The copula construction is different: it lets you build a multivariate distribution from any chosen marginals plus any chosen dependence structure.

Definition

Copula

A copula is a CDF $C : [0, 1]^d \to [0, 1]$ whose marginals are uniform on $[0, 1]$. Given any continuous univariate marginals $F_1, \ldots, F_d$ and any copula $C$, the function $F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d))$ is a valid $d$-dimensional CDF with marginals $F_1, \ldots, F_d$.

Sklar's theorem says the reverse: every continuous multivariate CDF $F$ decomposes uniquely as a copula applied to its marginals. So the copula isolates the dependence structure from the marginals.

Standard copula families include the Gaussian (correlation matrix $R$; the copula of $\mathcal N(0, R)$), the Student-t (tail dependence in addition to correlation), and the Archimedean family (Clayton, Gumbel, Frank, each parameterized by a single dependence scalar).
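
The Gaussian-copula construction is mechanical: draw correlated Gaussians, push them through $\Phi$ to get copula uniforms, then apply the inverse CDFs of whatever marginals you want. A minimal sketch (NumPy/SciPy; the exponential and lognormal marginals are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])  # correlation matrix of the Gaussian copula

# 1. correlated standard normals; 2. Phi maps them to uniforms
#    (a draw from the Gaussian copula C); 3. inverse CDFs impose marginals
z = rng.multivariate_normal(np.zeros(2), R, size=100_000)
u = stats.norm.cdf(z)                     # uniform marginals, Gaussian dependence
x1 = stats.expon.ppf(u[:, 0], scale=2.0)  # Exponential(mean 2) marginal
x2 = stats.lognorm.ppf(u[:, 1], s=0.5)    # Lognormal marginal

# Marginals come out as specified; dependence survives the transforms
print(np.round(np.mean(x1), 2))                 # approx 2.0
print(np.round(stats.spearmanr(x1, x2)[0], 2))  # rank correlation > 0
```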

The full treatment is on the copulas page. The atlas entry above is the navigational pointer.

When to Reach For Each

| Goal | Pick |
| --- | --- |
| Categorical counts (text, classification) | Multinomial |
| Heavy-tailed multivariate noise | Multivariate-t |
| Distribution over probabilities (mixture weights, topic distributions) | Dirichlet |
| Bayesian prior on a covariance / precision matrix | Wishart (precision) or inverse Wishart (covariance) |
| Joint law with non-Gaussian marginals and a chosen dependence | Copula plus chosen marginals |

Common Confusions

Watch Out

Multinomial covariances are negative

The covariance matrix of a Multinomial is $\Sigma_{jk} = n p_k \delta_{jk} - n p_j p_k$. The off-diagonal entries are strictly negative because the constraint $\sum_k N_k = n$ forces counts to compete. This is not noise: the rank of the covariance matrix is $K - 1$, not $K$. Treating the Multinomial as having full-rank covariance breaks every subsequent computation.

Watch Out

Multivariate-t scale matrix is not the covariance

For $X \sim t_\nu(\mu, \Sigma)$, the covariance is $\frac{\nu}{\nu - 2} \Sigma$, not $\Sigma$. The parameter $\Sigma$ is the scale matrix of the underlying multivariate Gaussian in the scale-mixture representation; the covariance only exists for $\nu > 2$ and is inflated. Fit code that treats $\Sigma$ as the covariance silently underestimates variance.

Watch Out

Dirichlet concentration is the sum, not the individual entries

The "concentration" of a Dirichlet is the scalar α0=kαk\alpha_0 = \sum_k \alpha_k. The individual αk\alpha_k encode the direction (mean is α/α0\alpha / \alpha_0). Doubling all αk\alpha_k keeps the mean fixed and tightens the distribution; doubling one αk\alpha_k shifts the mean. People who say "I set Dirichlet concentration to 0.5" usually mean "αk=0.5\alpha_k = 0.5 for all kk", which is α0=0.5K\alpha_0 = 0.5 K.

Watch Out

Wishart degrees of freedom must be at least the dimension

The Wishart density is only well-defined for $\nu \geq d$. Below that, the matrix $W$ is singular almost surely and the density formula does not apply. Bayesian implementations that allow $\nu = d - 1$ or smaller are silently producing degenerate matrices.

Exercises

ExerciseCore

Problem

Let $(N_1, N_2, N_3) \sim \mathrm{Mult}(n = 10,\ p = (0.5, 0.3, 0.2))$. Compute $\mathrm{Cov}(N_1, N_2)$ and verify that the covariance matrix is rank-deficient.

ExerciseCore

Problem

Suppose the prior on a coin's bias is $p \sim \mathrm{Beta}(2, 2)$ (the $K = 2$ case of a Dirichlet, with $\alpha = (2, 2)$). You observe $7$ heads in $10$ flips. What is the posterior distribution of $p$?

ExerciseAdvanced

Problem

Let $X_1, \ldots, X_n$ be i.i.d. $\mathcal N(0, \Sigma)$ in $\mathbb{R}^d$ with $n > d$. Define $S = \sum_{i=1}^n X_i X_i^\top$. Show that for any non-zero $v \in \mathbb{R}^d$, the scalar $v^\top S v / v^\top \Sigma v$ has the $\chi^2_n$ distribution. Conclude that the diagonal entries of $S$ have scaled chi-squared marginals.

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 4 (multivariate distributions) and Chapter 5 (sample covariance)
  • Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012), Chapter 2.5 (Multinomial, Dirichlet) and Chapter 4.6 (Wishart, inverse Wishart)

Current:

  • Bishop, Pattern Recognition and Machine Learning (Springer, 2006), Chapter 2.2-2.3 (Dirichlet, Multinomial, conjugate priors)
  • Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapter 3 (multivariate normal model with Wishart prior)
  • Lewandowski, Kurowicka & Joe, "Generating Random Correlation Matrices Based on Vines and Extended Onion Method" (2009) — the LKJ correlation prior

Multivariate-t:

  • Kotz & Nadarajah, Multivariate t Distributions and Their Applications (Cambridge, 2004)

Last reviewed: May 13, 2026
