

Multivariate Distributions Atlas

A navigational index of the multivariate distributions used in machine learning and statistics beyond the multivariate Gaussian: Multinomial, Multivariate-t, Dirichlet, Wishart and inverse Wishart, and the copula construction for arbitrary marginals. Each entry gives the definition, parameterization, key facts, and where the dedicated page lives.


Why This Matters

The multivariate Gaussian is the only multivariate distribution most ML curricula treat in depth. That leaves five other distributions without canonical landing pages on the site: the Multinomial (the vector generalization of the Binomial), the Multivariate-t (the heavy-tailed counterpart of the multivariate Gaussian), the Dirichlet (the canonical distribution over the probability simplex, and the conjugate prior to the Multinomial), the Wishart and inverse Wishart (matrix-valued, the conjugate priors for Gaussian precision and covariance), and the copula construction (a recipe for building joint distributions with arbitrary marginals and a chosen dependence structure).

This page is an atlas, not a full treatment. Each entry gives the definition, the parameterization that matters in practice, the key result that makes it useful, and a pointer to the dedicated page where one exists.

Quick Index

| Distribution | Support | Parameters | Role in ML |
| --- | --- | --- | --- |
| Multinomial$(n, p_1, \ldots, p_K)$ | $\{(n_1, \ldots, n_K) : n_k \geq 0,\ \sum_k n_k = n\}$ | Trial count $n$, probability vector $p$ on $\Delta^{K-1}$ | Categorical predictions, $n$-gram counts, classification loss |
| Multivariate-t $t_\nu(\mu, \Sigma)$ | $\mathbb{R}^d$ | Location $\mu$, scale matrix $\Sigma$, degrees of freedom $\nu$ | Heavy-tailed counterpart to the MVN; Bayesian posterior over an MVN mean with unknown variance |
| Dirichlet$(\alpha_1, \ldots, \alpha_K)$ | Probability simplex $\Delta^{K-1}$ | Concentration vector $\alpha > 0$ | Conjugate prior to the Multinomial; topic-model priors; mixture weights |
| Wishart$(V, \nu)$ | Positive-definite $d \times d$ matrices | Scale matrix $V$, degrees of freedom $\nu \geq d$ | Sampling distribution of $\sum_i (X_i - \bar X)(X_i - \bar X)^\top$ for MVN samples |
| Inverse Wishart$(V, \nu)$ | Positive-definite $d \times d$ matrices | Scale matrix $V$, degrees of freedom $\nu$ | Conjugate prior on the MVN covariance |
| Copula | $[0, 1]^d$ | A CDF with uniform marginals | Builds joint laws from arbitrary marginals and a chosen dependence structure |

Multinomial

Definition

Multinomial Distribution

Let $p = (p_1, \ldots, p_K)$ with $p_k \geq 0$ and $\sum_k p_k = 1$. The Multinomial distribution with $n$ trials and category probabilities $p$ assigns to the count vector $(n_1, \ldots, n_K)$ with $\sum_k n_k = n$ the probability $$P(N_1 = n_1, \ldots, N_K = n_K) = \frac{n!}{n_1! \cdots n_K!} \prod_{k=1}^K p_k^{n_k}.$$ The marginal of any single $N_k$ is $\mathrm{Binomial}(n, p_k)$.

Key facts:

  • Mean: $\mathbb{E}[N_k] = n p_k$; variance: $\mathrm{Var}(N_k) = n p_k (1 - p_k)$; covariance: $\mathrm{Cov}(N_j, N_k) = -n p_j p_k$ for $j \neq k$.
  • Covariances are negative because counts add to $n$: more in category $j$ forces fewer elsewhere. The covariance matrix is therefore singular, with a one-dimensional null space spanned by $(1, 1, \ldots, 1)$.
  • Asymptotics: for fixed $p$, $(N - np)/\sqrt{n} \to \mathcal N(0, \Sigma)$ as $n \to \infty$, where $\Sigma_{jk} = p_j \delta_{jk} - p_j p_k$. This is the multivariate version of the de Moivre-Laplace theorem.
  • Special case $K = 2$: the Binomial. The Multinomial is what categorical classification produces under any softmax-style model; cross-entropy loss is its negative log-likelihood.
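
A quick numerical check of these facts — a minimal sketch using NumPy; the seed, sample size, and parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, np.array([0.5, 0.3, 0.2])

# Theoretical covariance: Sigma_jk = n p_k delta_jk - n p_j p_k
Sigma = n * (np.diag(p) - np.outer(p, p))

# Empirical covariance from simulated count vectors
samples = rng.multinomial(n, p, size=200_000)
Sigma_hat = np.cov(samples, rowvar=False)

print(np.round(Sigma, 3))            # off-diagonal entries are negative
print(np.round(Sigma_hat, 3))        # close to the theoretical matrix
print(np.linalg.matrix_rank(Sigma))  # K - 1 = 2: singular, null space (1,1,1)
```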

Multivariate-t

Definition

Multivariate-t Distribution

A random vector $X \in \mathbb{R}^d$ is multivariate-t with degrees of freedom $\nu > 0$, location $\mu \in \mathbb{R}^d$, and positive-definite scale matrix $\Sigma$ iff its density is $$f(x) = \frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |\Sigma|^{1/2}} \left[1 + \frac{1}{\nu}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right]^{-(\nu + d)/2}.$$

Equivalently: $X = \mu + Z / \sqrt{W/\nu}$, where $Z \sim \mathcal N(0, \Sigma)$ and $W \sim \chi^2_\nu$ are independent.

Key facts:

  • Mean exists iff $\nu > 1$ and equals $\mu$. Covariance exists iff $\nu > 2$ and equals $\frac{\nu}{\nu - 2} \Sigma$. So $\Sigma$ is the scale matrix, not the covariance.
  • Heavy tails: the $k$-th absolute moment exists iff $\nu > k$. Each coordinate is a univariate Student-$t_\nu$.
  • Limit cases: as $\nu \to \infty$, the multivariate-t converges to $\mathcal N(\mu, \Sigma)$. At $\nu = 1$ it is the multivariate Cauchy.
  • Bayesian role: the marginal posterior over the mean of a multivariate Gaussian with unknown covariance and an inverse-Wishart prior is exactly multivariate-t. This is the multivariate version of the t-test.
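
The scale-mixture representation above doubles as a sampler. The sketch below (assuming NumPy; the parameter values are illustrative) draws from $t_\nu(\mu, \Sigma)$ and confirms that the empirical covariance is $\frac{\nu}{\nu - 2} \Sigma$, not $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
nu, mu = 5.0, np.zeros(2)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

m = 500_000
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=m)  # Z ~ N(0, Sigma)
W = rng.chisquare(nu, size=m)                            # W ~ chi^2_nu
X = mu + Z / np.sqrt(W / nu)[:, None]                    # X ~ t_nu(mu, Sigma)

print(np.round(np.cov(X, rowvar=False), 2))  # approx nu/(nu-2) * Sigma
print(np.round(nu / (nu - 2) * Sigma, 2))    # = 1.667 * Sigma, not Sigma
```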

Dirichlet

Definition

Dirichlet Distribution

Let $\alpha = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_k > 0$ and $\alpha_0 = \sum_k \alpha_k$. The Dirichlet distribution on the simplex $\Delta^{K-1} = \{p : p_k \geq 0, \sum_k p_k = 1\}$ has density $$f(p) = \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1}.$$ The marginal of any $p_k$ is $\mathrm{Beta}(\alpha_k, \alpha_0 - \alpha_k)$.

Key facts:

  • Mean: $\mathbb{E}[p_k] = \alpha_k / \alpha_0$. The vector $\alpha$ encodes both the mean direction and a concentration $\alpha_0$: larger $\alpha_0$ makes the distribution tighter around the mean.
  • Variance: $\mathrm{Var}(p_k) = \alpha_k (\alpha_0 - \alpha_k) / (\alpha_0^2 (\alpha_0 + 1))$. The denominator $\alpha_0 + 1$ is what makes large $\alpha_0$ correspond to concentration.
  • Special case $K = 2$: the Beta. The Dirichlet is the natural multivariate generalization of the Beta.
  • Concentration regimes: $\alpha_k < 1$ for all $k$ pushes mass to the corners of the simplex (sparse, mode-seeking); $\alpha_k = 1$ for all $k$ gives the uniform distribution on the simplex; $\alpha_k > 1$ pushes mass toward the interior (smooth, averaging).
  • Topic models and mixtures: Dirichlet priors are the standard choice for mixture weights in latent Dirichlet allocation (LDA) and most Bayesian mixture models. An asymmetric Dirichlet allows different prior weights per topic.
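
The concentration regimes are easy to see by simulation. A minimal sketch (NumPy; the 0.9 dominance threshold and the $\alpha$ values are illustrative) counts how often one coordinate dominates a draw:

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 3, 100_000

for a in (0.1, 1.0, 10.0):
    p = rng.dirichlet(np.full(K, a), size=m)
    # Fraction of draws where a single coordinate dominates (> 0.9):
    sparse = (p.max(axis=1) > 0.9).mean()
    print(f"alpha_k = {a:>4}: max coordinate > 0.9 in {sparse:.1%} of draws")
```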
Theorem

Dirichlet-Multinomial Conjugacy

Statement

Let the prior be $p \sim \mathrm{Dir}(\alpha)$ and let $N \mid p \sim \mathrm{Mult}(n, p)$ be the observed counts. Then the posterior is $$p \mid N \sim \mathrm{Dir}(\alpha + N).$$ The Dirichlet is the conjugate prior to the Multinomial.

Intuition

The Dirichlet density is proportional to $\prod_k p_k^{\alpha_k - 1}$ and the Multinomial likelihood to $\prod_k p_k^{n_k}$ (dropping the multinomial coefficient, which does not depend on $p$). Multiplying gives $\prod_k p_k^{\alpha_k + n_k - 1}$, which is a Dirichlet kernel with concentration vector $\alpha + N$. The math turns prior counts into observed counts by addition.

Proof Sketch

Bayes' rule: $f(p \mid N) \propto f(N \mid p)\, f(p) \propto \prod_k p_k^{n_k} \cdot \prod_k p_k^{\alpha_k - 1} = \prod_k p_k^{\alpha_k + n_k - 1}$. This is the Dirichlet kernel with concentration $\alpha + N$. The normalizing constant adjusts automatically.

Why It Matters

Conjugacy gives closed-form posterior updates with no MCMC: prior $\alpha$ plus data $N$ produces a new Dirichlet. The interpretation of $\alpha$ as "pseudo-counts" makes the prior easy to elicit: $\alpha_k = 1$ for all $k$ is one pseudo-observation per category. This is what makes Dirichlet-Multinomial the default in topic models, $n$-gram smoothing, and Bayesian text categorization.
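
The update itself is one line of arithmetic. A minimal sketch (NumPy; the counts are illustrative):

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])    # one pseudo-observation per category
counts      = np.array([7, 2, 1])          # observed Multinomial counts, n = 10

alpha_post = alpha_prior + counts          # posterior is Dir(alpha + N)
post_mean = alpha_post / alpha_post.sum()  # E[p_k | N] = (alpha_k + n_k) / (alpha_0 + n)

print(alpha_post)               # [8. 3. 2.]
print(np.round(post_mean, 3))   # [0.615 0.231 0.154]
```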

Failure Mode

Conjugacy is a property of the family, not of arbitrary priors over the simplex. A prior that is uniform on a subset of the simplex (a truncated Dirichlet) loses conjugacy: the posterior is no longer Dirichlet.

Wishart and Inverse Wishart

Definition

Wishart Distribution

The Wishart distribution $W_d(V, \nu)$ is the distribution of $W = \sum_{i=1}^\nu X_i X_i^\top$, where $X_1, \ldots, X_\nu$ are i.i.d. $\mathcal N(0, V)$, $V$ is a positive-definite $d \times d$ matrix, and $\nu \geq d$. The density is $$f(W) = \frac{|W|^{(\nu - d - 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{tr}(V^{-1} W)\right)}{2^{\nu d / 2}\, |V|^{\nu/2}\, \Gamma_d(\nu/2)}$$ on the cone of positive-definite matrices, where $\Gamma_d$ is the multivariate gamma function.

Definition

Inverse Wishart Distribution

A matrix $\Sigma$ has the inverse Wishart distribution with scale $V$ and degrees of freedom $\nu$, written $\Sigma \sim W_d^{-1}(V, \nu)$, if $\Sigma^{-1} \sim W_d(V^{-1}, \nu)$. Equivalently, the inverse Wishart density is $$f(\Sigma) = \frac{|V|^{\nu/2}\, |\Sigma|^{-(\nu + d + 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{tr}(V \Sigma^{-1})\right)}{2^{\nu d / 2}\, \Gamma_d(\nu/2)}$$ on the positive-definite cone.

Key facts:

  • Sampling distribution: if $X_1, \ldots, X_n$ are i.i.d. $\mathcal N(\mu, \Sigma)$ and $S = \sum_i (X_i - \bar X)(X_i - \bar X)^\top$, then $S \sim W_d(\Sigma, n - 1)$. The Wishart is the multivariate version of the $\chi^2$ that appears in univariate sample-variance theory.
  • Conjugate prior: if the prior on the covariance is $\Sigma \sim W_d^{-1}(V_0, \nu_0)$, the data are $X_i \sim \mathcal N(\mu, \Sigma)$, and $\mu$ is known, the posterior is $\Sigma \mid X \sim W_d^{-1}\left(V_0 + S + n(\bar X - \mu)(\bar X - \mu)^\top,\ \nu_0 + n\right)$; note that $S + n(\bar X - \mu)(\bar X - \mu)^\top = \sum_i (X_i - \mu)(X_i - \mu)^\top$.
  • Mean: $\mathbb{E}[W] = \nu V$ for the Wishart; $\mathbb{E}[\Sigma] = V / (\nu - d - 1)$ for the inverse Wishart (when $\nu > d + 1$). The inverse Wishart has fat tails by design.
  • Univariate case $d = 1$: Wishart$(V, \nu)$ is $V \cdot \chi^2_\nu$; the inverse Wishart is the scaled inverse chi-squared.
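
A numerical check of the outer-product construction and the mean formula — a sketch assuming NumPy and SciPy are available; the dimension, degrees of freedom, and scale matrix are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, nu = 2, 10
V = np.array([[1.0, 0.3],
              [0.3, 0.5]])

# Wishart draws built directly as sums of outer products of N(0, V) vectors
m = 50_000
X = rng.multivariate_normal(np.zeros(d), V, size=(m, nu))  # shape (m, nu, d)
W = np.einsum('mni,mnj->mij', X, X)                        # W = sum_i x_i x_i^T

# The same distribution via SciPy's sampler; both should average to nu * V
W_scipy = stats.wishart.rvs(df=nu, scale=V, size=m, random_state=rng)

print(np.round(W.mean(axis=0), 2))        # approx nu * V
print(np.round(W_scipy.mean(axis=0), 2))  # approx nu * V
print(nu * V)                             # E[W] = nu V exactly
```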

The Wishart is mathematically clean but practically constraining for Bayesian work because ν\nu and VV are not separable: increasing the prior strength (degrees of freedom) also stretches the mean. The LKJ distribution is a modern alternative for correlation-matrix priors that decouples these concerns; it does not have a canonical page here yet.

Copulas

The other distributions on this page generalize specific univariate laws (Binomial → Multinomial, Normal → MVN). The copula construction is different: it lets you build a multivariate distribution from any chosen marginals plus any chosen dependence structure.

Definition

Copula

A copula is a CDF $C : [0, 1]^d \to [0, 1]$ whose marginals are uniform on $[0, 1]$. Given any continuous univariate marginals $F_1, \ldots, F_d$ and any copula $C$, the function $F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d))$ is a valid $d$-dimensional CDF with marginals $F_1, \ldots, F_d$.

Sklar's theorem says the reverse: every continuous multivariate CDF $F$ decomposes uniquely as a copula applied to its marginals. So the copula isolates the dependence structure from the marginals.

Standard copula families include the Gaussian (correlation matrix $R$; the copula of $\mathcal N(0, R)$), the Student-t (tail dependence in addition to correlation), and the Archimedean family (Clayton, Gumbel, Frank, each parameterized by a single dependence scalar).
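
The Gaussian-copula construction is mechanical: draw correlated Gaussians, push them through $\Phi$ to get copula uniforms, then apply the inverse CDFs of whatever marginals you want. A minimal sketch (NumPy/SciPy; the exponential and lognormal marginals are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])  # correlation matrix of the Gaussian copula

# 1. correlated standard normals; 2. Phi maps them to uniforms
#    (a draw from the Gaussian copula C); 3. inverse CDFs impose marginals
z = rng.multivariate_normal(np.zeros(2), R, size=100_000)
u = stats.norm.cdf(z)                     # uniform marginals, Gaussian dependence
x1 = stats.expon.ppf(u[:, 0], scale=2.0)  # Exponential(mean 2) marginal
x2 = stats.lognorm.ppf(u[:, 1], s=0.5)    # Lognormal marginal

# Marginals come out as specified; dependence survives the transforms
print(np.round(np.mean(x1), 2))                 # approx 2.0
print(np.round(stats.spearmanr(x1, x2)[0], 2))  # rank correlation > 0
```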

The full treatment is on the copulas page. The atlas entry above is the navigational pointer.

When to Reach For Each

| Goal | Pick |
| --- | --- |
| Categorical counts (text, classification) | Multinomial |
| Heavy-tailed multivariate noise | Multivariate-t |
| Distribution over probabilities (mixture weights, topic distributions) | Dirichlet |
| Bayesian prior on a covariance / precision matrix | Wishart (precision) or inverse Wishart (covariance) |
| Joint law with non-Gaussian marginals and a chosen dependence | Copula plus chosen marginals |

Common Confusions

Watch Out

Multinomial covariances are negative

The covariance matrix of a Multinomial is $\Sigma_{jk} = n p_k \delta_{jk} - n p_j p_k$. The off-diagonal entries are strictly negative because the constraint $\sum_k N_k = n$ forces counts to compete. This is not noise: the rank of the covariance matrix is $K - 1$, not $K$. Treating the Multinomial as having full-rank covariance breaks every subsequent computation.

Watch Out

Multivariate-t scale matrix is not the covariance

For $X \sim t_\nu(\mu, \Sigma)$, the covariance is $\frac{\nu}{\nu - 2} \Sigma$, not $\Sigma$. The parameter $\Sigma$ is the scale matrix of the underlying multivariate Gaussian in the scale-mixture representation; the covariance only exists for $\nu > 2$ and is inflated. Fit code that treats $\Sigma$ as the covariance silently underestimates variance.

Watch Out

Dirichlet concentration is the sum, not the individual entries

The "concentration" of a Dirichlet is the scalar α0=kαk\alpha_0 = \sum_k \alpha_k. The individual αk\alpha_k encode the direction (mean is α/α0\alpha / \alpha_0). Doubling all αk\alpha_k keeps the mean fixed and tightens the distribution; doubling one αk\alpha_k shifts the mean. People who say "I set Dirichlet concentration to 0.5" usually mean "αk=0.5\alpha_k = 0.5 for all kk", which is α0=0.5K\alpha_0 = 0.5 K.

Watch Out

Wishart degrees of freedom must be at least the dimension

The Wishart density is only well-defined for $\nu \geq d$. Below that, the matrix $W$ is singular almost surely and the density formula does not apply. Bayesian implementations that allow $\nu = d - 1$ or smaller are silently producing degenerate matrices.

Exercises

ExerciseCore

Problem

Let $(N_1, N_2, N_3) \sim \mathrm{Mult}(n = 10,\ p = (0.5, 0.3, 0.2))$. Compute $\mathrm{Cov}(N_1, N_2)$ and verify that the covariance matrix is rank-deficient.

ExerciseCore

Problem

Suppose the prior on a coin's bias is $p \sim \mathrm{Beta}(2, 2)$ (the $K = 2$ case of a Dirichlet, with $\alpha = (2, 2)$). You observe $7$ heads in $10$ flips. What is the posterior distribution of $p$?

ExerciseAdvanced

Problem

Let $X_1, \ldots, X_n$ be i.i.d. $\mathcal N(0, \Sigma)$ in $\mathbb{R}^d$ with $n > d$. Define $S = \sum_{i=1}^n X_i X_i^\top$. Show that for any non-zero $v \in \mathbb{R}^d$, the scalar $v^\top S v / v^\top \Sigma v$ has the $\chi^2_n$ distribution. Conclude that the diagonal entries of $S$ have scaled chi-squared marginals.

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 4 (multivariate distributions) and Chapter 5 (sample covariance)
  • Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012), Chapter 2.5 (Multinomial, Dirichlet) and Chapter 4.6 (Wishart, inverse Wishart)

Current:

  • Bishop, Pattern Recognition and Machine Learning (Springer, 2006), Chapter 2.2-2.3 (Dirichlet, Multinomial, conjugate priors)
  • Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapter 3 (multivariate normal model with Wishart prior)
  • Lewandowski, Kurowicka & Joe, "Generating Random Correlation Matrices Based on Vines and Extended Onion Method" (2009) — the LKJ correlation prior

Multivariate-t:

  • Kotz & Nadarajah, Multivariate t Distributions and Their Applications (Cambridge, 2004)

Last reviewed: May 13, 2026
