Foundations
Multivariate Distributions Atlas
A navigational index of the multivariate distributions used in machine learning and statistics beyond the multivariate Gaussian: Multinomial, Multivariate-t, Dirichlet, Wishart and inverse Wishart, and the copula construction for arbitrary marginals. Each entry gives the definition, parameterization, key facts, and where the dedicated page lives.
Prerequisites
Why This Matters
The multivariate Gaussian is the only multivariate distribution most ML curricula treat in depth. That leaves five other distributions without canonical landing pages on the site: the Multinomial (the vector generalization of the Binomial), the Multivariate-t (the heavy-tailed counterpart of the multivariate Gaussian), the Dirichlet (the canonical distribution over the probability simplex and the conjugate prior to the Multinomial), the Wishart and inverse Wishart (matrix-valued, the conjugate priors for Gaussian precision and covariance), and the copula construction (a recipe for building joint distributions with arbitrary marginals and a chosen dependence structure).
This page is an atlas, not a full treatment. Each entry gives the definition, the parameterization that matters in practice, the key result that makes it useful, and a pointer to the dedicated page where one exists.
Quick Index
| Distribution | Support | Parameters | Role in ML |
|---|---|---|---|
| Multinomial | Count vectors $x \in \mathbb{Z}_{\ge 0}^K$ with $\sum_j x_j = n$ | Trial count $n$, probability vector $p$ on the simplex | Categorical predictions, $n$-gram counts, classification loss |
| Multivariate-$t$ | $\mathbb{R}^d$ | Location $\mu$, scale matrix $\Sigma$, degrees of freedom $\nu$ | Heavy-tailed counterpart to MVN, Bayesian posterior over MVN mean with unknown covariance |
| Dirichlet | Probability simplex | Concentration vector $\alpha$ | Conjugate prior to Multinomial, topic-model priors, mixture weights |
| Wishart | Positive-definite matrices | Scale matrix $V$, degrees of freedom $\nu$ | Sampling distribution of $\sum_i x_i x_i^\top$ for MVN samples |
| Inverse Wishart | Positive-definite matrices | Scale matrix $\Psi$, degrees of freedom $\nu$ | Conjugate prior on MVN covariance |
| Copula | $[0,1]^d$ | The copula itself: a CDF with uniform marginals | Builds joint laws from arbitrary marginals and a chosen dependence structure |
Multinomial
Multinomial Distribution
Let $p = (p_1, \dots, p_K)$ with $p_j \ge 0$ and $\sum_{j=1}^K p_j = 1$. The Multinomial distribution with $n$ trials and category probabilities $p$ assigns to each count vector $x = (x_1, \dots, x_K)$ with $\sum_j x_j = n$ the probability
$$P(X = x) = \frac{n!}{x_1! \cdots x_K!} \prod_{j=1}^K p_j^{x_j}.$$
The marginal of any single $X_j$ is $\operatorname{Binomial}(n, p_j)$.
Key facts:
- Mean: $\mathbb{E}[X_j] = n p_j$; variance: $\operatorname{Var}(X_j) = n p_j (1 - p_j)$; covariance: $\operatorname{Cov}(X_j, X_k) = -n p_j p_k$ for $j \neq k$.
- Covariances are negative because counts add to $n$: more in category $j$ forces fewer elsewhere. The covariance matrix is therefore singular, with a one-dimensional null space spanned by the all-ones vector.
- Asymptotics: for fixed $K$, $(X - np)/\sqrt{n} \xrightarrow{d} \mathcal{N}(0, \Sigma)$ as $n \to \infty$, where $\Sigma = \operatorname{diag}(p) - p p^\top$. This is the multivariate version of the de Moivre-Laplace theorem.
- Special case $K = 2$: Binomial. The Multinomial is what categorical classification produces under any softmax-style model; cross-entropy loss is its negative log-likelihood.
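These moment formulas are easy to check by simulation. Below is a minimal sketch assuming NumPy; the values of $n$, $p$, and the sample count are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.5, 0.3, 0.2])            # illustrative parameters

samples = rng.multinomial(n, p, size=200_000)    # shape (200000, 3)

# Empirical moments should approach n*p and n*(diag(p) - p p^T).
print(samples.mean(axis=0))                      # ~ [50, 30, 20]
print(np.cov(samples, rowvar=False))
theory_cov = n * (np.diag(p) - np.outer(p, p))
print(theory_cov)

# The covariance annihilates the all-ones vector, so its rank is K - 1 = 2.
print(theory_cov @ np.ones(3))                   # ~ [0, 0, 0]
print(np.linalg.matrix_rank(theory_cov))         # 2, not 3
```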
Multivariate-t
Multivariate-t Distribution
A random vector $X \in \mathbb{R}^d$ is multivariate-$t$ with degrees of freedom $\nu > 0$, location $\mu$, and positive-definite scale matrix $\Sigma$ iff its density is
$$f(x) = \frac{\Gamma\!\left(\frac{\nu + d}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |\Sigma|^{1/2}} \left(1 + \frac{(x - \mu)^\top \Sigma^{-1} (x - \mu)}{\nu}\right)^{-(\nu + d)/2}.$$
Equivalently: $X = \mu + Z / \sqrt{W / \nu}$, where $Z \sim \mathcal{N}(0, \Sigma)$ and $W \sim \chi^2_\nu$ are independent.
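The scale-mixture representation doubles as a sampler. A minimal sketch assuming NumPy; the helper name `sample_mvt` and the example parameters are illustrative, not from the text.

```python
import numpy as np

def sample_mvt(mu, Sigma, nu, size, rng):
    """Draw multivariate-t samples as x = mu + z / sqrt(w / nu),
    with z ~ N(0, Sigma) and w ~ chi^2_nu independent."""
    z = rng.multivariate_normal(np.zeros(len(mu)), Sigma, size=size)
    w = rng.chisquare(nu, size=size)
    return mu + z / np.sqrt(w / nu)[:, None]

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])       # scale matrix, not covariance
x = sample_mvt(mu, Sigma, nu=5.0, size=200_000, rng=rng)
print(x.mean(axis=0))                            # ~ mu, since nu > 1
```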
Key facts:
- Mean exists iff $\nu > 1$ and equals $\mu$. Covariance exists iff $\nu > 2$ and equals $\frac{\nu}{\nu - 2} \Sigma$. So $\Sigma$ is the scale matrix, not the covariance.
- Heavy tails: the $k$-th absolute moment exists iff $k < \nu$. Each coordinate is a univariate Student-$t$ with $\nu$ degrees of freedom.
- Limit cases: as $\nu \to \infty$, the multivariate-$t$ converges to $\mathcal{N}(\mu, \Sigma)$. At $\nu = 1$ it is the multivariate Cauchy.
- Bayesian role: the marginal posterior over the mean of a multivariate Gaussian with unknown covariance and an inverse-Wishart prior is exactly multivariate-$t$. This is the multivariate analogue of how the Student-$t$ arises in the univariate $t$-test.
Dirichlet
Dirichlet Distribution
Let $\alpha = (\alpha_1, \dots, \alpha_K)$ with $\alpha_j > 0$ and $\alpha_0 = \sum_{j=1}^K \alpha_j$. The Dirichlet distribution on the simplex $\{\theta : \theta_j \ge 0,\ \sum_j \theta_j = 1\}$ has density
$$f(\theta) = \frac{\Gamma(\alpha_0)}{\prod_{j=1}^K \Gamma(\alpha_j)} \prod_{j=1}^K \theta_j^{\alpha_j - 1}.$$
The marginal of any $\theta_j$ is $\operatorname{Beta}(\alpha_j, \alpha_0 - \alpha_j)$.
Key facts:
- Mean: $\mathbb{E}[\theta_j] = \alpha_j / \alpha_0$. The vector $\alpha$ encodes both the mean direction and a concentration $\alpha_0 = \sum_j \alpha_j$: larger $\alpha_0$ makes the distribution tighter around the mean.
- Variance: $\operatorname{Var}(\theta_j) = \frac{\alpha_j (\alpha_0 - \alpha_j)}{\alpha_0^2 (\alpha_0 + 1)}$. The $\alpha_0 + 1$ factor in the denominator is what makes large $\alpha_0$ correspond to concentration.
- Special case $K = 2$: Beta. The Dirichlet is the natural multivariate generalization of the Beta.
- Concentration regimes: $\alpha_j < 1$ for all $j$ pushes mass to the corners of the simplex (sparse, mode-seeking); $\alpha_j = 1$ for all $j$ gives the uniform distribution on the simplex; $\alpha_j > 1$ pushes mass toward the interior (smooth, averaging).
- Topic models and mixtures: Dirichlet priors are the standard choice for mixture weights in latent Dirichlet allocation (LDA) and most Bayesian mixture models. The asymmetric Dirichlet allows different prior weights per topic.
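The concentration regimes are easy to see by simulation. A small sketch assuming NumPy; the $\alpha$ values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
for a in (0.1, 1.0, 10.0):                           # sparse, uniform, interior
    theta = rng.dirichlet(np.full(K, a), size=100_000)
    near_corner = (theta.max(axis=1) > 0.95).mean()  # mass near a vertex
    print(f"alpha_j = {a:>4}: mean = {theta.mean(axis=0).round(3)}, "
          f"P(max theta_j > 0.95) = {near_corner:.3f}")
```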
Dirichlet-Multinomial Conjugacy
Statement
Let the prior be $\theta \sim \operatorname{Dirichlet}(\alpha)$ and let $x = (x_1, \dots, x_K)$ be the observed counts from $n$ Multinomial trials. Then the posterior is
$$\theta \mid x \sim \operatorname{Dirichlet}(\alpha + x).$$
The Dirichlet is the conjugate prior to the Multinomial.
Intuition
The Dirichlet density is proportional to $\prod_j \theta_j^{\alpha_j - 1}$ and the Multinomial likelihood is proportional to $\prod_j \theta_j^{x_j}$ (dropping the multinomial coefficient, which does not depend on $\theta$). Multiplying gives $\prod_j \theta_j^{\alpha_j + x_j - 1}$, which is a Dirichlet with concentration vector $\alpha + x$. The math turns prior counts into observed counts by addition.
Proof Sketch
Bayes' rule: $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \prod_j \theta_j^{x_j} \prod_j \theta_j^{\alpha_j - 1} = \prod_j \theta_j^{\alpha_j + x_j - 1}$. This is the Dirichlet kernel with concentration $\alpha + x$. The normalizing constant adjusts automatically.
Why It Matters
Conjugacy gives closed-form posterior updates with no MCMC: prior plus data produces a new Dirichlet. The interpretation of $\alpha$ as "pseudo-counts" makes the prior easy to elicit: $\alpha_j = 1$ for all $j$ is one pseudo-observation per category. This is what makes Dirichlet-Multinomial the default in topic models, $n$-gram smoothing, and Bayesian text categorization.
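In code the conjugate update is a single addition. A minimal sketch assuming NumPy; the helper name `dirichlet_posterior` and the example counts are illustrative, not from the text.

```python
import numpy as np

def dirichlet_posterior(alpha_prior, counts):
    """Dirichlet(alpha) prior + Multinomial counts -> Dirichlet(alpha + counts)."""
    return np.asarray(alpha_prior, dtype=float) + np.asarray(counts, dtype=float)

alpha = np.ones(3)                     # one pseudo-observation per category
counts = np.array([7, 2, 1])           # illustrative observed counts
alpha_post = dirichlet_posterior(alpha, counts)

# Posterior mean: (alpha_j + x_j) / (alpha_0 + n).
print(alpha_post)                      # [8. 3. 2.]
print(alpha_post / alpha_post.sum())   # ~ [0.615, 0.231, 0.154]
```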
Failure Mode
Conjugacy is a property of the family, not of arbitrary priors over the simplex. A prior that is uniform on a subset of the simplex (a truncated Dirichlet) loses conjugacy: the posterior is no longer Dirichlet.
Wishart and Inverse Wishart
Wishart Distribution
The Wishart distribution $\mathcal{W}_d(V, \nu)$ is the distribution of $S = \sum_{i=1}^{\nu} x_i x_i^\top$ where $x_1, \dots, x_\nu$ are i.i.d. $\mathcal{N}_d(0, V)$, with $V$ a positive-definite $d \times d$ matrix and $\nu \ge d$. The density (when $\nu > d - 1$) is
$$f(S) = \frac{|S|^{(\nu - d - 1)/2} \exp\!\left(-\tfrac{1}{2} \operatorname{tr}(V^{-1} S)\right)}{2^{\nu d / 2}\, |V|^{\nu/2}\, \Gamma_d(\nu/2)}$$
on the cone of positive-definite matrices, where $\Gamma_d$ is the multivariate gamma function.
Inverse Wishart Distribution
A matrix $\Sigma$ has the inverse Wishart distribution $\mathcal{W}_d^{-1}(\Psi, \nu)$ with scale $\Psi$ and degrees of freedom $\nu$ if $\Sigma^{-1} \sim \mathcal{W}_d(\Psi^{-1}, \nu)$. Equivalently, the inverse Wishart density is
$$f(\Sigma) = \frac{|\Psi|^{\nu/2}}{2^{\nu d/2}\, \Gamma_d(\nu/2)}\, |\Sigma|^{-(\nu + d + 1)/2} \exp\!\left(-\tfrac{1}{2} \operatorname{tr}(\Psi \Sigma^{-1})\right)$$
on the positive-definite cone.
Key facts:
- Sampling distribution: if $x_1, \dots, x_n$ are i.i.d. $\mathcal{N}_d(\mu, \Sigma)$ and $\bar{x}$ is the sample mean, then $\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top \sim \mathcal{W}_d(\Sigma, n - 1)$. The Wishart is the multivariate version of the $\chi^2$ that appears in univariate sample-variance theory.
- Conjugate prior: if the prior on the covariance is $\Sigma \sim \mathcal{W}_d^{-1}(\Psi, \nu)$ and the data are $x_1, \dots, x_n \sim \mathcal{N}_d(\mu, \Sigma)$, the posterior is $\Sigma \mid x \sim \mathcal{W}_d^{-1}\!\big(\Psi + \sum_i (x_i - \mu)(x_i - \mu)^\top,\; \nu + n\big)$ when $\mu$ is known.
- Mean: $\nu V$ for the Wishart; $\Psi / (\nu - d - 1)$ for the inverse Wishart (when $\nu > d + 1$). The inverse Wishart has fat tails by design.
- Univariate case $d = 1$: the Wishart is $\sigma^2 \chi^2_\nu$; the inverse Wishart is the scaled inverse chi-squared.
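The sum-of-outer-products characterization can be checked against the mean formula $\nu V$. A sketch assuming NumPy and SciPy; $V$, $\nu$, and the sample counts are illustrative:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(3)
d, nu = 2, 10
V = np.array([[1.0, 0.3], [0.3, 0.5]])

# S = sum_i x_i x_i^T with x_i i.i.d. N(0, V): the mean should be nu * V.
draws = rng.multivariate_normal(np.zeros(d), V, size=(50_000, nu))  # (50000, nu, d)
S = np.einsum('snd,sne->sde', draws, draws)      # one d x d Wishart draw per row
print(S.mean(axis=0))                            # ~ nu * V
print(nu * V)

# scipy samples the Wishart directly; the two agree in distribution.
S_direct = wishart.rvs(df=nu, scale=V, size=50_000, random_state=rng)
print(S_direct.mean(axis=0))                     # ~ nu * V again
```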
The Wishart is mathematically clean but practically constraining for Bayesian work because $\Psi$ and $\nu$ are not separable: increasing the prior strength (degrees of freedom) also moves the implied mean $\Psi / (\nu - d - 1)$. The LKJ distribution is a modern alternative for correlation-matrix priors that decouples these concerns; it does not have a canonical page here yet.
Copulas
The other distributions on this page generalize specific univariate laws (Binomial → Multinomial, Normal → MVN). The copula construction is different: it lets you build a multivariate distribution from any chosen marginals plus any chosen dependence structure.
Copula
A copula is a CDF $C : [0, 1]^d \to [0, 1]$ whose marginals are uniform on $[0, 1]$. Given any continuous univariate marginal CDFs $F_1, \dots, F_d$ and any copula $C$, the function $F(x_1, \dots, x_d) = C(F_1(x_1), \dots, F_d(x_d))$ is a valid $d$-dimensional CDF with marginals $F_1, \dots, F_d$.
Sklar's theorem says the reverse: every continuous multivariate CDF decomposes uniquely as a copula applied to its marginals. So the copula isolates the dependence structure from the marginals.
Standard copula families include the Gaussian (correlation matrix $R$; the copula of $\mathcal{N}(0, R)$), the Student-$t$ (tail dependence in addition to correlation), and the Archimedean family (Clayton, Gumbel, Frank, each parameterized by a single scalar).
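The construction itself is three mechanical steps: sample the copula, convert to uniforms, then apply inverse marginal CDFs. A sketch of the Gaussian-copula case assuming SciPy's `norm` and `expon`; the correlation matrix and the Exponential marginals are illustrative choices:

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(4)
R = np.array([[1.0, 0.7], [0.7, 1.0]])           # copula correlation matrix

z = rng.multivariate_normal(np.zeros(2), R, size=100_000)  # Gaussian dependence
u = norm.cdf(z)                                  # uniform marginals: the copula sample
x = expon.ppf(u)                                 # Exponential marginals, same dependence

print(np.corrcoef(u, rowvar=False))              # dependence among the uniforms
print(np.corrcoef(x, rowvar=False))              # survives the marginal transform
```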
The full treatment is on the copulas page. The atlas entry above is the navigational pointer.
When to Reach For Each
| Goal | Pick |
|---|---|
| Categorical counts (text, classification) | Multinomial |
| Heavy-tailed multivariate noise | Multivariate-t |
| Distribution over probabilities (mixture weights, topic distributions) | Dirichlet |
| Bayesian prior on a covariance / precision matrix | Wishart (precision) or inverse Wishart (covariance) |
| Joint law with non-Gaussian marginals and a chosen dependence | Copula plus chosen marginals |
Common Confusions
Multinomial covariances are negative
The covariance matrix of a Multinomial is $n\,(\operatorname{diag}(p) - p p^\top)$. The off-diagonal entries are strictly negative because the constraint $\sum_j X_j = n$ forces counts to compete. This is not noise: the rank of the covariance matrix is $K - 1$, not $K$. Treating the Multinomial as having full-rank covariance breaks every subsequent computation.
Multivariate-t scale matrix is not the covariance
For $\nu > 2$, the covariance is $\frac{\nu}{\nu - 2} \Sigma$, not $\Sigma$. The parameter $\Sigma$ is the scale matrix of the underlying multivariate Gaussian in the scale-mixture representation; the covariance only exists for $\nu > 2$ and is inflated by the factor $\frac{\nu}{\nu - 2}$. Fit code that treats $\Sigma$ as the covariance silently underestimates variance.
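A quick empirical check of the inflation factor, assuming NumPy and reusing the scale-mixture construction from the Multivariate-t entry; $\nu$ and $\Sigma$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
nu = 5.0
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])

z = rng.multivariate_normal(np.zeros(2), Sigma, size=500_000)
w = rng.chisquare(nu, size=500_000)
x = z / np.sqrt(w / nu)[:, None]                 # multivariate-t, location 0

print(np.cov(x, rowvar=False))                   # ~ (nu / (nu - 2)) * Sigma
print(nu / (nu - 2) * Sigma)                     # not Sigma itself
```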
Dirichlet concentration is the sum, not the individual entries
The "concentration" of a Dirichlet is the scalar $\alpha_0 = \sum_j \alpha_j$. The individual $\alpha_j$ encode the direction (the mean is $\alpha_j / \alpha_0$). Doubling all $\alpha_j$ keeps the mean fixed and tightens the distribution; doubling one $\alpha_j$ shifts the mean. People who say "I set Dirichlet concentration to 0.5" usually mean "$\alpha_j = 0.5$ for all $j$", which is $\alpha_0 = K/2$.
Wishart degrees of freedom must be at least the dimension
The Wishart density is only well-defined for $\nu > d - 1$. Below that, the matrix is singular almost surely and the density formula does not apply. Bayesian implementations that allow $\nu = d - 1$ or smaller are silently producing degenerate matrices.
Exercises
Problem
Let $X \sim \operatorname{Multinomial}(n, p)$ for a trial count $n$ and probability vector $p$ of your choosing. Compute $\operatorname{Cov}(X)$ and verify that the covariance matrix is rank-deficient.
Problem
Suppose the prior on a coin's bias $\theta$ is $\operatorname{Beta}(a, b)$ (the $K = 2$ case of a Dirichlet, with $\alpha = (a, b)$). You observe $h$ heads in $n$ flips. What is the posterior distribution of $\theta$?
Problem
Let $x_1, \dots, x_n$ be i.i.d. in $\mathbb{R}^d$ with $x_i \sim \mathcal{N}(0, \Sigma)$. Define $S = \sum_{i=1}^n x_i x_i^\top$. Show that for any non-zero $a \in \mathbb{R}^d$, the scalar $a^\top S a$ has the $(a^\top \Sigma a)\, \chi^2_n$ distribution. Conclude that the diagonal entries of $S$ have scaled chi-squared marginals.
References
Canonical:
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 4 (multivariate distributions) and Chapter 5 (sample covariance)
- Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012), Chapter 2.5 (Multinomial, Dirichlet) and Chapter 4.6 (Wishart, inverse Wishart)
Current:
- Bishop, Pattern Recognition and Machine Learning (Springer, 2006), Chapter 2.2-2.3 (Dirichlet, Multinomial, conjugate priors)
- Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapter 3 (multivariate normal model with Wishart prior)
- Lewandowski, Kurowicka & Joe, "Generating Random Correlation Matrices Based on Vines and Extended Onion Method" (2009) — the LKJ correlation prior
Multivariate-t:
- Kotz & Nadarajah, Multivariate t Distributions and Their Applications (Cambridge, 2004)
Next Topics
- Multivariate Normal: the central member of the multivariate family, full treatment
- Copulas: arbitrary marginals plus chosen dependence
- Scale, location, and shape parameters: how the parameter roles extend from scalar to vector and matrix distributions
- Moment generating functions: the multivariate MGF section connects to the Wishart and the Multivariate-t
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
Required prerequisites (3)
- Common Probability Distributions (layer 0A · tier 1)
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
Derived topics (0)
- No published topic currently declares this as a prerequisite.