Statistical Estimation
The Multivariate Normal Distribution
The multivariate Gaussian as the joint distribution of $d$ correlated random variables: density derivation from standard normals via affine maps, the completing-the-square recipe, Schur-complement marginals and conditionals, the MGF and characteristic function, and the algebraic identities that power every Bayesian Gaussian derivation downstream.
Prerequisites
Why This Matters
The multivariate normal is the joint distribution that every Bayesian Gaussian calculation on this site reduces to. Conditioning on observed data in Bayesian linear regression, computing the posterior in a Gaussian process, deriving the Kalman filter, and explaining why ridge regression has a Bayesian interpretation all rely on three pieces of algebra: the density form $p(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\!\big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\big)$, the completing-the-square move that lets you read off the mean and covariance of a posterior from its log-density, and the Schur-complement formula for the conditional distribution of one block given another.
Most textbooks state these facts. This page derives them.
Mental Model
A standard $d$-dimensional normal $Z \sim \mathcal N(0, I_d)$ is a $d$-tuple of independent $\mathcal N(0,1)$ variables. Apply an affine map $X = AZ + \mu$ with $A$ invertible and $\mu \in \mathbb R^d$: every component of $X$ is a linear combination of independent standard normals, so $X$ is Gaussian in every direction, with mean $\mu$ and covariance $AA^\top$. That is the multivariate normal.
This construction tells you what to expect:
- Affine images of Gaussians are Gaussian. Closed under linear maps and translation.
- Marginals are Gaussian. Project out a block of coordinates: the projection is an affine map.
- Conditionals are Gaussian. Fix a block and look at the rest: conditioning on a Gaussian is a linear operation in disguise (this is what the Schur complement makes precise).
- Independence equals zero covariance (a fact special to Gaussians, false in general).
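A minimal numeric sketch of this construction (the particular $A$ and $\mu$ below are illustrative, not from the text): push standard normals through an affine map and check that the empirical mean and covariance match $\mu$ and $AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative affine map: any invertible A and any mu work.
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
mu = np.array([1.0, -1.0])

Z = rng.standard_normal((100_000, 2))   # rows are iid N(0, I_2) draws
X = Z @ A.T + mu                        # affine image: x = A z + mu, applied row-wise

print("empirical mean", X.mean(axis=0))            # ~ mu
print("empirical cov\n", np.cov(X, rowvar=False))  # ~ A A^T
print("target Sigma = A A^T\n", A @ A.T)
```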
Formal Setup
Multivariate Normal
A random vector $X \in \mathbb R^d$ has the multivariate normal distribution $\mathcal N(\mu, \Sigma)$ with mean $\mu \in \mathbb R^d$ and positive definite covariance $\Sigma \in \mathbb R^{d\times d}$ if it has density
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big).$$
When $\Sigma$ is only positive semi-definite (rank $r < d$), the distribution is degenerate: $X$ lives in an $r$-dimensional affine subspace and has no density with respect to Lebesgue measure on $\mathbb R^d$. The degenerate case is handled cleanly by the characteristic function (below).
The exponent $(x-\mu)^\top\Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance from $x$ to $\mu$. Level sets are ellipsoids whose axes are the eigenvectors of $\Sigma$, scaled by the square roots of its eigenvalues.
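A small check of the level-set geometry, with an illustrative $\mu$ and $\Sigma$: the points $\mu + \sqrt{\lambda_i}\,v_i$ along the eigenvectors of $\Sigma$ all lie at squared Mahalanobis distance 1.

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are eigenvectors

def mahalanobis_sq(x, mu, Sigma):
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

# Each ellipsoid semi-axis endpoint mu + sqrt(lambda_i) v_i lies on the unit level set.
for lam, v in zip(eigvals, eigvecs.T):
    x = mu + np.sqrt(lam) * v
    print(lam, mahalanobis_sq(x, mu, Sigma))   # second value prints ~1.0 for both axes
```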
Deriving the Density
The cleanest derivation is by affine change of variables from the standard normal.
Density of an Affine Image of a Standard Normal
Statement
Let $Z \sim \mathcal N(0, I_d)$ have density $\phi(z) = (2\pi)^{-d/2}\exp(-\tfrac12 z^\top z)$. Define $X = AZ + \mu$ with $A$ invertible. Then $X$ has density
$$p_X(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big),$$
where $\Sigma = AA^\top$. So $X \sim \mathcal N(\mu, \Sigma)$.
Intuition
The standard normal's density has a quadratic exponent $-\tfrac12\,z^\top z$. An invertible linear map stretches the level sets into ellipsoids and rescales volume by $|\det A|$, which produces the $\Sigma^{-1}$ in the exponent and the $|\Sigma|^{1/2}$ in the normalizer.
Proof Sketch
The inverse map is $z = A^{-1}(x - \mu)$, with Jacobian $A^{-1}$ and $|\det A^{-1}| = 1/|\det A|$. The change-of-variables formula gives
$$p_X(x) = \frac{\phi\!\big(A^{-1}(x-\mu)\big)}{|\det A|} = \frac{1}{(2\pi)^{d/2}\,|\det A|}\exp\!\Big(-\tfrac12\,(x-\mu)^\top A^{-\top}A^{-1}(x-\mu)\Big).$$
Now $A^{-\top}A^{-1} = (AA^\top)^{-1}$ using the inverse of a product. Define $\Sigma = AA^\top$, so $\Sigma^{-1} = A^{-\top}A^{-1}$ and $|\Sigma|^{1/2} = |\det A|$. Substituting yields the stated form.
Why It Matters
This is the constructive definition: every multivariate normal is an affine image of $\mathcal N(0, I_d)$. To sample $X \sim \mathcal N(\mu, \Sigma)$ in code you compute the Cholesky factor $\Sigma = LL^\top$, draw $z \sim \mathcal N(0, I_d)$, and return $x = \mu + Lz$. The same factorization gives a clean route to log-densities: $\log|\Sigma| = 2\sum_i \log L_{ii}$, and solving $Lv = x - \mu$ by forward substitution gives the Mahalanobis distance $\|v\|^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu)$ without inverting $\Sigma$ explicitly.
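A sketch of both uses of the Cholesky factor, under an arbitrary illustrative $\mu$ and $\Sigma$; the hand-rolled log-density is compared against scipy.stats.multivariate_normal as a sanity check.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

L = cholesky(Sigma, lower=True)              # Sigma = L L^T

def sample(n):
    Z = rng.standard_normal((n, len(mu)))
    return mu + Z @ L.T                      # x = mu + L z, applied row-wise

def logpdf(x):
    v = solve_triangular(L, x - mu, lower=True)    # L v = x - mu (forward substitution)
    maha = v @ v                                   # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log|Sigma|
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

x = sample(1)[0]
print(logpdf(x))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))   # should agree
```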
Failure Mode
If $A$ is singular, the affine map collapses onto a lower-dimensional subspace and the density formula breaks (one cannot divide by $|\Sigma|^{1/2} = |\det A| = 0$). The distribution still exists, but lives on a flat in $\mathbb R^d$ and must be described by its characteristic function, not its Lebesgue density.
Completing the Square: The Canonical Algebraic Move
Almost every Bayesian Gaussian derivation on this site reduces to one move: take a log-density that is a quadratic in $x$ plus a linear term plus a constant, and reorganize it as $-\tfrac12\,(x-\mu)^\top A\,(x-\mu) + \text{const}$ to read off the mean $\mu$ and the precision $A$.
Completing the Square (Multivariate)
Suppose
$$f(x) = -\tfrac12\, x^\top A x + b^\top x + c$$
with $A$ symmetric positive definite, $b \in \mathbb R^d$, and $c \in \mathbb R$. Then
$$f(x) = -\tfrac12\,(x - A^{-1}b)^\top A\,(x - A^{-1}b) + \tfrac12\, b^\top A^{-1} b + c.$$
So if $\exp f$ is a probability density (up to normalization) in $x$, then $x$ is Gaussian with mean $A^{-1}b$ and precision $A$ (equivalently, covariance $A^{-1}$). The constant terms determine the normalization, which usually does not matter for identifying the distribution.
The recipe in three lines:
- Collect every term involving $x$ from the log-density.
- Match coefficients: $A$ is the matrix in the quadratic term $-\tfrac12\, x^\top A x$ (the quadratic-form coefficient), and $b$ is the vector in the linear term $b^\top x$ (the linear-term vector).
- Read off $\mu = A^{-1}b$ and $\Sigma = A^{-1}$.
The proof is one line: expand $-\tfrac12\,(x-\mu)^\top A\,(x-\mu) = -\tfrac12\, x^\top A x + \mu^\top A x - \tfrac12\, \mu^\top A \mu$. Setting $\mu = A^{-1}b$ matches the linear term $b^\top x$. The constant $\tfrac12\, b^\top A^{-1} b$ is what falls out.
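A numeric spot-check of the identity, with a randomly generated positive definite $A$ and vector $b$ (values are arbitrary): both sides agree at random points up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # symmetric positive definite precision
b = rng.standard_normal(d)

mu = np.linalg.solve(A, b)           # mean = A^{-1} b
const = 0.5 * b @ mu                 # 1/2 b^T A^{-1} b

for _ in range(3):
    x = rng.standard_normal(d)
    lhs = -0.5 * x @ A @ x + b @ x
    rhs = -0.5 * (x - mu) @ A @ (x - mu) + const
    print(np.isclose(lhs, rhs))      # True each time
```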
A worked completing-the-square
Suppose
$$\log f(x) = -\tfrac12\, x^\top \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} x + \begin{pmatrix} 1 \\ 4 \end{pmatrix}^{\!\top} x + \text{const}.$$
The precision is $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, so $A^{-1} = \tfrac13\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, and the linear-coefficient vector is $b = (1, 4)^\top$. The mean is
$$\mu = A^{-1}b = \tfrac13\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} -2/3 \\ 7/3 \end{pmatrix}.$$
The covariance is $\Sigma = A^{-1}$. The constant $\tfrac12\, b^\top A^{-1} b$ is what survives outside the squared term.
This move is the source of every closed-form Bayesian Gaussian update: see conjugate priors, Bayesian linear regression, and the GP posterior.
Marginals and Conditionals via the Schur Complement
Partition $x \sim \mathcal N(\mu, \Sigma)$ into two blocks of sizes $d_1$ and $d_2$:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
with $\Sigma_{21} = \Sigma_{12}^\top$.
Gaussian Marginals and Conditionals
Statement
The marginal and conditional distributions are both Gaussian:
$$x_1 \sim \mathcal N(\mu_1, \Sigma_{11}), \qquad x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big).$$
The conditional covariance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$.
Intuition
The conditional mean is a linear function of the observation $x_2$: start at $\mu_1$ and shift in the direction $\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$, where $\Sigma_{12}\Sigma_{22}^{-1}$ is the regression coefficient of $x_1$ on $x_2$. The conditional variance is smaller than the marginal variance (in the Loewner order) by exactly the amount of variance explained by $x_2$ — this is the matrix version of "$\operatorname{Var}(X_1 \mid X_2) \le \operatorname{Var}(X_1)$, with equality iff $X_1$ and $X_2$ are independent."
Proof Sketch
Use the block-inverse identity for $\Sigma^{-1}$. Writing $S = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ for the Schur complement,
$$\Sigma^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}S^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}S^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix}.$$
Expand the joint log-density using this block form. The terms involving $x_1$ (with $x_2$ fixed) collect into
$$-\tfrac12\, x_1^\top S^{-1} x_1 + \big(S^{-1}\mu_1 + S^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\big)^{\!\top} x_1 + \text{const}.$$
Apply the completing-the-square recipe with $A = S^{-1}$ and $b = S^{-1}\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\big)$: the mean is $A^{-1}b = \mu_1$, shifted by $\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$, and the covariance is $A^{-1} = S$. Combined, the conditional is $x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ S\big)$.
The marginal of $x_1$ follows from the affine-image theorem (or from the MGF identity below, which does not require invertibility): $x_1 = \begin{pmatrix} I & 0 \end{pmatrix} x$ is an affine image of a Gaussian, with mean $\mu_1$ and covariance $\Sigma_{11}$.
Why It Matters
This is the identity behind Gaussian inference. The Kalman filter's measurement update, the GP posterior, and the Bayesian linear regression predictive distribution are all special cases of this conditioning formula. The Schur complement is also what guarantees joint positive definiteness ($\Sigma_{22} \succ 0$ and $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \succ 0$ together imply $\Sigma \succ 0$) and quantifies how much $x_2$ tells you about $x_1$.
Failure Mode
The conditional formula requires $\Sigma_{22}$ invertible. If $\Sigma_{22}$ is singular, $x_2$ has linearly dependent components and the conditioning event either has probability one in some direction (collapsing the conditional dimension) or contradicts the data (probability zero). The pseudo-inverse $\Sigma_{22}^{+}$ replaces $\Sigma_{22}^{-1}$ in the degenerate case, and the conditional lives on a coset of the null space of $\Sigma_{22}$.
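A sketch of the conditioning formula as a small function (the block sizes and joint below are illustrative). The final line checks the block-inverse fact used in the proof: the conditional precision equals the corresponding block of $\Sigma^{-1}$.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_obs, x_obs):
    """Return mean and covariance of x_free | x_obs for a joint N(mu, Sigma)."""
    d = len(mu)
    idx_free = np.setdiff1d(np.arange(d), idx_obs)
    S11 = Sigma[np.ix_(idx_free, idx_free)]
    S12 = Sigma[np.ix_(idx_free, idx_obs)]
    S22 = Sigma[np.ix_(idx_obs, idx_obs)]
    gain = S12 @ np.linalg.inv(S22)                 # regression coefficient of x1 on x2
    cond_mean = mu[idx_free] + gain @ (x_obs - mu[idx_obs])
    cond_cov = S11 - gain @ S12.T                   # Schur complement of S22 in Sigma
    return cond_mean, cond_cov

# Illustrative joint over (x1, x2) with x2 the last two coordinates.
mu = np.array([0.0, 1.0, -1.0, 2.0])
B = np.random.default_rng(3).standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)

cond_mean, cond_cov = condition_gaussian(mu, Sigma,
                                         idx_obs=np.array([2, 3]),
                                         x_obs=np.array([0.0, 0.0]))

# Consistency check: conditional precision = upper-left block of Sigma^{-1}.
print(np.allclose(np.linalg.inv(cond_cov), np.linalg.inv(Sigma)[:2, :2]))   # True
```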
MGF and Characteristic Function: The Alternative Path
The completing-the-square route is algebraic. A second route uses the moment generating function (MGF) and characteristic function (CF), which often gives shorter proofs of distributional facts and handles the degenerate case cleanly.
For $X \sim \mathcal N(\mu, \Sigma)$:
$$M_X(t) = \mathbb E\big[e^{t^\top X}\big] = \exp\!\big(t^\top\mu + \tfrac12\, t^\top \Sigma\, t\big), \qquad \varphi_X(t) = \mathbb E\big[e^{i t^\top X}\big] = \exp\!\big(i\, t^\top\mu - \tfrac12\, t^\top \Sigma\, t\big).$$
The CF is well-defined and identifies the distribution even when $\Sigma$ is only positive semi-definite. A standard derivation: by the affine-image theorem, $X = \mu + AZ$ with $AA^\top = \Sigma$, so $t^\top X = t^\top\mu + (A^\top t)^\top Z$, and the MGF factors into a product of independent univariate normal MGFs.
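Written out, that factorization is
$$M_X(t) = \mathbb E\big[e^{t^\top(\mu + AZ)}\big] = e^{t^\top\mu}\prod_{j=1}^{d} \mathbb E\big[e^{(A^\top t)_j Z_j}\big] = e^{t^\top\mu}\prod_{j=1}^{d} e^{\tfrac12 (A^\top t)_j^2} = \exp\!\Big(t^\top\mu + \tfrac12\, t^\top A A^\top t\Big) = \exp\!\Big(t^\top\mu + \tfrac12\, t^\top \Sigma\, t\Big).$$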
Three consequences fall out without integration:
- Linear combinations: $a^\top X \sim \mathcal N(a^\top\mu,\ a^\top\Sigma a)$ for any fixed $a \in \mathbb R^d$. (Apply the MGF identity to $t = s\,a$ and recognize a univariate normal MGF in $s$.)
- Independence iff zero covariance: $x_1$ and $x_2$ (jointly Gaussian) are independent iff $\Sigma_{12} = 0$, because the joint CF factors iff the cross term in the quadratic form vanishes.
- Affine maps: $BX + c \sim \mathcal N(B\mu + c,\ B\Sigma B^\top)$ for any matrix $B$ and vector $c$. (One line from the MGF; a numeric check follows this list.)
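A Monte Carlo sanity check of the affine-map consequence, with an illustrative non-square $B$: the empirical mean and covariance of $BX + c$ match $B\mu + c$ and $B\Sigma B^\top$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.4],
                  [0.0, 0.4, 1.0]])
B = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  3.0]])     # maps R^3 -> R^2, neither square nor invertible
c = np.array([5.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ B.T + c

print("mean:", Y.mean(axis=0), "vs", B @ mu + c)
print("cov :\n", np.cov(Y, rowvar=False), "\nvs\n", B @ Sigma @ B.T)
```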
Marginal Gaussianity is not joint Gaussianity
Two normal random variables can have a non-Gaussian joint distribution. The standard counterexample: let $X \sim \mathcal N(0,1)$ and define $Y = WX$ where $W$ is an independent random sign ($\pm 1$ with equal probability). Then $Y \sim \mathcal N(0,1)$ marginally, but $(X, Y)$ is supported on the union of two lines ($y = x$ and $y = -x$) and is not jointly Gaussian. The MGF / CF identity defines a joint Gaussian; "each marginal is normal" is strictly weaker.
Uncorrelated Gaussian components are not always independent
The "uncorrelated ⇒ independent" fact requires the pair to be jointly Gaussian. In the counterexample above, , so and are uncorrelated yet clearly not independent ( always). Independence-from-zero-covariance needs the joint to be Gaussian, not just the marginals.
Sampling, Cholesky, and Computational Notes
To sample $X \sim \mathcal N(\mu, \Sigma)$ in $\mathbb R^d$:
- Factor $\Sigma = LL^\top$ by Cholesky (cost $O(d^3)$).
- Draw $z \sim \mathcal N(0, I_d)$ ($d$ independent standard normals).
- Return $x = \mu + Lz$.
For evaluating densities, never invert $\Sigma$ explicitly. Solve $Lv = x - \mu$ by forward substitution (cost $O(d^2)$ given $L$); the Mahalanobis distance is $\|v\|^2$. The log-determinant is $\log|\Sigma| = 2\sum_i \log L_{ii}$.
For high-dimensional problems where $\Sigma$ has special structure (low-rank plus diagonal, Toeplitz, banded), exploit that structure: Cholesky on a generic $d \times d$ matrix is $O(d^3)$, but Cholesky on a banded matrix with bandwidth $k$ is $O(dk^2)$, and Cholesky on a Kronecker-structured matrix factors across the Kronecker components.
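A sketch of the Kronecker case (the component matrices are illustrative): the Cholesky factor of $A \otimes B$ is the Kronecker product of the component factors, so the full covariance never has to be formed or factored.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
B = np.array([[1.0, 0.3, 0.0],
              [0.3, 2.0, 0.2],
              [0.0, 0.2, 1.5]])

LA = np.linalg.cholesky(A)
LB = np.linalg.cholesky(B)

# Factoring the 2x2 and 3x3 components gives the factor of the full 6x6 covariance.
Sigma = np.kron(A, B)
print(np.allclose(np.linalg.cholesky(Sigma), np.kron(LA, LB)))   # True

# Sampling without forming Sigma: (L_A kron L_B) vec(Z) = vec(L_A Z L_B^T) for row-major vec.
rng = np.random.default_rng(6)
Z = rng.standard_normal((2, 3))
x = (LA @ Z @ LB.T).ravel()           # a draw from N(0, A kron B)
print(np.allclose(x, np.kron(LA, LB) @ Z.ravel()))               # True
```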
Worked Two-Dimensional Example
Take $\mu = (0, 0)^\top$ and
$$\Sigma = \begin{pmatrix} 4 & 1.2 \\ 1.2 & 1 \end{pmatrix}.$$
The correlation is $\rho = 1.2 / (2 \cdot 1) = 0.6$. The Cholesky factor satisfies
$$L = \begin{pmatrix} 2 & 0 \\ 0.6 & 0.8 \end{pmatrix}, \qquad LL^\top = \Sigma.$$
The Schur complement of $\Sigma_{22}$ in $\Sigma$ is $4 - 1.2^2 / 1 = 2.56$. So
$$x_1 \mid x_2 \sim \mathcal N\!\big(1.2\, x_2,\ 2.56\big).$$
Observing $x_2 = 3$ (three above its mean) shifts the conditional mean of $x_1$ to $3.6$. The conditional variance drops from the marginal $4$ to $2.56$, a $36\%$ reduction — exactly the fraction $\rho^2$ of the variance in $x_1$ that $x_2$ explains, given the correlation $\rho = 0.6$.
Useful Identities
The following matrix identities show up so often that it pays to know them by sight. Each one is a one-line consequence of the structure above.
- Affine combinations. $aX + bY \sim \mathcal N\!\big(a\mu_X + b\mu_Y,\ a^2\Sigma_X + b^2\Sigma_Y + ab(\Sigma_{XY} + \Sigma_{XY}^\top)\big)$ when $X$ and $Y$ are jointly Gaussian.
- Sum of independent Gaussians. If $X \sim \mathcal N(\mu_1, \Sigma_1)$ and $Y \sim \mathcal N(\mu_2, \Sigma_2)$ are independent Gaussians, $X + Y \sim \mathcal N(\mu_1 + \mu_2,\ \Sigma_1 + \Sigma_2)$.
- Conditional precision. $(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}$ is the upper-left block of $\Sigma^{-1}$. The precision matrix $\Lambda = \Sigma^{-1}$ encodes conditional independence: $\Lambda_{ij} = 0$ iff $x_i$ and $x_j$ are conditionally independent given the rest.
- Woodbury identity for low-rank updates: $(A + UCV)^{-1} = A^{-1} - A^{-1}U\,(C^{-1} + VA^{-1}U)^{-1}\,VA^{-1}$. Used to invert posterior covariances when the prior covariance has simple structure.
- Matrix determinant lemma: $\det(A + UCV) = \det(C^{-1} + VA^{-1}U)\,\det(C)\,\det(A)$. Used to compute the marginal likelihood in Bayesian linear regression and the log-determinant of the GP covariance. (A numeric check of both identities follows this list.)
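A numeric check of both identities with small, arbitrarily chosen matrices; handy as a template when debugging low-rank posterior updates.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 6, 2
A = np.diag(rng.uniform(1.0, 3.0, size=d))     # simple-structure "prior" term
U = rng.standard_normal((d, k))
C = np.eye(k)
V = U.T                                        # symmetric low-rank update U C U^T

M = A + U @ C @ V
A_inv = np.linalg.inv(A)

# Woodbury: (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
print(np.allclose(np.linalg.inv(M), woodbury))     # True

# Matrix determinant lemma: det(A + U C V) = det(C^{-1} + V A^{-1} U) det(C) det(A)
lemma = np.linalg.det(np.linalg.inv(C) + V @ A_inv @ U) * np.linalg.det(C) * np.linalg.det(A)
print(np.isclose(np.linalg.det(M), lemma))         # True
```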
Common Confusions
The covariance matrix is not the precision matrix
The covariance encodes pairwise variances and covariances of the components; the precision encodes pairwise conditional covariances given the rest. Two coordinates can have zero covariance (independent marginally) but nonzero precision (dependent conditionally on a third), and vice versa. The precision is what controls conditional structure; the covariance is what controls marginal structure. Confusing them is the root of most graphical-model errors.
Positive definite vs positive semi-definite covariance
Most textbook formulas assume $\Sigma \succ 0$ (strictly positive definite). When $\Sigma$ is only PSD, the distribution is degenerate: it concentrates on the affine subspace $\mu + \operatorname{range}(\Sigma)$. The Lebesgue density does not exist (you cannot divide by $|\Sigma|^{1/2} = 0$), but the characteristic function is still well-defined and identifies the distribution. In practice this happens when you condition on too many linearly dependent observations.
The MVN is the maximum-entropy distribution for given mean and covariance
Among all distributions on $\mathbb R^d$ with a fixed mean $\mu$ and covariance $\Sigma$, the Gaussian $\mathcal N(\mu, \Sigma)$ uniquely maximizes the differential entropy $h(p) = -\int p(x)\log p(x)\,dx$. This is why the Gaussian shows up as the default in maximum-entropy modeling, why the central limit theorem produces Gaussian limits, and why "I only know the first two moments" implicitly means Gaussian in many applied contexts. It is not because Gaussianity is empirically true; it is because Gaussianity is the least committal second-moment-constrained distribution.
Summary
- $X \sim \mathcal N(\mu, \Sigma)$ has density $(2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\!\big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\big)$, derived from $Z \sim \mathcal N(0, I_d)$ by the affine map $X = \mu + AZ$ with $AA^\top = \Sigma$.
- The completing-the-square recipe: any log-density of the form $-\tfrac12\, x^\top A x + b^\top x + \text{const}$ is Gaussian with mean $A^{-1}b$ and covariance $A^{-1}$.
- Marginals are Gaussian: $x_1 \sim \mathcal N(\mu_1, \Sigma_{11})$.
- Conditionals are Gaussian: $x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$. The conditional covariance is the Schur complement of $\Sigma_{22}$ in $\Sigma$.
- The characteristic function $\varphi_X(t) = \exp\!\big(i\,t^\top\mu - \tfrac12\, t^\top\Sigma t\big)$ identifies the distribution even when $\Sigma$ is singular.
- Independence equals zero covariance only when the joint is Gaussian. Marginal Gaussianity alone does not imply joint Gaussianity.
Exercises
Problem
Let $(x_1, x_2) \sim \mathcal N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}\right)$. Compute the conditional distribution of $x_1$ given $x_2 = 1$.
Problem
Use the MGF to show that if $X \sim \mathcal N(\mu, \Sigma)$ and $B$ is any matrix (not necessarily square or invertible), then $BX \sim \mathcal N(B\mu,\ B\Sigma B^\top)$.
Problem
Suppose $\theta \sim \mathcal N(\mu_0, \Sigma_0)$ in $\mathbb R^d$ and let $y = X\theta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ independent of $\theta$. Derive the conditional distribution $\theta \mid y$ by completing the square in the joint log-density. Verify your answer against the Schur-complement formula applied to the joint of $(\theta, y)$.
Problem
Let $X \sim \mathcal N(0, \Sigma)$ in $\mathbb R^d$. Show that for any orthogonal matrix $Q$, the distribution of $QX$ equals the distribution of $X$ iff $Q\Sigma Q^\top = \Sigma$. Conclude that $\mathcal N(0, \sigma^2 I_d)$ is the unique mean-zero $d$-dimensional Gaussian that is rotation-invariant.
References
Canonical:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.3 (the Gaussian, conditional and marginal Gaussians, Bayes' theorem for Gaussian variables, maximum-likelihood for the Gaussian).
- Hogg, R.V., McKean, J.W., & Craig, A.T. (2018). Introduction to Mathematical Statistics, 8th ed. Pearson. Ch. 3 (the multivariate normal at the level of a first graduate course).
- Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley. Ch. 2–4 (the canonical reference for the algebraic properties).
- Mardia, K.V., Kent, J.T., & Bibby, J.M. (1979). Multivariate Analysis. Academic Press. Ch. 3.
Current:
- Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Ch. 3. (Concentration and tail bounds for Gaussian vectors in high dimensions.)
- Wainwright, M.J. (2019). High-Dimensional Statistics. Cambridge University Press. §2.1–2.3 (sub-Gaussian framework and the Gaussian as the canonical example).
- Murphy, K.P. (2022). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 2 (the MVN from the ML perspective).
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Appendix A (the multivariate normal as the Bayesian's default noise model).
Next Topics
- Conjugate priors: the multivariate normal is the conjugate prior for the mean of another multivariate normal; this is the identity behind every closed-form Bayesian Gaussian model.
- Bayesian linear regression: the completing-the-square recipe in action on a regression posterior.
- Gaussian processes for ML: the infinite-dimensional generalization, with conditioning given by the same Schur-complement formula.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Common Probability Distributions
- Expectation, Variance, Covariance, and Moments
- Joint, Marginal, and Conditional Distributions
- Positive Semidefinite Matrices
- The Jacobian Matrix
Derived topics
- Conjugate Priors
- Bayesian Linear Regression
- Gauss-Markov Theorem
- Gaussian Processes for Machine Learning