Statistical Estimation
The Multivariate Normal Distribution
The multivariate Gaussian as the joint distribution of $d$ correlated random variables: density derivation from standard normals via affine maps, the completing-the-square recipe, Schur-complement marginals and conditionals, the MGF and characteristic function, and the algebraic identities that power every Bayesian Gaussian derivation downstream.
Prerequisites
Why This Matters
The multivariate normal is the joint distribution that every Bayesian Gaussian calculation on this site reduces to. Conditioning on observed data in Bayesian linear regression, computing the posterior in a Gaussian process, deriving the Kalman filter, and explaining why ridge regression has a Bayesian interpretation all rely on three pieces of algebra: the density form $p(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\!\big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\big)$, the completing-the-square move that lets you read off the mean and covariance of a posterior from its log-density, and the Schur-complement formula for the conditional distribution of one block given another.
Most textbooks state these facts. This page derives them.
Mental Model
A standard $d$-dimensional normal $Z \sim \mathcal N(0, I_d)$ is a $d$-tuple of independent $\mathcal N(0,1)$ variables. Apply an affine map $X = AZ + \mu$ with $A$ invertible and $\mu \in \mathbb R^d$: every component of $X$ is a linear combination of independent standard normals, so $X$ is Gaussian in every direction, with mean $\mu$ and covariance $AA^\top$. That is the multivariate normal.
This construction tells you what to expect:
- Affine images of Gaussians are Gaussian. Closed under linear maps and translation.
- Marginals are Gaussian. Project out a block of coordinates: the projection is an affine map.
- Conditionals are Gaussian. Fix a block and look at the rest: conditioning on a Gaussian is a linear operation in disguise (this is what the Schur complement makes precise).
- Independence equals zero covariance (a fact special to Gaussians, false in general).
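A minimal numeric sketch of this construction (the particular $A$ and $\mu$ below are illustrative, not from the text): push standard normals through an affine map and check that the empirical mean and covariance match $\mu$ and $AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative affine map: any invertible A and any mu work.
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
mu = np.array([1.0, -1.0])

Z = rng.standard_normal((100_000, 2))   # rows are iid N(0, I_2) draws
X = Z @ A.T + mu                        # affine image: x = A z + mu, applied row-wise

print("empirical mean", X.mean(axis=0))            # ~ mu
print("empirical cov\n", np.cov(X, rowvar=False))  # ~ A A^T
print("target Sigma = A A^T\n", A @ A.T)
```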
Formal Setup
Multivariate Normal
A random vector $X \in \mathbb R^d$ has the multivariate normal distribution $\mathcal N(\mu, \Sigma)$ with mean $\mu \in \mathbb R^d$ and positive definite covariance $\Sigma \in \mathbb R^{d\times d}$ if it has density
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big).$$
When $\Sigma$ is only positive semi-definite (rank $r < d$), the distribution is degenerate: $X$ lives in an $r$-dimensional affine subspace and has no density with respect to Lebesgue measure on $\mathbb R^d$. The degenerate case is handled cleanly by the characteristic function (below).
The exponent $(x-\mu)^\top\Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance from $x$ to $\mu$. Level sets are ellipsoids whose axes are the eigenvectors of $\Sigma$, scaled by the square roots of its eigenvalues.
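A small check of the level-set geometry, with an illustrative $\mu$ and $\Sigma$: the points $\mu + \sqrt{\lambda_i}\,v_i$ along the eigenvectors of $\Sigma$ all lie at squared Mahalanobis distance 1.

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are eigenvectors

def mahalanobis_sq(x, mu, Sigma):
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

# Each ellipsoid semi-axis endpoint mu + sqrt(lambda_i) v_i lies on the unit level set.
for lam, v in zip(eigvals, eigvecs.T):
    x = mu + np.sqrt(lam) * v
    print(lam, mahalanobis_sq(x, mu, Sigma))   # second value prints ~1.0 for both axes
```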
Deriving the Density
The cleanest derivation is by affine change of variables from the standard normal.
Density of an Affine Image of a Standard Normal
Statement
Let $Z \sim \mathcal N(0, I_d)$ have density $\phi(z) = (2\pi)^{-d/2}\exp(-\tfrac12 z^\top z)$. Define $X = AZ + \mu$ with $A$ invertible. Then $X$ has density
$$p_X(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big),$$
where $\Sigma = AA^\top$. So $X \sim \mathcal N(\mu, \Sigma)$.
Intuition
The standard normal's density has a quadratic exponent $-\tfrac12\,z^\top z$. An invertible linear map stretches the level sets into ellipsoids and rescales volume by $|\det A|$, which produces the $\Sigma^{-1}$ in the exponent and the $|\Sigma|^{1/2}$ in the normalizer.
Proof Sketch
The inverse map is $z = A^{-1}(x - \mu)$, with Jacobian $A^{-1}$ and $|\det A^{-1}| = 1/|\det A|$. The change-of-variables formula gives
$$p_X(x) = \frac{\phi\!\big(A^{-1}(x-\mu)\big)}{|\det A|} = \frac{1}{(2\pi)^{d/2}\,|\det A|}\exp\!\Big(-\tfrac12\,(x-\mu)^\top A^{-\top}A^{-1}(x-\mu)\Big).$$
Now $A^{-\top}A^{-1} = (AA^\top)^{-1}$ using the inverse of a product. Define $\Sigma = AA^\top$, so $\Sigma^{-1} = A^{-\top}A^{-1}$ and $|\Sigma|^{1/2} = |\det A|$. Substituting yields the stated form.
Why It Matters
This is the constructive definition: every multivariate normal is an affine image of $\mathcal N(0, I_d)$. To sample $X \sim \mathcal N(\mu, \Sigma)$ in code you compute the Cholesky factor $\Sigma = LL^\top$, draw $z \sim \mathcal N(0, I_d)$, and return $x = \mu + Lz$. The same factorization gives a clean route to log-densities: $\log|\Sigma| = 2\sum_i \log L_{ii}$, and solving $Lv = x - \mu$ by forward substitution gives the Mahalanobis distance $\|v\|^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu)$ without inverting $\Sigma$ explicitly.
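A sketch of both uses of the Cholesky factor, under an arbitrary illustrative $\mu$ and $\Sigma$; the hand-rolled log-density is compared against scipy.stats.multivariate_normal as a sanity check.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

L = cholesky(Sigma, lower=True)              # Sigma = L L^T

def sample(n):
    Z = rng.standard_normal((n, len(mu)))
    return mu + Z @ L.T                      # x = mu + L z, applied row-wise

def logpdf(x):
    v = solve_triangular(L, x - mu, lower=True)    # L v = x - mu (forward substitution)
    maha = v @ v                                   # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log|Sigma|
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

x = sample(1)[0]
print(logpdf(x))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))   # should agree
```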
Failure Mode
If $A$ is singular, the affine map collapses onto a lower-dimensional subspace and the density formula breaks (one cannot divide by $|\Sigma|^{1/2} = |\det A| = 0$). The distribution still exists, but lives on a flat in $\mathbb R^d$ and must be described by its characteristic function, not its Lebesgue density.
Completing the Square: The Canonical Algebraic Move
Almost every Bayesian Gaussian derivation on this site reduces to one move: take a log-density that is a quadratic in $x$ plus a linear term plus a constant, and reorganize it as $-\tfrac12\,(x-\mu)^\top A\,(x-\mu) + \text{const}$ to read off the mean $\mu$ and the precision $A$.
Completing the Square (Multivariate)
Suppose
$$f(x) = -\tfrac12\, x^\top A x + b^\top x + c$$
with $A$ symmetric positive definite, $b \in \mathbb R^d$, and $c \in \mathbb R$. Then
$$f(x) = -\tfrac12\,(x - A^{-1}b)^\top A\,(x - A^{-1}b) + \tfrac12\, b^\top A^{-1} b + c.$$
So if $\exp f$ is a probability density (up to normalization) in $x$, then $x$ is Gaussian with mean $A^{-1}b$ and precision $A$ (equivalently, covariance $A^{-1}$). The constant terms determine the normalization, which usually does not matter for identifying the distribution.
The recipe in three lines:
- Collect every term involving $x$ from the log-density.
- Match coefficients: $A$ is the matrix in the quadratic term $-\tfrac12\, x^\top A x$ (the quadratic-form coefficient), and $b$ is the vector in the linear term $b^\top x$ (the linear-term vector).
- Read off $\mu = A^{-1}b$ and $\Sigma = A^{-1}$.
The proof is one line: expand $-\tfrac12\,(x-\mu)^\top A\,(x-\mu) = -\tfrac12\, x^\top A x + \mu^\top A x - \tfrac12\, \mu^\top A \mu$. Setting $\mu = A^{-1}b$ matches the linear term $b^\top x$. The constant $\tfrac12\, b^\top A^{-1} b$ is what falls out.
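A numeric spot-check of the identity, with a randomly generated positive definite $A$ and vector $b$ (values are arbitrary): both sides agree at random points up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # symmetric positive definite precision
b = rng.standard_normal(d)

mu = np.linalg.solve(A, b)           # mean = A^{-1} b
const = 0.5 * b @ mu                 # 1/2 b^T A^{-1} b

for _ in range(3):
    x = rng.standard_normal(d)
    lhs = -0.5 * x @ A @ x + b @ x
    rhs = -0.5 * (x - mu) @ A @ (x - mu) + const
    print(np.isclose(lhs, rhs))      # True each time
```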
A worked completing-the-square
Suppose
$$\log f(x) = -\tfrac12\, x^\top \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} x + \begin{pmatrix} 1 \\ 4 \end{pmatrix}^{\!\top} x + \text{const}.$$
The precision is $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, so $A^{-1} = \tfrac13\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, and the linear-coefficient vector is $b = (1, 4)^\top$. The mean is
$$\mu = A^{-1}b = \tfrac13\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} -2/3 \\ 7/3 \end{pmatrix}.$$
The covariance is $\Sigma = A^{-1}$. The constant $\tfrac12\, b^\top A^{-1} b$ is what survives outside the squared term.
This move is the source of every closed-form Bayesian Gaussian update: see conjugate priors, Bayesian linear regression, and the GP posterior.
Marginals and Conditionals via the Schur Complement
Partition $x \sim \mathcal N(\mu, \Sigma)$ into two blocks of sizes $d_1$ and $d_2$:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
with $\Sigma_{21} = \Sigma_{12}^\top$.
Gaussian Marginals and Conditionals
Statement
The marginal and conditional distributions are both Gaussian:
$$x_1 \sim \mathcal N(\mu_1, \Sigma_{11}), \qquad x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big).$$
The conditional covariance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$.
Intuition
The conditional mean is a linear function of the observation $x_2$: start at $\mu_1$ and shift in the direction $\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$, where $\Sigma_{12}\Sigma_{22}^{-1}$ is the regression coefficient of $x_1$ on $x_2$. The conditional variance is smaller than the marginal variance (in the Loewner order) by exactly the amount of variance explained by $x_2$ — this is the matrix version of "$\operatorname{Var}(X_1 \mid X_2) \le \operatorname{Var}(X_1)$, with equality iff $X_1$ and $X_2$ are independent."
Proof Sketch
Use the block-inverse identity for $\Sigma^{-1}$. Writing $S = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ for the Schur complement,
$$\Sigma^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}S^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}S^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix}.$$
Expand the joint log-density using this block form. The terms involving $x_1$ (with $x_2$ fixed) collect into
$$-\tfrac12\, x_1^\top S^{-1} x_1 + \big(S^{-1}\mu_1 + S^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\big)^{\!\top} x_1 + \text{const}.$$
Apply the completing-the-square recipe with $A = S^{-1}$ and $b = S^{-1}\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\big)$: the mean is $A^{-1}b = \mu_1$, shifted by $\Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$, and the covariance is $A^{-1} = S$. Combined, the conditional is $x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ S\big)$.
The marginal of $x_1$ follows from the affine-image theorem (or from the MGF identity below, which does not require invertibility): $x_1 = \begin{pmatrix} I & 0 \end{pmatrix} x$ is an affine image of a Gaussian, with mean $\mu_1$ and covariance $\Sigma_{11}$.
Why It Matters
This is the identity behind Gaussian inference. The Kalman filter's measurement update, the GP posterior, and the Bayesian linear regression predictive distribution are all special cases of this conditioning formula. The Schur complement is also what guarantees joint positive definiteness ($\Sigma_{22} \succ 0$ and $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \succ 0$ together imply $\Sigma \succ 0$) and quantifies how much $x_2$ tells you about $x_1$.
Failure Mode
The conditional formula requires $\Sigma_{22}$ invertible. If $\Sigma_{22}$ is singular, $x_2$ has linearly dependent components and the conditioning event either has probability one in some direction (collapsing the conditional dimension) or contradicts the data (probability zero). The pseudo-inverse $\Sigma_{22}^{+}$ replaces $\Sigma_{22}^{-1}$ in the degenerate case, and the conditional lives on a coset of the null space of $\Sigma_{22}$.
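A sketch of the conditioning formula as a small function (the block sizes and joint below are illustrative). The final line checks the block-inverse fact used in the proof: the conditional precision equals the corresponding block of $\Sigma^{-1}$.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_obs, x_obs):
    """Return mean and covariance of x_free | x_obs for a joint N(mu, Sigma)."""
    d = len(mu)
    idx_free = np.setdiff1d(np.arange(d), idx_obs)
    S11 = Sigma[np.ix_(idx_free, idx_free)]
    S12 = Sigma[np.ix_(idx_free, idx_obs)]
    S22 = Sigma[np.ix_(idx_obs, idx_obs)]
    gain = S12 @ np.linalg.inv(S22)                 # regression coefficient of x1 on x2
    cond_mean = mu[idx_free] + gain @ (x_obs - mu[idx_obs])
    cond_cov = S11 - gain @ S12.T                   # Schur complement of S22 in Sigma
    return cond_mean, cond_cov

# Illustrative joint over (x1, x2) with x2 the last two coordinates.
mu = np.array([0.0, 1.0, -1.0, 2.0])
B = np.random.default_rng(3).standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)

cond_mean, cond_cov = condition_gaussian(mu, Sigma,
                                         idx_obs=np.array([2, 3]),
                                         x_obs=np.array([0.0, 0.0]))

# Consistency check: conditional precision = upper-left block of Sigma^{-1}.
print(np.allclose(np.linalg.inv(cond_cov), np.linalg.inv(Sigma)[:2, :2]))   # True
```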
MGF and Characteristic Function: The Alternative Path
The completing-the-square route is algebraic. A second route uses the moment generating function (MGF) and characteristic function (CF), which often gives shorter proofs of distributional facts and handles the degenerate case cleanly.
For $X \sim \mathcal N(\mu, \Sigma)$:
$$M_X(t) = \mathbb E\big[e^{t^\top X}\big] = \exp\!\big(t^\top\mu + \tfrac12\, t^\top \Sigma\, t\big), \qquad \varphi_X(t) = \mathbb E\big[e^{i t^\top X}\big] = \exp\!\big(i\, t^\top\mu - \tfrac12\, t^\top \Sigma\, t\big).$$
The CF is well-defined and identifies the distribution even when $\Sigma$ is only positive semi-definite. A standard derivation: by the affine-image theorem, $X = \mu + AZ$ with $AA^\top = \Sigma$, so $t^\top X = t^\top\mu + (A^\top t)^\top Z$, and the MGF factors into a product of independent univariate normal MGFs.
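Written out, that factorization is
$$M_X(t) = \mathbb E\big[e^{t^\top(\mu + AZ)}\big] = e^{t^\top\mu}\prod_{j=1}^{d} \mathbb E\big[e^{(A^\top t)_j Z_j}\big] = e^{t^\top\mu}\prod_{j=1}^{d} e^{\tfrac12 (A^\top t)_j^2} = \exp\!\Big(t^\top\mu + \tfrac12\, t^\top A A^\top t\Big) = \exp\!\Big(t^\top\mu + \tfrac12\, t^\top \Sigma\, t\Big).$$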
Three consequences fall out without integration:
- Linear combinations: $a^\top X \sim \mathcal N(a^\top\mu,\ a^\top\Sigma a)$ for any fixed $a \in \mathbb R^d$. (Apply the MGF identity to $t = s\,a$ and recognize a univariate normal MGF in $s$.)
- Independence iff zero covariance: $x_1$ and $x_2$ (jointly Gaussian) are independent iff $\Sigma_{12} = 0$, because the joint CF factors iff the cross term in the quadratic form vanishes.
- Affine maps: $BX + c \sim \mathcal N(B\mu + c,\ B\Sigma B^\top)$ for any matrix $B$ and vector $c$. (One line from the MGF; a numeric check follows this list.)
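A Monte Carlo sanity check of the affine-map consequence, with an illustrative non-square $B$: the empirical mean and covariance of $BX + c$ match $B\mu + c$ and $B\Sigma B^\top$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.4],
                  [0.0, 0.4, 1.0]])
B = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  3.0]])     # maps R^3 -> R^2, neither square nor invertible
c = np.array([5.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ B.T + c

print("mean:", Y.mean(axis=0), "vs", B @ mu + c)
print("cov :\n", np.cov(Y, rowvar=False), "\nvs\n", B @ Sigma @ B.T)
```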
Marginal Gaussianity is not joint Gaussianity
Two normal random variables can have a non-Gaussian joint distribution. The standard counterexample: let $X \sim \mathcal N(0,1)$ and define $Y = WX$ where $W$ is an independent random sign ($\pm 1$ with equal probability). Then $Y \sim \mathcal N(0,1)$ marginally, but $(X, Y)$ is supported on the union of two lines ($y = x$ and $y = -x$) and is not jointly Gaussian. The MGF / CF identity defines a joint Gaussian; "each marginal is normal" is strictly weaker.
Uncorrelated Gaussian components are not always independent
The "uncorrelated ⇒ independent" fact requires the pair to be jointly Gaussian. In the counterexample above, , so and are uncorrelated yet clearly not independent ( always). Independence-from-zero-covariance needs the joint to be Gaussian, not just the marginals.
Sampling, Cholesky, and Computational Notes
To sample $X \sim \mathcal N(\mu, \Sigma)$ in $\mathbb R^d$:
- Factor $\Sigma = LL^\top$ by Cholesky (cost $O(d^3)$).
- Draw $z \sim \mathcal N(0, I_d)$ ($d$ independent standard normals).
- Return $x = \mu + Lz$.
For evaluating densities, never invert $\Sigma$ explicitly. Solve $Lv = x - \mu$ by forward substitution (cost $O(d^2)$ given $L$); the Mahalanobis distance is $\|v\|^2$. The log-determinant is $\log|\Sigma| = 2\sum_i \log L_{ii}$.
For high-dimensional problems where $\Sigma$ has special structure (low-rank plus diagonal, Toeplitz, banded), exploit that structure: Cholesky on a generic $d \times d$ matrix is $O(d^3)$, but Cholesky on a banded matrix with bandwidth $k$ is $O(dk^2)$, and Cholesky on a Kronecker-structured matrix factors across the Kronecker components.
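A sketch of the Kronecker case (the component matrices are illustrative): the Cholesky factor of $A \otimes B$ is the Kronecker product of the component factors, so the full covariance never has to be formed or factored.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
B = np.array([[1.0, 0.3, 0.0],
              [0.3, 2.0, 0.2],
              [0.0, 0.2, 1.5]])

LA = np.linalg.cholesky(A)
LB = np.linalg.cholesky(B)

# Factoring the 2x2 and 3x3 components gives the factor of the full 6x6 covariance.
Sigma = np.kron(A, B)
print(np.allclose(np.linalg.cholesky(Sigma), np.kron(LA, LB)))   # True

# Sampling without forming Sigma: (L_A kron L_B) vec(Z) = vec(L_A Z L_B^T) for row-major vec.
rng = np.random.default_rng(6)
Z = rng.standard_normal((2, 3))
x = (LA @ Z @ LB.T).ravel()           # a draw from N(0, A kron B)
print(np.allclose(x, np.kron(LA, LB) @ Z.ravel()))               # True
```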
Worked Two-Dimensional Example
Take $\mu = (0, 0)^\top$ and
$$\Sigma = \begin{pmatrix} 4 & 1.2 \\ 1.2 & 1 \end{pmatrix}.$$
The correlation is $\rho = 1.2 / (2 \cdot 1) = 0.6$. The Cholesky factor satisfies
$$L = \begin{pmatrix} 2 & 0 \\ 0.6 & 0.8 \end{pmatrix}, \qquad LL^\top = \Sigma.$$
The Schur complement of $\Sigma_{22}$ in $\Sigma$ is $4 - 1.2^2 / 1 = 2.56$. So
$$x_1 \mid x_2 \sim \mathcal N\!\big(1.2\, x_2,\ 2.56\big).$$
Observing $x_2 = 3$ (three above its mean) shifts the conditional mean of $x_1$ to $3.6$. The conditional variance drops from the marginal $4$ to $2.56$, a $36\%$ reduction — exactly the fraction $\rho^2$ of the variance in $x_1$ that $x_2$ explains, given the correlation $\rho = 0.6$.
Useful Identities
The following matrix identities show up so often that it pays to know them by sight. Each one is a one-line consequence of the structure above.
- Affine combinations. $aX + bY \sim \mathcal N\!\big(a\mu_X + b\mu_Y,\ a^2\Sigma_X + b^2\Sigma_Y + ab(\Sigma_{XY} + \Sigma_{XY}^\top)\big)$ when $X$ and $Y$ are jointly Gaussian.
- Sum of independent Gaussians. If $X \sim \mathcal N(\mu_1, \Sigma_1)$ and $Y \sim \mathcal N(\mu_2, \Sigma_2)$ are independent Gaussians, $X + Y \sim \mathcal N(\mu_1 + \mu_2,\ \Sigma_1 + \Sigma_2)$.
- Conditional precision. $(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}$ is the upper-left block of $\Sigma^{-1}$. The precision matrix $\Lambda = \Sigma^{-1}$ encodes conditional independence: $\Lambda_{ij} = 0$ iff $x_i$ and $x_j$ are conditionally independent given the rest.
- Woodbury identity for low-rank updates: $(A + UCV)^{-1} = A^{-1} - A^{-1}U\,(C^{-1} + VA^{-1}U)^{-1}\,VA^{-1}$. Used to invert posterior covariances when the prior covariance has simple structure.
- Matrix determinant lemma: $\det(A + UCV) = \det(C^{-1} + VA^{-1}U)\,\det(C)\,\det(A)$. Used to compute the marginal likelihood in Bayesian linear regression and the log-determinant of the GP covariance. (A numeric check of both identities follows this list.)
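A numeric check of both identities with small, arbitrarily chosen matrices; handy as a template when debugging low-rank posterior updates.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 6, 2
A = np.diag(rng.uniform(1.0, 3.0, size=d))     # simple-structure "prior" term
U = rng.standard_normal((d, k))
C = np.eye(k)
V = U.T                                        # symmetric low-rank update U C U^T

M = A + U @ C @ V
A_inv = np.linalg.inv(A)

# Woodbury: (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
print(np.allclose(np.linalg.inv(M), woodbury))     # True

# Matrix determinant lemma: det(A + U C V) = det(C^{-1} + V A^{-1} U) det(C) det(A)
lemma = np.linalg.det(np.linalg.inv(C) + V @ A_inv @ U) * np.linalg.det(C) * np.linalg.det(A)
print(np.isclose(np.linalg.det(M), lemma))         # True
```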
Common Confusions
The covariance matrix is not the precision matrix
The covariance encodes pairwise variances and covariances of the components; the precision encodes pairwise conditional covariances given the rest. Two coordinates can have zero covariance (independent marginally) but nonzero precision (dependent conditionally on a third), and vice versa. The precision is what controls conditional structure; the covariance is what controls marginal structure. Confusing them is the root of most graphical-model errors.
Positive definite vs positive semi-definite covariance
Most textbook formulas assume $\Sigma \succ 0$ (strictly positive definite). When $\Sigma$ is only PSD, the distribution is degenerate: it concentrates on the affine subspace $\mu + \operatorname{range}(\Sigma)$. The Lebesgue density does not exist (you cannot divide by $|\Sigma|^{1/2} = 0$), but the characteristic function is still well-defined and identifies the distribution. In practice this happens when you condition on too many linearly dependent observations.
The MVN is the maximum-entropy distribution for given mean and covariance
Among all distributions on $\mathbb R^d$ with a fixed mean $\mu$ and covariance $\Sigma$, the Gaussian $\mathcal N(\mu, \Sigma)$ uniquely maximizes the differential entropy $h(p) = -\int p(x)\log p(x)\,dx$. This is why the Gaussian shows up as the default in maximum-entropy modeling, why the central limit theorem produces Gaussian limits, and why "I only know the first two moments" implicitly means Gaussian in many applied contexts. It is not because Gaussianity is empirically true; it is because Gaussianity is the least committal second-moment-constrained distribution.
Summary
- $X \sim \mathcal N(\mu, \Sigma)$ has density $(2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\!\big(-\tfrac12\,(x-\mu)^\top\Sigma^{-1}(x-\mu)\big)$, derived from $Z \sim \mathcal N(0, I_d)$ by the affine map $X = \mu + AZ$ with $AA^\top = \Sigma$.
- The completing-the-square recipe: any log-density of the form $-\tfrac12\, x^\top A x + b^\top x + \text{const}$ is Gaussian with mean $A^{-1}b$ and covariance $A^{-1}$.
- Marginals are Gaussian: $x_1 \sim \mathcal N(\mu_1, \Sigma_{11})$.
- Conditionals are Gaussian: $x_1 \mid x_2 \sim \mathcal N\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$. The conditional covariance is the Schur complement of $\Sigma_{22}$ in $\Sigma$.
- The characteristic function $\varphi_X(t) = \exp\!\big(i\,t^\top\mu - \tfrac12\, t^\top\Sigma t\big)$ identifies the distribution even when $\Sigma$ is singular.
- Independence equals zero covariance only when the joint is Gaussian. Marginal Gaussianity alone does not imply joint Gaussianity.
Exercises
Problem
Let $(x_1, x_2) \sim \mathcal N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}\right)$. Compute the conditional distribution of $x_1$ given $x_2 = 1$.
Problem
Use the MGF to show that if $X \sim \mathcal N(\mu, \Sigma)$ and $B$ is any matrix (not necessarily square or invertible), then $BX \sim \mathcal N(B\mu,\ B\Sigma B^\top)$.
Problem
Suppose $\theta \sim \mathcal N(\mu_0, \Sigma_0)$ in $\mathbb R^d$ and let $y = X\theta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ independent of $\theta$. Derive the conditional distribution $\theta \mid y$ by completing the square in the joint log-density. Verify your answer against the Schur-complement formula applied to the joint of $(\theta, y)$.
Problem
Let $X \sim \mathcal N(0, \Sigma)$ in $\mathbb R^d$. Show that for any orthogonal matrix $Q$, the distribution of $QX$ equals the distribution of $X$ iff $Q\Sigma Q^\top = \Sigma$. Conclude that $\mathcal N(0, \sigma^2 I_d)$ is the unique mean-zero $d$-dimensional Gaussian that is rotation-invariant.
References
Canonical:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.3 (the Gaussian, conditional and marginal Gaussians, Bayes' theorem for Gaussian variables, maximum-likelihood for the Gaussian).
- Hogg, R.V., McKean, J.W., & Craig, A.T. (2018). Introduction to Mathematical Statistics, 8th ed. Pearson. Ch. 3 (the multivariate normal at the level of a first graduate course).
- Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley. Ch. 2–4 (the canonical reference for the algebraic properties).
- Mardia, K.V., Kent, J.T., & Bibby, J.M. (1979). Multivariate Analysis. Academic Press. Ch. 3.
Current:
- Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Ch. 3. (Concentration and tail bounds for Gaussian vectors in high dimensions.)
- Wainwright, M.J. (2019). High-Dimensional Statistics. Cambridge University Press. §2.1–2.3 (sub-Gaussian framework and the Gaussian as the canonical example).
- Murphy, K.P. (2022). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 2 (the MVN from the ML perspective).
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Appendix A (the multivariate normal as the Bayesian's default noise model).
Next Topics
- Conjugate priors: the multivariate normal is the conjugate prior for the mean of another multivariate normal; this is the identity behind every closed-form Bayesian Gaussian model.
- Bayesian linear regression: the completing-the-square recipe in action on a regression posterior.
- Gaussian processes for ML: the infinite-dimensional generalization, with conditioning given by the same Schur-complement formula.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Common Probability Distributions
- Expectation, Variance, Covariance, and Moments
- Joint, Marginal, and Conditional Distributions
- Positive Semidefinite Matrices
- The Jacobian Matrix
Derived topics
- Conjugate Priors
- Bayesian Linear Regression
- Gauss-Markov Theorem
- Gaussian Processes for Machine Learning