
Statistical Estimation

The Multivariate Normal Distribution

The multivariate Gaussian as the joint distribution of $d$ correlated random variables: density derivation from standard normals via affine maps, the completing-the-square recipe, Schur-complement marginals and conditionals, the MGF and characteristic function, and the algebraic identities that power every Bayesian Gaussian derivation downstream.

Advanced · Tier 1 · Stable · Core spine · ~85 min

Why This Matters

The multivariate normal is the joint distribution that every Bayesian Gaussian calculation on this site reduces to. Conditioning on observed data in Bayesian linear regression, computing the posterior in a Gaussian process, deriving the Kalman filter, and explaining why ridge regression has a Bayesian interpretation all rely on three pieces of algebra: the density form $-\tfrac12(x-\mu)^\top \Sigma^{-1}(x-\mu)$, the completing-the-square move that lets you read off the mean and covariance of a posterior from its log-density, and the Schur-complement formula for the conditional distribution of one block given another.

Most textbooks state these facts. This page derives them.

Mental Model

A standard $d$-dimensional normal $Z \sim \mathcal N(0, I_d)$ is a $d$-tuple of independent $\mathcal N(0,1)$ variables. Apply an affine map $X = AZ + \mu$ with $A \in \mathbb R^{d \times d}$ invertible and $\Sigma = AA^\top$: every component of $X$ is a linear combination of independent standard normals, so $X$ is Gaussian in every direction, with mean $\mu$ and covariance $\Sigma$. That is the multivariate normal.

This construction tells you what to expect:

  • Affine images of Gaussians are Gaussian. Closed under linear maps and translation.
  • Marginals are Gaussian. Project out a block of coordinates: the projection is an affine map.
  • Conditionals are Gaussian. Fix a block and look at the rest: conditioning on a Gaussian is a linear operation in disguise (this is what the Schur complement makes precise).
  • Independence equals zero covariance (a fact special to Gaussians, false in general).

Formal Setup

Definition

Multivariate Normal

A random vector $X \in \mathbb R^d$ has the multivariate normal distribution with mean $\mu \in \mathbb R^d$ and positive definite covariance $\Sigma \in \mathbb R^{d \times d}$ if it has density

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac12 (x - \mu)^\top \Sigma^{-1} (x - \mu)\right) \quad \text{for } x \in \mathbb R^d.$$

When $\Sigma$ is only positive semi-definite (rank $r < d$), the distribution is degenerate: $X$ lives in an $r$-dimensional affine subspace and has no density with respect to Lebesgue measure on $\mathbb R^d$. The degenerate case is handled cleanly by the characteristic function (below).

The exponent $(x-\mu)^\top \Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance from $x$ to $\mu$. Level sets are ellipsoids whose axes are the eigenvectors of $\Sigma$, scaled by the square roots of its eigenvalues.

Deriving the Density

The cleanest derivation is by affine change of variables from the standard normal.

Theorem

Density of an Affine Image of a Standard Normal

Statement

Let $Z \sim \mathcal N(0, I_d)$ have density $\varphi(z) = (2\pi)^{-d/2} \exp(-\tfrac12 z^\top z)$. Define $X = AZ + \mu$ with $A$ invertible. Then $X$ has density

$$p_X(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac12 (x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$

where $\Sigma = AA^\top$. So $X \sim \mathcal N(\mu, \Sigma)$.

Intuition

The standard normal's density has a quadratic exponent $-\tfrac12 z^\top z$. An invertible linear map stretches the level sets into ellipsoids and rescales the volume by $|A|$, which produces the $\Sigma^{-1}$ in the exponent and the $|\Sigma|^{1/2}$ in the normalizer.

Proof Sketch

The inverse map is $z = A^{-1}(x - \mu)$, with Jacobian $\partial z / \partial x = A^{-1}$ and $|A^{-1}| = 1/|A|$. The change-of-variables formula gives

$$p_X(x) = \varphi(A^{-1}(x-\mu)) \cdot |A^{-1}| = \frac{1}{(2\pi)^{d/2} |A|} \exp\!\left(-\tfrac12 (A^{-1}(x-\mu))^\top (A^{-1}(x-\mu))\right).$$

Now $(A^{-1}(x-\mu))^\top (A^{-1}(x-\mu)) = (x-\mu)^\top (A^{-1})^\top A^{-1} (x-\mu) = (x-\mu)^\top (AA^\top)^{-1}(x-\mu)$ using $(A^{-1})^\top A^{-1} = (AA^\top)^{-1}$. Define $\Sigma = AA^\top$, so $|\Sigma| = |A|^2$ and $|A| = |\Sigma|^{1/2}$. Substituting yields the stated form.

Why It Matters

This is the constructive definition: every multivariate normal is an affine image of $\mathcal N(0, I_d)$. To sample $X \sim \mathcal N(\mu, \Sigma)$ in code you compute the Cholesky factor $\Sigma = LL^\top$, draw $z \sim \mathcal N(0, I_d)$, and return $\mu + Lz$. The same factorization gives a clean route to log-densities: $\log |\Sigma| = 2 \sum_i \log L_{ii}$, and solving $L y = (x - \mu)$ by forward substitution gives the Mahalanobis distance without inverting $\Sigma$ explicitly.
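A minimal sketch of the sampling recipe, assuming NumPy; the helper name and the example numbers are illustrative only:

```python
import numpy as np

def sample_mvn(mu, Sigma, n, rng):
    """Draw n samples of N(mu, Sigma) as affine images mu + L z of standard normals."""
    L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T, L lower triangular
    Z = rng.standard_normal((n, len(mu)))    # rows are iid N(0, I_d)
    return mu + Z @ L.T                      # each row is mu + L z

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
X = sample_mvn(mu, Sigma, 100_000, rng)
print(X.mean(axis=0))            # close to [1, 2]
print(np.cov(X, rowvar=False))   # close to [[4, 2], [2, 3]]
```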

Failure Mode

If $A$ is singular, the affine map collapses $\mathbb R^d$ onto a lower-dimensional subspace and the density formula breaks (one cannot divide by $|A| = 0$). The distribution still exists, but lives on a flat in $\mathbb R^d$ and must be described by its characteristic function, not its Lebesgue density.

Completing the Square: The Canonical Algebraic Move

Almost every Bayesian Gaussian derivation on this site reduces to one move: take a log-density that is a quadratic in $x$ plus a linear term plus a constant, and reorganize it as $-\tfrac12 (x - m)^\top P (x - m) + \text{const}$ to read off the mean $m$ and the precision $P = \Sigma^{-1}$.

Definition

Completing the Square (Multivariate)

Suppose

$$L(x) = -\tfrac12 x^\top P x + b^\top x + c$$

with $P \in \mathbb R^{d \times d}$ symmetric positive definite, $b \in \mathbb R^d$, and $c \in \mathbb R$. Then

$$L(x) = -\tfrac12 (x - P^{-1} b)^\top P (x - P^{-1} b) + \tfrac12 b^\top P^{-1} b + c.$$

So if $\exp(L(x))$ is a probability density (up to normalization) in $x$, then $x$ is Gaussian with mean $m = P^{-1} b$ and precision $P$ (equivalently, covariance $\Sigma = P^{-1}$). The constant terms determine the normalization, which usually does not matter for identifying the distribution.

The recipe in three lines:

  1. Collect every term involving $x$ from the log-density.
  2. Match coefficients: $-\tfrac12 P$ is the coefficient of $x^\top \cdot x$ (the quadratic-form coefficient), and $b$ is the coefficient of $x$ (the linear-term vector).
  3. Read off $\Sigma = P^{-1}$ and $\mu = \Sigma b$.

The proof is one line: expand $-\tfrac12 (x - m)^\top P (x - m) = -\tfrac12 x^\top P x + m^\top P x - \tfrac12 m^\top P m$. Setting $m = P^{-1} b$ matches the linear term $b^\top x$. The constant $\tfrac12 m^\top P m = \tfrac12 b^\top P^{-1} b$ is what falls out.

Example

A worked completing-the-square

Suppose

$$L(x) = -\tfrac12 x^\top \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} x + \begin{pmatrix} 3 \\ 0 \end{pmatrix}^\top x + 7.$$

The precision is $P = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, so $P^{-1} = \tfrac13 \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, and the linear-coefficient vector is $b = (3, 0)^\top$. The mean is

$$m = P^{-1} b = \tfrac13 \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} 3 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \\ -1 \end{pmatrix}.$$

The covariance is $\Sigma = P^{-1}$. The constant $\tfrac12 b^\top P^{-1} b = \tfrac12 \cdot 3 \cdot 2 = 3$ is what survives outside the squared term.
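A quick numeric check of this worked example (NumPy assumed):

```python
import numpy as np

P = np.array([[2.0, 1.0], [1.0, 2.0]])   # precision matrix from the example
b = np.array([3.0, 0.0])                 # linear-term vector

Sigma = np.linalg.inv(P)                 # covariance = P^{-1}
m = Sigma @ b                            # mean = P^{-1} b
const = 0.5 * b @ Sigma @ b              # constant left outside the squared term

print(m)      # [ 2. -1.]
print(const)  # 3.0
```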

This move is the source of every closed-form Bayesian Gaussian update: see conjugate priors, Bayesian linear regression, and the GP posterior.

Marginals and Conditionals via the Schur Complement

Partition $X \sim \mathcal N(\mu, \Sigma)$ into two blocks of sizes $d_1$ and $d_2$:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

with $\Sigma_{21} = \Sigma_{12}^\top$.

Theorem

Gaussian Marginals and Conditionals

Statement

The marginal and conditional distributions are both Gaussian:

$$X_1 \sim \mathcal N(\mu_1, \Sigma_{11}),$$

$$X_1 \mid X_2 = x_2 \;\sim\; \mathcal N\!\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \;\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).$$

The conditional covariance $\Sigma_{11|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$.

Intuition

The conditional mean is a linear function of the observation $x_2$: it shifts $\mu_1$ by $\Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)$, where $\Sigma_{12} \Sigma_{22}^{-1}$ is the regression coefficient of $X_1$ on $X_2$. The conditional variance is smaller than the marginal variance (in the Loewner order) by exactly the amount of variance explained by $X_2$ — this is the matrix version of "$\mathrm{Var}(Y \mid X) \le \mathrm{Var}(Y)$, with equality iff $X$ and $Y$ are independent."

Proof Sketch

Use the block-inverse identity for $\Sigma^{-1}$. Writing $S = \Sigma_{11|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$,

$$\Sigma^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1} \Sigma_{21} S^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{21} S^{-1} \Sigma_{12} \Sigma_{22}^{-1} \end{pmatrix}.$$

Expand the joint log-density $-\tfrac12 (x - \mu)^\top \Sigma^{-1} (x - \mu)$ using this block form. The terms involving $x_1$ (with $x_2$ fixed) collect into

$$-\tfrac12 (x_1 - \mu_1)^\top S^{-1} (x_1 - \mu_1) + (x_1 - \mu_1)^\top S^{-1} \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) + \text{terms in } x_2 \text{ alone}.$$

Apply the completing-the-square recipe in $x_1 - \mu_1$ with $P = S^{-1}$ and $b = S^{-1} \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)$: the mean of $x_1 - \mu_1$ is $P^{-1} b = \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)$, and adding back $\mu_1$ gives the conditional mean. Combined, the conditional is $\mathcal N(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2), S)$.

The marginal of $X_1$ follows from the affine-image theorem: $X_1 = \begin{pmatrix} I_{d_1} & 0 \end{pmatrix} X$ is an affine image of a Gaussian, with mean $\mu_1$ and covariance $\Sigma_{11}$.

Why It Matters

This is the identity behind Gaussian inference. The Kalman filter's measurement update, the GP posterior, and the Bayesian linear regression predictive distribution are all special cases of this conditioning formula. The Schur complement $\Sigma_{11|2}$ is also what guarantees joint positive definiteness ($\Sigma \succ 0 \Leftrightarrow \Sigma_{22} \succ 0$ and $\Sigma_{11|2} \succ 0$) and quantifies how much $X_2$ tells you about $X_1$.
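In code, the conditioning formula is only a few lines. A minimal sketch with NumPy assumed; `condition_gaussian` and the example covariance are illustrative, not from any library:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of X1 | X2 = x2 for a joint N(mu, Sigma).

    idx1 and idx2 are integer index lists selecting the two blocks.
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # Solve against Sigma_22 rather than inverting it explicitly.
    cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)   # Schur complement
    return cond_mean, cond_cov

# Illustrative 3-d example: condition the first coordinate on the other two.
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])
m, C = condition_gaussian(mu, Sigma, [0], [1, 2], np.array([1.0, -1.0]))
print(m, C)
```

Using `solve` instead of an explicit inverse is the same numerical pattern (via a Cholesky factor of $\Sigma_{22}$) that GP and Kalman-filter implementations rely on.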

Failure Mode

The conditional formula requires $\Sigma_{22}$ invertible. If $\Sigma_{22}$ is singular, $X_2$ has linearly dependent components and the conditioning event $\{X_2 = x_2\}$ either has probability one in a direction (collapsing the conditional dimension) or contradicts the data (probability zero). The pseudo-inverse $\Sigma_{22}^+$ replaces $\Sigma_{22}^{-1}$ in the degenerate case, and the conditional lives on a coset of the null space of $\Sigma_{22}$.

MGF and Characteristic Function: The Alternative Path

The completing-the-square route is algebraic. A second route uses the moment generating function (MGF) and characteristic function (CF), which often gives shorter proofs of distributional facts and handles the degenerate case cleanly.

For $X \sim \mathcal N(\mu, \Sigma)$:

$$M_X(t) = \mathbb E[e^{t^\top X}] = \exp\!\left(t^\top \mu + \tfrac12 t^\top \Sigma t\right), \qquad \varphi_X(t) = \mathbb E[e^{i t^\top X}] = \exp\!\left(i t^\top \mu - \tfrac12 t^\top \Sigma t\right).$$

The CF is well-defined and identifies the distribution even when $\Sigma$ is only positive semi-definite. A standard derivation: by the affine-image theorem, $X = AZ + \mu$ with $\Sigma = AA^\top$, so $t^\top X = t^\top \mu + (A^\top t)^\top Z$, and the MGF factors into a product of independent univariate normal MGFs.

Three consequences fall out without integration:

  • Linear combinations: $a^\top X \sim \mathcal N(a^\top \mu, a^\top \Sigma a)$. (Apply the MGF identity to $t = sa$ and recognize a univariate normal MGF in $s$.)
  • Independence iff zero covariance: $X_1$ and $X_2$ (jointly Gaussian) are independent iff $\Sigma_{12} = 0$, because the joint CF factors iff the cross term in the quadratic form vanishes.
  • Affine maps: $AX + b \sim \mathcal N(A\mu + b, A\Sigma A^\top)$. (One line from the MGF; a Monte Carlo sanity check follows this list.)
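The affine-map consequence is easy to check by simulation. A sketch with NumPy assumed and arbitrary illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
B = rng.standard_normal((3, 3))
Sigma = B @ B.T + 0.5 * np.eye(3)             # an arbitrary positive definite covariance

A = np.array([[1.0,  2.0, 0.0],
              [0.0, -1.0, 3.0]])              # a non-square map R^3 -> R^2
b = np.array([5.0, -1.0])

L = np.linalg.cholesky(Sigma)
X = mu + rng.standard_normal((200_000, 3)) @ L.T   # samples of N(mu, Sigma)
Y = X @ A.T + b                                    # samples of A X + b

print(Y.mean(axis=0), A @ mu + b)     # empirical mean vs A mu + b
print(np.cov(Y, rowvar=False))        # compare with ...
print(A @ Sigma @ A.T)                # ... A Sigma A^T
```
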
Watch Out

Marginal Gaussianity is not joint Gaussianity

Two normal random variables can have a non-Gaussian joint distribution. The standard counterexample: let $X \sim \mathcal N(0,1)$ and define $Y = WX$ where $W$ is an independent random sign ($\pm 1$ with equal probability). Then $Y \sim \mathcal N(0,1)$ marginally, but $(X, Y)$ is supported on the union of two lines ($Y = X$ and $Y = -X$) and is not jointly Gaussian. The MGF / CF identity defines a joint Gaussian; "each marginal is normal" is strictly weaker.

Watch Out

Uncorrelated Gaussian components are not always independent

The "uncorrelated ⇒ independent" fact requires the pair to be jointly Gaussian. In the counterexample above, Cov(X,Y)=E[XY]=E[WX2]=E[W]E[X2]=0\mathrm{Cov}(X, Y) = \mathbb E[XY] = \mathbb E[W X^2] = \mathbb E[W]\mathbb E[X^2] = 0, so XX and YY are uncorrelated yet clearly not independent (X=Y|X| = |Y| always). Independence-from-zero-covariance needs the joint to be Gaussian, not just the marginals.

Sampling, Cholesky, and Computational Notes

To sample $X \sim \mathcal N(\mu, \Sigma)$ in $\mathbb R^d$:

  1. Factor $\Sigma = LL^\top$ by Cholesky (cost $O(d^3)$).
  2. Draw $Z \sim \mathcal N(0, I_d)$ ($d$ independent standard normals).
  3. Return $X = \mu + LZ$.

For evaluating densities, never invert $\Sigma$ explicitly. Solve $Lw = x - \mu$ by forward substitution ($O(d^2)$); the Mahalanobis distance is $w^\top w$. The log-determinant is $\log|\Sigma| = 2 \sum_{i=1}^d \log L_{ii}$.
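A sketch of that evaluation route (NumPy/SciPy assumed; the helper name is illustrative), checked against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import multivariate_normal

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x via a Cholesky factor, no explicit inverse."""
    L = np.linalg.cholesky(Sigma)
    w = solve_triangular(L, x - mu, lower=True)    # forward substitution: L w = x - mu
    maha = w @ w                                   # squared Mahalanobis distance
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log|Sigma| = 2 sum_i log L_ii
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
x = np.array([0.0, 0.0])
print(mvn_logpdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))   # should agree
```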

For high-dimensional problems where $\Sigma$ has special structure (low-rank plus diagonal, Toeplitz, banded), exploit that structure: Cholesky on a generic $d \times d$ matrix is $O(d^3)$, but Cholesky on a banded matrix with bandwidth $b$ is $O(d b^2)$, and Cholesky on a Kronecker-structured matrix factors across the Kronecker components.

Worked Two-Dimensional Example

Take $\mu = (1, 2)^\top$ and

$$\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}.$$

The correlation is $\rho = 2/\sqrt{4 \cdot 3} = 1/\sqrt 3 \approx 0.577$. The Cholesky factor satisfies

$$L = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt 2 \end{pmatrix}, \qquad L L^\top = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}. \quad \checkmark$$

The Schur complement of $\Sigma_{22} = 3$ in $\Sigma$ is $\Sigma_{11|2} = 4 - 2 \cdot 3^{-1} \cdot 2 = 4 - 4/3 = 8/3$. So

$$X_1 \mid X_2 = x_2 \;\sim\; \mathcal N\!\left(1 + \tfrac23 (x_2 - 2), \;\; \tfrac83\right).$$

Observing $X_2 = 5$ (three above its mean) shifts the conditional mean of $X_1$ to $1 + \tfrac23 \cdot 3 = 3$. The conditional variance drops from the marginal $4$ to $8/3 \approx 2.67$, a $33\%$ reduction — exactly the fraction of variance in $X_1$ that $X_2$ explains, given the correlation.
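The numbers in this example can be checked directly (NumPy assumed):

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])

L = np.linalg.cholesky(Sigma)
print(L)              # [[2, 0], [1, 1.4142...]]
print(L @ L.T)        # recovers Sigma

x2 = 5.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])   # 1 + (2/3) * 3
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]        # 4 - 4/3
print(cond_mean, cond_var)   # 3.0  2.666...
```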

Useful Identities

The following matrix identities show up so often that it pays to know them by sight. Each one is a one-line consequence of the structure above.

  • Affine combinations. For scalars $a, b$ and jointly Gaussian $(X, Y)$, $aX + bY \sim \mathcal N\big(a\mu_X + b\mu_Y,\; a^2 \Sigma_X + b^2 \Sigma_Y + ab\,(\mathrm{Cov}(X,Y) + \mathrm{Cov}(Y,X))\big)$; in one dimension the cross term is the familiar $2ab\,\mathrm{Cov}(X,Y)$.
  • Sum of independent Gaussians. If $X_1 \perp X_2$ are Gaussian, $X_1 + X_2 \sim \mathcal N(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2)$.
  • Conditional precision. $\Sigma_{11|2}^{-1}$ is the upper-left block of $\Sigma^{-1}$. The precision matrix encodes conditional independence: $(\Sigma^{-1})_{ij} = 0$ iff $X_i$ and $X_j$ are conditionally independent given the rest.
  • Woodbury identity for low-rank updates: $(A + UCV^\top)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V^\top A^{-1} U)^{-1} V^\top A^{-1}$. Used to invert posterior covariances when the prior covariance has simple structure.
  • Matrix determinant lemma: $|A + UCV^\top| = |A|\, |C|\, |C^{-1} + V^\top A^{-1} U|$. Used to compute the marginal likelihood in Bayesian linear regression and the log-determinant of the GP covariance. (Both this and the Woodbury identity are checked numerically in the sketch after this list.)
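A minimal numeric check of the last two identities, with NumPy assumed and arbitrary illustrative matrices (taking $V = U$, the symmetric low-rank-plus-diagonal case that arises in Bayesian regression):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=d))   # "simple structure" part (diagonal)
U = rng.standard_normal((d, k))
C = np.diag(rng.uniform(0.5, 1.5, size=k))   # k x k core of the low-rank update
M = A + U @ C @ U.T

Ainv = np.linalg.inv(A)
inner = np.linalg.inv(C) + U.T @ Ainv @ U
woodbury = Ainv - Ainv @ U @ np.linalg.inv(inner) @ U.T @ Ainv
print(np.allclose(woodbury, np.linalg.inv(M)))                  # Woodbury identity: True

det_lemma = np.linalg.det(A) * np.linalg.det(C) * np.linalg.det(inner)
print(np.isclose(det_lemma, np.linalg.det(M)))                  # determinant lemma: True
```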

Common Confusions

Watch Out

The covariance matrix is not the precision matrix

The covariance $\Sigma$ encodes pairwise variances and covariances of the components; the precision $P = \Sigma^{-1}$ encodes pairwise conditional covariances given the rest. Two coordinates can have zero covariance (independent marginally) but nonzero precision (dependent conditionally on a third), and vice versa. The precision is what controls conditional structure; the covariance is what controls marginal structure. Confusing them is the root of most graphical-model errors.
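A three-variable illustration of the distinction, assuming NumPy: a Gaussian chain has a tridiagonal precision (zeros encode conditional independence) but a dense covariance.

```python
import numpy as np

# A Gaussian chain X1 -- X2 -- X3: tridiagonal precision with a zero in the (1,3) slot.
P = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(P)
print(P[0, 2])       # 0.0  -> X1 and X3 are conditionally independent given X2
print(Sigma[0, 2])   # 0.25 -> yet X1 and X3 are marginally correlated
```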

Watch Out

Positive definite vs positive semi-definite covariance

Most textbook formulas assume $\Sigma \succ 0$ (strictly positive definite). When $\Sigma$ is only PSD, the distribution is degenerate: it concentrates on the affine subspace $\mu + \mathrm{range}(\Sigma)$. The Lebesgue density does not exist (you cannot divide by $|\Sigma| = 0$), but the characteristic function $\exp(it^\top \mu - \tfrac12 t^\top \Sigma t)$ is still well-defined and identifies the distribution. In practice this happens when you condition on too many linearly dependent observations.
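A small illustration of the degenerate case, assuming NumPy:

```python
import numpy as np

Sigma = np.array([[1.0, 1.0], [1.0, 1.0]])     # rank 1: degenerate in R^2
# np.linalg.cholesky(Sigma) raises LinAlgError and no Lebesgue density exists,
# but sampling still works as an affine image of N(0, I) via an eigendecomposition.
vals, vecs = np.linalg.eigh(Sigma)
A = vecs * np.sqrt(np.clip(vals, 0.0, None))   # columns scaled by sqrt(eigenvalues)
rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2)) @ A.T          # rows ~ N(0, Sigma)
print(X)   # every row has x1 == x2 up to round-off: the samples live on a line
```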

Watch Out

The MVN is the maximum-entropy distribution for given mean and covariance

Among all distributions on $\mathbb R^d$ with a fixed mean $\mu$ and covariance $\Sigma$, the Gaussian $\mathcal N(\mu, \Sigma)$ uniquely maximizes the differential entropy $h(p) = -\int p(x) \log p(x)\, dx$. This is why the Gaussian shows up as the default in maximum-entropy modeling, why the central limit theorem produces Gaussian limits, and why "I only know the first two moments" implicitly means Gaussian in many applied contexts. It is not because Gaussianity is empirically true; it is because Gaussianity is the least committal second-moment-constrained distribution.

Summary

  • $X \sim \mathcal N(\mu, \Sigma)$ has density $p(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp(-\tfrac12 (x - \mu)^\top \Sigma^{-1}(x - \mu))$, derived from $\mathcal N(0, I_d)$ by the affine map $X = AZ + \mu$ with $\Sigma = AA^\top$.
  • The completing-the-square recipe: any log-density of the form $-\tfrac12 x^\top P x + b^\top x + c$ is Gaussian with mean $P^{-1} b$ and covariance $P^{-1}$.
  • Marginals are Gaussian: $X_1 \sim \mathcal N(\mu_1, \Sigma_{11})$.
  • Conditionals are Gaussian: $X_1 \mid X_2 = x_2 \sim \mathcal N(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$. The conditional covariance is the Schur complement of $\Sigma_{22}$ in $\Sigma$.
  • The characteristic function $\exp(it^\top \mu - \tfrac12 t^\top \Sigma t)$ identifies the distribution even when $\Sigma$ is singular.
  • Independence equals zero covariance only when the joint is Gaussian. Marginal Gaussianity alone does not imply joint Gaussianity.

Exercises

ExerciseCore

Problem

Let $X = (X_1, X_2)^\top \sim \mathcal N(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}$. Compute the conditional distribution of $X_1$ given $X_2 = 2$.

ExerciseCore

Problem

Use the MGF to show that if $X \sim \mathcal N(\mu, \Sigma)$ and $A \in \mathbb R^{k \times d}$ is any matrix (not necessarily square or invertible), then $AX \sim \mathcal N(A\mu, A\Sigma A^\top)$.

ExerciseAdvanced

Problem

Suppose $X \sim \mathcal N(\mu, \Sigma)$ in $\mathbb R^d$ and let $Y = X + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I_d)$ independent of $X$. Derive the conditional distribution $X \mid Y = y$ by completing the square in the joint log-density. Verify your answer against the Schur-complement formula applied to the joint of $(X, Y)$.

ExerciseResearch

Problem

Let $X \sim \mathcal N(0, \Sigma)$ in $\mathbb R^d$. Show that for any orthogonal matrix $Q$, the distribution of $QX$ equals the distribution of $X$ iff $\Sigma = Q\Sigma Q^\top$. Conclude that a mean-zero $d$-dimensional Gaussian is rotation-invariant iff $\Sigma = \sigma^2 I_d$ for some $\sigma^2 \ge 0$, so $\mathcal N(0, I_d)$ is the unique such distribution with unit variance.

References

Canonical:

  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §2.3 (the Gaussian, conditional and marginal Gaussians, Bayes' theorem for Gaussian variables, maximum-likelihood for the Gaussian).
  • Hogg, R.V., McKean, J.W., & Craig, A.T. (2018). Introduction to Mathematical Statistics, 8th ed. Pearson. Ch. 3 (the multivariate normal at the level of a first graduate course).
  • Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley. Ch. 2–4 (the canonical reference for the algebraic properties).
  • Mardia, K.V., Kent, J.T., & Bibby, J.M. (1979). Multivariate Analysis. Academic Press. Ch. 3.

Current:

  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Ch. 3. (Concentration and tail bounds for Gaussian vectors in high dimensions.)
  • Wainwright, M.J. (2019). High-Dimensional Statistics. Cambridge University Press. §2.1–2.3 (sub-Gaussian framework and the Gaussian as the canonical example).
  • Murphy, K.P. (2022). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 2 (the MVN from the ML perspective).
  • Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Appendix A (the multivariate normal as the Bayesian's default noise model).

Next Topics

  • Conjugate priors: the multivariate normal is the conjugate prior for the mean of another multivariate normal; this is the identity behind every closed-form Bayesian Gaussian model.
  • Bayesian linear regression: the completing-the-square recipe in action on a regression posterior.
  • Gaussian processes for ML: the infinite-dimensional generalization, with conditioning given by the same Schur-complement formula.

Last reviewed: May 10, 2026
