Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Foundations

Expectation, Variance, Covariance, and Moments

Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.

Core · Tier 1 · Stable · ~55 min

Why This Matters

Same mean, different variances: variance controls spread around the expectation

Figure: three densities with $\mu = 0$ and $\sigma \in \{0.5, 1.0, 2.0\}$ (low, unit, and high variance), illustrating $\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$ and Chebyshev's bound $P(|X - \mu| \geq t) \leq \text{Var}(X)/t^2$.

Expectation and variance are the two most computed quantities in ML. Expected loss is the population risk. Variance of gradient estimators controls SGD convergence rates. Covariance matrices define the geometry of data distributions and appear in PCA, Gaussian processes, and whitening transforms.

Core Definitions

Definition

Expectation

For a discrete random variable: $\mathbb{E}[X] = \sum_x x \, P(X = x)$. For a continuous random variable with density $f$: $\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$. The expectation exists when the sum or integral is absolutely convergent.

More generally, for a measurable function $g$: $\mathbb{E}[g(X)] = \int g(x) \, dP_X(x)$.

Definition

Variance

The variance measures spread around the mean:

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The second form (the computational formula) is often easier to use. The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.
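The two formulas can be compared numerically. This is a minimal sketch; the exponential distribution, seed, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # true Var(X) = scale^2 = 4

var_def = np.mean((x - x.mean()) ** 2)        # definitional: E[(X - E[X])^2]
var_comp = np.mean(x ** 2) - x.mean() ** 2    # computational: E[X^2] - (E[X])^2

# The two forms agree up to floating-point error; both estimate Var(X) = 4
assert abs(var_def - var_comp) < 1e-8
assert abs(var_def - 4.0) < 0.2
```

One caveat worth knowing: the computational form subtracts two large, nearly equal numbers when the mean dwarfs the standard deviation, so numerically robust code (e.g., streaming variance) typically prefers the definitional form or Welford's update.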

Definition

Covariance

The covariance measures linear association between two random variables:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

$\text{Cov}(X, X) = \text{Var}(X)$. For a random vector $X \in \mathbb{R}^d$, the covariance matrix has entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$.
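A quick numerical sketch of a covariance matrix estimate; the linear model for `y` and the sample size are illustrative assumptions, and `np.cov` expects rows to be variables by default:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)   # Cov(X, Y) = 0.5 * Var(X) = 0.5

sigma = np.cov(np.stack([x, y]))              # rows = variables, columns = observations
assert np.allclose(sigma, sigma.T)            # covariance matrices are symmetric
assert abs(sigma[0, 1] - 0.5) < 0.02          # Sigma_01 = Cov(X, Y)
assert abs(sigma[1, 1] - 0.5) < 0.02          # Var(Y) = 0.25 + 0.25 = 0.5
```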

Definition

Correlation

The Pearson correlation normalizes covariance to $[-1, 1]$:

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \, \text{Var}(Y)}}$$

$|\rho| = 1$ iff $Y$ is an affine function of $X$ almost surely. $\rho = 0$ means uncorrelated (not necessarily independent).

Definition

Moments

The $k$-th moment of $X$ is $\mathbb{E}[X^k]$. The $k$-th central moment is $\mathbb{E}[(X - \mathbb{E}[X])^k]$. The third central moment (normalized) is skewness. The fourth (normalized) is kurtosis. Heavy-tailed distributions have large kurtosis.

Covariance vs Correlation

Covariance and correlation both measure linear association, but they serve different purposes. Covariance is a bilinear form that participates in algebraic computations: the variance of a sum formula, the covariance matrix in PCA, the Kalman gain equation. Its magnitude depends on the scale of the variables, so $\text{Cov}(X, Y) = 500$ is meaningless without knowing the units.

Correlation normalizes away scale: $\rho(X, Y) \in [-1, 1]$ regardless of units. Use correlation for interpretation (how strong is the linear relationship?) and covariance for computation (what is the variance of a portfolio return?).

Two critical points. First, $\rho = 0$ does not imply independence. It only rules out linear dependence. Second, correlation measures linear association only. Variables with strong nonlinear dependence can have $\rho = 0$. For broader dependence measures, see mutual information or rank correlations (Spearman, Kendall).
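The scale contrast can be checked directly. In this sketch, the rescaling factor of 100 (think meters to centimeters) and the noise model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)

cov_a = np.cov(x, y)[0, 1]
cov_b = np.cov(100 * x, y)[0, 1]       # rescaling x scales the covariance by 100
rho_a = np.corrcoef(x, y)[0, 1]
rho_b = np.corrcoef(100 * x, y)[0, 1]  # ...but leaves the correlation unchanged

assert abs(cov_b - 100 * cov_a) < 1e-6
assert abs(rho_b - rho_a) < 1e-9
```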

Key Properties

Theorem

Linearity of Expectation

Statement

$$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$

This extends to any finite sum: $\mathbb{E}\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i \mathbb{E}[X_i]$.

Intuition

No independence or uncorrelatedness is required. This holds for arbitrary dependence structure. It is the single most-used property in probabilistic analysis.

Proof Sketch

For continuous random variables with joint density fX,Yf_{X,Y}:

$$\mathbb{E}[aX + bY] = \iint (ax + by) \, f_{X,Y}(x,y) \, dx \, dy$$

Split the integral: $a \iint x \, f_{X,Y}(x,y) \, dx \, dy + b \iint y \, f_{X,Y}(x,y) \, dx \, dy$. The first integral is $a\mathbb{E}[X]$ (integrate out $y$ to get the marginal), the second is $b\mathbb{E}[Y]$.

Why It Matters

Linearity makes expected value tractable even for complex random variables. To compute $\mathbb{E}[X]$ where $X$ counts something complicated, decompose $X = \sum_i X_i$ into indicator random variables and sum $\mathbb{E}[X_i]$. This trick solves problems in combinatorics, algorithm analysis, and randomized methods where computing the joint distribution would be intractable.
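A classic instance of the indicator trick, sketched by simulation (permutation size and trial count are arbitrary): the number of fixed points of a uniform random permutation of $\{0, \ldots, n-1\}$ is $X = \sum_i \mathbf{1}\{\pi(i) = i\}$, and each indicator has mean $1/n$, so $\mathbb{E}[X] = n \cdot (1/n) = 1$ by linearity, even though the indicators are dependent:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 100_000

# Count fixed points (positions where perm[i] == i) over many random permutations
fixed = np.array([(rng.permutation(n) == np.arange(n)).sum() for _ in range(trials)])

# Linearity predicts E[X] = 1 regardless of n; the joint distribution is never needed
assert abs(fixed.mean() - 1.0) < 0.03
```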

Failure Mode

Linearity does not hold for variance, entropy, or other nonlinear functionals of distributions. $\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y)$ in general.

Variance scaling: $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$. Adding a constant shifts the mean but does not change spread. Scaling by $a$ scales variance by $a^2$.

Covariance bilinearity: $\text{Cov}(aX + bY, Z) = a \, \text{Cov}(X, Z) + b \, \text{Cov}(Y, Z)$. Covariance is bilinear, making it an inner product on the space of zero-mean, finite-variance random variables.
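Both identities can be sanity-checked numerically. Sample covariance is itself bilinear, so they hold to floating-point precision; the constants and the construction of `z` below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
x, y, z = rng.normal(size=(3, 100_000))
z = z + 0.3 * x - 0.2 * y           # give Z some correlation with X and Y

a, b = 2.0, -3.0
lhs = np.cov(a * x + b * y, z)[0, 1]
rhs = a * np.cov(x, z)[0, 1] + b * np.cov(y, z)[0, 1]
assert abs(lhs - rhs) < 1e-10       # Cov(aX + bY, Z) = a Cov(X,Z) + b Cov(Y,Z)

# Var(aX + b) = a^2 Var(X): the shift by +5 drops out entirely
assert abs(np.var(a * x + 5.0) - a ** 2 * np.var(x)) < 1e-10
```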

Law of Total Variance

Theorem

Law of Total Variance (Eve's Law)

Statement

$$\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])$$

Intuition

Total variance decomposes into two sources. $\mathbb{E}[\text{Var}(X \mid Y)]$ is the average variance within each level of $Y$ (unexplained variance). $\text{Var}(\mathbb{E}[X \mid Y])$ is the variance of the conditional mean across levels of $Y$ (explained variance). If knowing $Y$ perfectly predicts $X$, the first term is zero. If $Y$ is useless, the second term is zero.

Proof Sketch

Start from $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. Apply the law of total expectation to get $\mathbb{E}[X^2] = \mathbb{E}[\mathbb{E}[X^2 \mid Y]]$. Write $\mathbb{E}[X^2 \mid Y] = \text{Var}(X \mid Y) + (\mathbb{E}[X \mid Y])^2$. Substitute and regroup to get $\mathbb{E}[\text{Var}(X \mid Y)] + \mathbb{E}[(\mathbb{E}[X \mid Y])^2] - (\mathbb{E}[\mathbb{E}[X \mid Y]])^2$. The last two terms equal $\text{Var}(\mathbb{E}[X \mid Y])$.

Why It Matters

This decomposition is the theoretical basis for ANOVA, the bias-variance decomposition, and hierarchical models. In random effects models, it separates within-group and between-group variation.
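The within/between split can be seen on a two-group mixture. This sketch uses illustrative mixture parameters; with empirical group weights, within-variance plus between-variance equals the total sample variance as an algebraic identity (the ANOVA identity):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
y = rng.integers(0, 2, size=n)                         # group label Y ~ fair coin
x = np.where(y == 0, rng.normal(0.0, 1.0, n), rng.normal(3.0, 2.0, n))

p0 = (y == 0).mean()
p1 = 1.0 - p0
within = p0 * x[y == 0].var() + p1 * x[y == 1].var()   # E[Var(X | Y)]
m = x.mean()
between = p0 * (x[y == 0].mean() - m) ** 2 + p1 * (x[y == 1].mean() - m) ** 2

# Var(X) = E[Var(X|Y)] + Var(E[X|Y]), exact for the sample decomposition
assert abs(x.var() - (within + between)) < 1e-8
```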

Failure Mode

Requires $\text{Var}(X) < \infty$. The conditional variance $\text{Var}(X \mid Y)$ is itself a random variable (a function of $Y$), not a number.

Chebyshev's Inequality

Theorem

Chebyshev's Inequality

Statement

$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$

Equivalently, $P(|X - \mu| \geq k\sigma) \leq 1/k^2$ for $k > 0$.

Intuition

A random variable with small variance cannot deviate far from its mean with high probability. The bound is distribution-free: it holds for any distribution with finite variance.

Proof Sketch

Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$:

$$P((X - \mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$

Since $(X - \mu)^2 \geq t^2$ iff $|X - \mu| \geq t$, the result follows.

Why It Matters

Chebyshev is the simplest concentration inequality. It proves the weak law of large numbers in two lines: apply Chebyshev to $\bar{X}_n$ with $\text{Var}(\bar{X}_n) = \sigma^2/n$, getting $P(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2/(n\epsilon^2) \to 0$.

Failure Mode

The bound is loose for specific distributions. For Gaussians, $P(|X - \mu| \geq 2\sigma) \approx 0.046$, while Chebyshev gives $\leq 0.25$. Tighter bounds require distributional assumptions; see concentration inequalities for Hoeffding, Bernstein, and sub-Gaussian bounds.
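The gap between the bound and the true Gaussian tail shows up immediately in simulation (sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)          # mu = 0, sigma = 1

for k in (2.0, 3.0):
    tail = np.mean(np.abs(x) >= k)      # empirical P(|X - mu| >= k*sigma)
    assert tail <= 1.0 / k ** 2         # Chebyshev's bound always holds...

# ...but is loose: at k = 2 the true Gaussian tail is ~0.046 vs Chebyshev's 0.25
assert np.mean(np.abs(x) >= 2.0) < 0.06
```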

Higher Moments and Moment Generating Functions

The $k$-th moment $\mathbb{E}[X^k]$ and the $k$-th central moment $\mathbb{E}[(X - \mu)^k]$ capture progressively finer distributional information.

Skewness (third standardized central moment): $\gamma_1 = \mathbb{E}[(X - \mu)^3] / \sigma^3$. Positive skewness indicates a right tail heavier than the left. Zero for any symmetric distribution.

Kurtosis (fourth standardized central moment): $\gamma_2 = \mathbb{E}[(X - \mu)^4] / \sigma^4$. The Gaussian has $\gamma_2 = 3$. Excess kurtosis $\gamma_2 - 3$ measures tail heaviness relative to the Gaussian. Heavy-tailed distributions (relevant to financial returns, gradient noise) have large excess kurtosis.

Moment generating function (MGF): $M_X(t) = \mathbb{E}[e^{tX}]$, defined for $t$ in a neighborhood of zero. When it exists, the MGF uniquely determines the distribution. Its utility: $M_X^{(k)}(0) = \mathbb{E}[X^k]$, so all moments are encoded in one function. The MGF of a sum of independent random variables is the product of their MGFs, a standard tool for proving versions of the central limit theorem. For distributions where the MGF does not exist (e.g., Cauchy, log-normal), use the characteristic function $\phi_X(t) = \mathbb{E}[e^{itX}]$ instead, which always exists and underlies the fully general CLT proof.
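The stated moment values can be verified by simulation; the sample sizes and the choice of Student-t with 5 degrees of freedom (excess kurtosis $6/(\nu - 4) = 6$) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=2_000_000)

assert abs(np.mean(x ** 3)) < 0.02            # standard normal: E[X^3] = 0 (no skew)
assert abs(np.mean(x ** 4) - 3.0) < 0.05      # E[X^4] = 3 (kurtosis 3, excess 0)

# Student-t with 5 degrees of freedom is heavy-tailed: excess kurtosis 6/(5 - 4) = 6
t = rng.standard_t(df=5, size=2_000_000)
t_kurt = np.mean(((t - t.mean()) / t.std()) ** 4)
assert t_kurt > 4.0                           # visibly heavier-tailed than the Gaussian
```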

Main Theorems

Theorem

Variance of a Sum

Statement

$$\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2 \sum_{i < j} \text{Cov}(X_i, X_j)$$

If $X_1, \ldots, X_n$ are pairwise uncorrelated, this reduces to $\text{Var}\left(\sum_i X_i\right) = \sum_i \text{Var}(X_i)$.

Intuition

Variance of a sum depends on both individual variances and how the variables co-vary. Positive correlations inflate the total variance; negative correlations reduce it.

Proof Sketch

Let $S = \sum_i X_i$. Then $\text{Var}(S) = \mathbb{E}[(S - \mathbb{E}[S])^2] = \mathbb{E}\left[\left(\sum_i (X_i - \mu_i)\right)^2\right]$. Expanding the square gives $\sum_i \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$ by linearity of expectation.

Why It Matters

For i.i.d. random variables, $\text{Var}(\bar{X}) = \text{Var}(X)/n$. This is why averaging reduces noise and is the basis for the $1/\sqrt{n}$ convergence rate in the central limit theorem. In SGD, minibatch averaging reduces gradient variance by a factor of the batch size.
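The $1/n$ variance reduction is easy to observe. In this sketch the "batch size" of 64, the exponential distribution (with $\text{Var}(X) = 1$), and the number of repetitions are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 64, 100_000   # batch size 64, many independent batches

# Each row is one batch of i.i.d. samples; the row mean is one minibatch average
batch_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Var(X-bar) = Var(X) / n = 1/64
assert abs(batch_means.var() - 1.0 / n) < 0.002
```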

Failure Mode

Requires finite second moments. For heavy-tailed distributions (e.g., Cauchy; see common probability distributions), the variance is infinite and this formula is meaningless. Pairwise uncorrelated does not imply independent: the simplification $\text{Var}\left(\sum_i X_i\right) = \sum_i \text{Var}(X_i)$ holds under the weaker pairwise-uncorrelated condition, but other properties (e.g., concentration inequalities) may need full independence.

Common Confusions

Watch Out

Uncorrelated does not imply independent

Let $X \sim \text{Uniform}(\{-1, 0, 1\})$ and $Y = X^2$. Then $\text{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0 - 0 = 0$, so $X$ and $Y$ are uncorrelated. But $Y$ is a deterministic function of $X$, so they are maximally dependent.
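Since the distribution has only three outcomes, the covariance can be computed exactly by enumeration rather than simulation:

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])   # support of X
p = np.full(3, 1.0 / 3.0)         # uniform probabilities
ys = xs ** 2                      # Y = X^2

e_x, e_y = (p * xs).sum(), (p * ys).sum()   # E[X] = 0, E[Y] = 2/3
e_xy = (p * xs * ys).sum()                  # E[XY] = E[X^3] = 0
assert abs(e_xy - e_x * e_y) < 1e-15        # Cov(X, Y) = 0: uncorrelated

# Yet Y is determined by X: P(Y = 1 | X = 1) = 1, while P(Y = 1) = 2/3
assert e_y != 1.0
```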

Watch Out

E[XY] = E[X]E[Y] requires independence (or uncorrelatedness)

The factorization $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ holds when $X, Y$ are uncorrelated (equivalently, $\text{Cov}(X,Y) = 0$). Independence implies uncorrelatedness, but not vice versa. For nonlinear functions: $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\mathbb{E}[h(Y)]$ requires independence, not just uncorrelatedness.

Watch Out

Variance is not linear

$\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y)$ unless $X$ and $Y$ are uncorrelated. The cross-term $2\,\text{Cov}(X, Y)$ is often forgotten.

Exercises

ExerciseCore

Problem

Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Compute $\mathbb{E}[\bar{X}]$ and $\text{Var}(\bar{X})$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$.

ExerciseAdvanced

Problem

Let $X$ and $Y$ have finite second moments. Prove that $|\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X)\,\text{Var}(Y)}$, i.e., $|\rho(X,Y)| \leq 1$.

References

Canonical:

  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapter 3
  • Casella & Berger, Statistical Inference (2002), Chapter 2
  • Billingsley, Probability and Measure (1995), Chapters 5 and 21 (expectation, moments, MGFs)
  • Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapter XV (moments, characteristic functions)

For ML context:

  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
  • Blitzstein & Hwang, Introduction to Probability (2019), Chapters 4 and 7 (expectation, joint distributions, covariance)

Last reviewed: April 2026
