Foundations
Expectation, Variance, Covariance, and Moments
Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.
Why This Matters
Figure: Same mean, different variances — variance controls spread around the expectation.
Expectation and variance are the two most computed quantities in ML. Expected loss is the population risk. Variance of gradient estimators controls SGD convergence rates. Covariance matrices define the geometry of data distributions and appear in PCA, Gaussian processes, and whitening transforms.
Core Definitions
Expectation
For a discrete random variable: $E[X] = \sum_x x\,P(X = x)$. For a continuous random variable with density $f$: $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$. The expectation exists when the sum or integral is absolutely convergent.
More generally, for a measurable function $g$: $E[g(X)] = \sum_x g(x)\,P(X = x)$ in the discrete case, or $\int g(x) f(x)\,dx$ in the continuous case.
Variance
The variance measures spread around the mean:
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2.$$
The second form (computational formula) is often easier to use. The standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$.
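As a quick numerical sanity check, both forms give the same answer (a NumPy sketch; the sample values are arbitrary, with the array treated as a uniform empirical distribution):

```python
import numpy as np

# Treat the array as an empirical distribution: each value has probability 1/5.
x = np.array([1.0, 2.0, 2.0, 5.0, 10.0])
mean = x.mean()

# Definitional form: E[(X - E[X])^2]
var_def = np.mean((x - mean) ** 2)

# Computational form: E[X^2] - (E[X])^2
var_comp = np.mean(x ** 2) - mean ** 2

print(var_def, var_comp)  # both equal 10.8
```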
Covariance
The covariance measures linear association between two random variables:
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]E[Y].$$
For a random vector $X = (X_1, \dots, X_d)$, the covariance matrix $\Sigma$ has entries $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$.
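In code, estimating a covariance matrix from samples is a one-liner (a NumPy sketch; the construction of `y` as `x` plus independent noise is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + 0.5 * rng.normal(size=10_000)  # y is correlated with x by construction

# np.cov treats each row as one variable; entry (i, j) estimates Cov(X_i, X_j)
sigma = np.cov(np.stack([x, y]))

# Diagonal entries are variances; off-diagonals are symmetric: Cov(X, Y) = Cov(Y, X).
# Here Cov(X, Y) = Var(X) = 1, since the added noise is independent of x.
print(sigma)
```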
Correlation
The Pearson correlation normalizes covariance to $[-1, 1]$:
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
$|\rho| = 1$ iff $Y$ is an affine function of $X$ almost surely. $\rho = 0$ means uncorrelated (not necessarily independent).
Moments
The $k$-th moment of $X$ is $E[X^k]$. The $k$-th central moment is $E[(X - E[X])^k]$. The third central moment (normalized) is skewness. The fourth (normalized) is kurtosis. Heavy-tailed distributions have large kurtosis.
Covariance vs Correlation
Covariance and correlation both measure linear association, but they serve different purposes. Covariance is a bilinear form that participates in algebraic computations: the variance-of-a-sum formula, the covariance matrix in PCA, the Kalman gain equation. Its magnitude depends on the scale of the variables, so a raw covariance number is meaningless without knowing the units.
Correlation normalizes away scale: $\rho(X, Y) \in [-1, 1]$ regardless of units. Use correlation for interpretation (how strong is the linear relationship?) and covariance for computation (what is the variance of a portfolio return?).
Two critical points. First, $\mathrm{Cov}(X, Y) = 0$ does not imply independence. It only rules out linear dependence. Second, correlation measures linear association only. Variables with strong nonlinear dependence can have $\rho = 0$. For broader dependence measures, see mutual information or rank correlations (Spearman, Kendall).
Key Properties
Linearity of Expectation
Statement
$$E[aX + bY] = a\,E[X] + b\,E[Y].$$
This extends to any finite sum: $E\big[\sum_{i=1}^n X_i\big] = \sum_{i=1}^n E[X_i]$.
Intuition
No independence or uncorrelatedness is required. This holds for arbitrary dependence structure. It is the single most-used property in probabilistic analysis.
Proof Sketch
For continuous random variables with joint density $f_{X,Y}$:
$$E[X + Y] = \iint (x + y)\, f_{X,Y}(x, y)\,dx\,dy.$$
Split the integral: $\iint x\, f_{X,Y}\,dx\,dy + \iint y\, f_{X,Y}\,dx\,dy$. The first integral is $E[X]$ (integrate out $y$ to get the marginal $f_X$), the second is $E[Y]$.
Why It Matters
Linearity makes expected value tractable even for complex random variables. To compute $E[X]$ where $X$ counts something complicated, decompose $X = \sum_i \mathbf{1}_{A_i}$ into indicator random variables and sum: $E[X] = \sum_i P(A_i)$. This trick solves problems in combinatorics, algorithm analysis, and randomized methods where computing the joint distribution would be intractable.
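A concrete instance of the indicator trick is the classic fixed-points-of-a-random-permutation problem: position $i$ is a fixed point with probability $1/n$, so linearity gives an expected count of $n \cdot (1/n) = 1$ for every $n$, even though the indicators are dependent. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 100_000

# Count fixed points of a uniform random permutation: positions i with pi(i) = i.
fixed_counts = [(rng.permutation(n) == np.arange(n)).sum() for _ in range(trials)]

# By linearity, E[count] = sum_i P(pi(i) = i) = n * (1/n) = 1, independent of n.
avg = np.mean(fixed_counts)
print(avg)  # ≈ 1
```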
Failure Mode
Linearity does not hold for variance, entropy, or other nonlinear functionals of distributions. $\mathrm{Var}(X + Y) \neq \mathrm{Var}(X) + \mathrm{Var}(Y)$ in general.
Variance scaling: $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$. Adding a constant shifts the mean but does not change spread. Scaling by $a$ scales variance by $a^2$.
Covariance bilinearity: $\mathrm{Cov}(aX + bY, Z) = a\,\mathrm{Cov}(X, Z) + b\,\mathrm{Cov}(Y, Z)$. Covariance is bilinear, making it an inner product on the space of zero-mean, finite-variance random variables.
Law of Total Variance
Law of Total Variance (Eve's Law)
Statement
$$\mathrm{Var}(Y) = E\big[\mathrm{Var}(Y \mid X)\big] + \mathrm{Var}\big(E[Y \mid X]\big).$$
Intuition
Total variance decomposes into two sources. $E[\mathrm{Var}(Y \mid X)]$ is the average variance within each level of $X$ (unexplained variance). $\mathrm{Var}(E[Y \mid X])$ is the variance of the conditional mean across levels of $X$ (explained variance). If knowing $X$ perfectly predicts $Y$, the first term is zero. If $X$ is useless, the second term is zero.
Proof Sketch
Start from $\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2$. Apply the law of total expectation to $E[Y^2]$. Write $E[Y^2] = E\big[E[Y^2 \mid X]\big] = E\big[\mathrm{Var}(Y \mid X) + (E[Y \mid X])^2\big]$. Substitute and regroup to get $\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + E\big[(E[Y \mid X])^2\big] - \big(E[E[Y \mid X]]\big)^2$. The last two terms equal $\mathrm{Var}(E[Y \mid X])$.
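The decomposition can also be checked by simulation. A sketch under an assumed hierarchical model $X \sim N(0, 1)$, $Y \mid X \sim N(2X, 0.5^2)$, so the within term is $0.25$ and the between term is $\mathrm{Var}(2X) = 4$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.normal(size=n)
y = 2 * x + 0.5 * rng.normal(size=n)  # Y | X ~ N(2X, 0.25)

total = y.var()          # Var(Y), should be ≈ 4.25
within = 0.25            # E[Var(Y|X)]: the conditional variance is constant here
between = (2 * x).var()  # Var(E[Y|X]) = Var(2X) ≈ 4

print(total, within + between)  # the two sides agree
```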
Why It Matters
This decomposition is the theoretical basis for ANOVA, the bias-variance decomposition, and hierarchical models. In random effects models, it separates within-group and between-group variation.
Failure Mode
Requires $E[Y^2] < \infty$. The conditional variance $\mathrm{Var}(Y \mid X)$ is itself a random variable (a function of $X$), not a number.
Chebyshev's Inequality
Chebyshev's Inequality
Statement
$$P\big(|X - E[X]| \geq t\big) \leq \frac{\mathrm{Var}(X)}{t^2} \quad \text{for all } t > 0.$$
Equivalently, $P(|X - \mu| \geq k\sigma) \leq 1/k^2$ for $k > 0$.
Intuition
A random variable with small variance cannot deviate far from its mean with high probability. The bound is distribution-free: it holds for any distribution with finite variance.
Proof Sketch
Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$:
$$P\big((X - \mu)^2 \geq t^2\big) \leq \frac{E[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}.$$
Since $(X - \mu)^2 \geq t^2$ iff $|X - \mu| \geq t$, the result follows.
Why It Matters
Chebyshev is the simplest concentration inequality. It proves the weak law of large numbers in two lines: apply Chebyshev to $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ with $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, getting $P(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2/(n\epsilon^2) \to 0$.
Failure Mode
The bound is loose for specific distributions. For Gaussians, $P(|X - \mu| \geq 3\sigma) \approx 0.0027$, while Chebyshev gives only $1/9 \approx 0.11$. Tighter bounds require distributional assumptions; see concentration inequalities for Hoeffding, Bernstein, and sub-Gaussian bounds.
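A simulation sketch comparing the Chebyshev bound to actual Gaussian tail mass (standard normal, so $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)  # standard normal: mu = 0, sigma = 1

for k in (2, 3, 4):
    empirical = np.mean(np.abs(x) >= k)  # actual tail probability P(|X| >= k*sigma)
    bound = 1 / k**2                     # distribution-free Chebyshev bound
    assert empirical <= bound            # the bound always holds, but loosely
    print(k, empirical, bound)
```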
Higher Moments and Moment Generating Functions
The $k$-th moment $E[X^k]$ and the $k$-th central moment $E[(X - E[X])^k]$ capture progressively finer distributional information.
Skewness (third standardized central moment): $\gamma_1 = E[(X - \mu)^3]/\sigma^3$. Positive skewness indicates a right tail heavier than the left. Zero for any symmetric distribution.
Kurtosis (fourth standardized central moment): $\kappa = E[(X - \mu)^4]/\sigma^4$. The Gaussian has $\kappa = 3$. Excess kurtosis $\kappa - 3$ measures tail heaviness relative to the Gaussian. Heavy-tailed distributions (relevant to financial returns, gradient noise) have large excess kurtosis.
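Both quantities are easy to estimate from samples. A NumPy sketch; the exponential distribution is chosen here as a standard skewed, heavier-tailed example (true skewness 2, true kurtosis 9):

```python
import numpy as np

def standardized_moment(x, k):
    """Estimate the k-th standardized central moment E[((X - mu)/sigma)^k]."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** k)

rng = np.random.default_rng(4)
gauss = rng.normal(size=1_000_000)
expo = rng.exponential(size=1_000_000)

print(standardized_moment(gauss, 3), standardized_moment(gauss, 4))  # ≈ 0 and ≈ 3
print(standardized_moment(expo, 3), standardized_moment(expo, 4))    # ≈ 2 and ≈ 9
```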
Moment generating function (MGF): $M_X(t) = E[e^{tX}]$, defined for $t$ in a neighborhood of zero. When it exists, the MGF uniquely determines the distribution. Its utility: $E[X^k] = M_X^{(k)}(0)$, so all moments are encoded in one function. The MGF of a sum of independent random variables is the product of their MGFs, which is the standard tool for proving the central limit theorem. For distributions where the MGF does not exist (e.g., Cauchy, log-normal), use the characteristic function $\varphi_X(t) = E[e^{itX}]$ instead, which always exists.
Main Theorems
Variance of a Sum
Statement
$$\mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j).$$
If $X_1, \dots, X_n$ are pairwise uncorrelated, this reduces to $\mathrm{Var}\big(\sum_i X_i\big) = \sum_i \mathrm{Var}(X_i)$.
Intuition
Variance of a sum depends on both individual variances and how the variables co-vary. Positive correlations inflate the total variance; negative correlations reduce it.
Proof Sketch
Let $S = \sum_i X_i$. Then $\mathrm{Var}(S) = E\big[\big(\sum_i (X_i - E[X_i])\big)^2\big]$. Expanding the square gives $\sum_i \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j)$ by linearity of expectation.
Why It Matters
For i.i.d. random variables, $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$. This is why averaging reduces noise and is the basis for the $1/\sqrt{n}$ convergence rate in the central limit theorem. In SGD, minibatch averaging reduces gradient variance by a factor of the batch size.
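The $\sigma^2/n$ scaling is easy to see empirically. A sketch with i.i.d. Gaussian samples standing in for per-example gradient noise (the choice $\sigma^2 = 4$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0  # per-sample variance

for n in (1, 10, 100):
    # 100k "minibatches" of size n; each row is averaged down to one mean
    means = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n)).mean(axis=1)
    print(n, means.var(), sigma2 / n)  # empirical variance tracks sigma^2 / n
```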
Failure Mode
Requires finite second moments. For heavy-tailed distributions (e.g., Cauchy; see common probability distributions), variance is infinite and this formula is meaningless. Pairwise uncorrelated does not imply independent: the simplification holds under the weaker pairwise uncorrelated condition, but other properties (e.g., concentration inequalities) may need full independence.
Common Confusions
Uncorrelated does not imply independent
Let $X \sim \mathrm{Uniform}(-1, 1)$ and $Y = X^2$. Then $\mathrm{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0$ by symmetry, so $X$ and $Y$ are uncorrelated. But $Y$ is a deterministic function of $X$, so they are maximally dependent.
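A numerical check of this kind of example (a sketch with $X$ uniform on $[-1, 1]$ and $Y = X^2$): the sample covariance is near zero even though $Y$ is completely determined by $X$.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2  # completely determined by x

# E[XY] = E[X^3] = 0 and E[X] = 0 by symmetry, so Cov(X, Y) = 0
cov = np.cov(x, y)[0, 1]
print(cov)  # ≈ 0, yet y is maximally dependent on x
```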
E[XY] = E[X]E[Y] requires independence (or uncorrelatedness)
The factorization $E[XY] = E[X]\,E[Y]$ holds when $X, Y$ are uncorrelated (equivalently, $\mathrm{Cov}(X, Y) = 0$). Independence implies uncorrelatedness, but not vice versa. For nonlinear functions: $E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]$ for all $g, h$ requires independence, not just uncorrelatedness.
Variance is not linear
$\mathrm{Var}(X + Y) \neq \mathrm{Var}(X) + \mathrm{Var}(Y)$ unless $X$ and $Y$ are uncorrelated. The cross-term $2\,\mathrm{Cov}(X, Y)$ is often forgotten.
Exercises
Problem
Let $X_1, \dots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Compute $E[\bar{X}_n]$ and $\mathrm{Var}(\bar{X}_n)$ where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
Problem
Let $X$ and $Y$ have finite second moments. Prove that $\mathrm{Cov}(X, Y)^2 \leq \mathrm{Var}(X)\,\mathrm{Var}(Y)$, i.e., $|\rho(X, Y)| \leq 1$.
References
Canonical:
- Grimmett & Stirzaker, Probability and Random Processes (2020), Chapter 3
- Casella & Berger, Statistical Inference (2002), Chapter 2
- Billingsley, Probability and Measure (1995), Chapters 5 and 21 (expectation, moments, MGFs)
- Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapter XV (moments, characteristic functions)
For ML context:
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
- Blitzstein & Hwang, Introduction to Probability (2019), Chapters 4 and 7 (expectation, joint distributions, covariance)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Batch Normalization (Layer 2)
- Bellman Equations (Layer 2)
- Bias-Variance Tradeoff (Layer 2)
- Concentration Inequalities (Layer 1)
- Fat Tails and Heavy-Tailed Distributions (Layer 2)
- Moment Generating Functions (Layer 0A)
- Skewness, Kurtosis, and Higher Moments (Layer 1)
- Survey Sampling Methods (Layer 2)