
Foundations

Random Variables

Random variables as measurement rules, their distributions, expectation and variance, and the rigorous measurable-map definition used in probability theory.


Why This Matters

If you are new to probability, read a random variable as a measurement whose value is unknown before the experiment happens. Roll a die and record the face. Flip ten coins and record the number of heads. Draw a dataset and record the validation accuracy of a model trained on it. Each recorded quantity is a random variable.

That beginner sentence is useful, but it hides one important detail: the random variable is the rule, not the value you happened to observe. The die roll can land on $4$, but the random variable is the rule "return the face showing on the die." The next roll may produce a different value; the rule is the same.

This matters in ML because almost every performance number is a random variable: training loss depends on the sampled dataset, validation accuracy depends on the split, a stochastic gradient depends on the minibatch, and a model output can depend on noisy inputs. Once that is clear, expectation means "average this quantity over repeated draws," and variance means "how much this quantity changes across draws."

Quick Version, No Measure Theory Yet

You can understand most first-course probability with this four-part picture:

Object | Plain meaning | Example
Experiment | The random thing that happens | Roll a die
Outcome | One realized result | The die shows $4$
Random variable | A rule that turns an outcome into a number | $X =$ die face
Distribution | The probabilities of the possible output values | $\mathbb P(X=k)=1/6$ for $k=1,\ldots,6$

The notation $X$ does two jobs. Before the experiment, $X$ is an unknown quantity with a distribution. After the experiment, you observe one value such as $X=4$. Most confusion comes from mixing up those two viewpoints.

Example

Three random variables from the same die roll

Roll one fair die.

  • $X =$ the face value, so $X\in\{1,2,3,4,5,6\}$.
  • $Y =$ whether the roll is even, coded as $1$ for even and $0$ for odd.
  • $Z=(X-3.5)^2$, the squared distance from the middle.

The same outcome feeds three different rules. If the die shows $4$, then $X=4$, $Y=1$, and $Z=0.25$. The rules are fixed before the roll; only the realized outcome changes.
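The three rules can be written as ordinary functions. A minimal Python sketch (the function names `X`, `Y`, `Z` simply mirror the notation above):

```python
import random

random.seed(0)

# The rules are fixed before the roll; only the sampled outcome changes.
def X(face):
    """The face value itself."""
    return face

def Y(face):
    """1 if the roll is even, 0 if odd."""
    return 1 if face % 2 == 0 else 0

def Z(face):
    """Squared distance from the middle of the die."""
    return (face - 3.5) ** 2

face = random.randint(1, 6)        # one realized outcome omega
print(face, X(face), Y(face), Z(face))

# The same outcome feeds all three rules; e.g. the outcome 4:
assert X(4) == 4 and Y(4) == 1 and Z(4) == 0.25
```

Resampling `face` changes the realized values but never the three functions, which is exactly the rule-versus-value distinction.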

For many applications, this is enough: define the quantity you care about, write down or estimate its distribution, then compute probabilities, expectations, variances, or tail bounds. The rigorous definition below exists to make that same idea work on infinite spaces, continuous variables, random vectors, stochastic processes, and conditional information.

Formal Definition

Now translate the friendly version into probability theory. The experiment is modeled by a probability space $(\Omega,\mathcal F,\mathbb P)$. The set $\Omega$ contains outcomes, $\mathcal F$ contains the events whose probabilities are defined, and $\mathbb P$ assigns those probabilities. A real-valued random variable is a number-producing rule on that space.

Definition

Random Variable

Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A real-valued random variable is a measurable function

$$X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R)),$$

meaning that for every Borel set $B\in\mathcal B(\mathbb R)$,

$$X^{-1}(B)=\{\omega\in\Omega:X(\omega)\in B\}\in\mathcal F.$$

The measurability condition says every numerical event about $X$ is an event whose probability is defined.

For example, the event $\{X\le t\}$ is shorthand for $\{\omega:X(\omega)\le t\}$. If $X$ were not measurable, this set might not belong to $\mathcal F$, so $\mathbb P(X\le t)$ would not be meaningful.

The probability measure is not needed to say that $X$ is measurable; it is needed to assign probabilities to the pulled-back events. More generally, a random element is a measurable map from $(\Omega,\mathcal F)$ into another measurable space $(S,\mathcal S)$. Real-valued random variables use $(S,\mathcal S)=(\mathbb R,\mathcal B(\mathbb R))$. Random vectors use $(\mathbb R^d,\mathcal B(\mathbb R^d))$.

Definition

Distribution or Law

The distribution or law of $X$ is the probability measure $P_X$ on $(\mathbb R,\mathcal B(\mathbb R))$ defined by

$$P_X(B)=\mathbb P(X\in B)=\mathbb P(X^{-1}(B)).$$

This is also called the pushforward measure of $\mathbb P$ through $X$, written $P_X=X_\#\mathbb P$.

The random variable and its distribution are different objects. The random variable is a function on outcomes. The distribution is the probability mass or density seen on the output line after applying that function.

Example

Finite sample space: the map and the law are different

Let $\Omega=\{a,b,c,d\}$, let $\mathcal F=2^\Omega$, and assign probabilities $1/10,\,2/10,\,3/10,\,4/10$. Define

$$X(a)=0,\qquad X(b)=1,\qquad X(c)=1,\qquad X(d)=3.$$

The function $X$ has four inputs. Its law has three output values:

  • $P_X(\{0\})=1/10$
  • $P_X(\{1\})=2/10+3/10=5/10$
  • $P_X(\{3\})=4/10$

For the output event $B=\{1,3\}$, the preimage is $X^{-1}(B)=\{b,c,d\}$, so $P_X(B)=9/10$. This is the core move: questions about values are answered by pulling them back to events.
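The example can be checked mechanically. A small Python sketch of the pushforward computation, using exact fractions to match the numbers above:

```python
from collections import defaultdict
from fractions import Fraction

# The finite example: four outcomes with explicit probabilities, and the map X.
P = {"a": Fraction(1, 10), "b": Fraction(2, 10),
     "c": Fraction(3, 10), "d": Fraction(4, 10)}
X = {"a": 0, "b": 1, "c": 1, "d": 3}

# Push P through X: accumulate the probability of every outcome landing on each value.
law = defaultdict(Fraction)
for omega, p in P.items():
    law[X[omega]] += p

def prob_in(B):
    """P_X(B) computed by pulling B back to the outcomes that land in it."""
    return sum(p for omega, p in P.items() if X[omega] in B)

assert law[1] == Fraction(5, 10)          # mass from b and c merges at the value 1
assert prob_in({1, 3}) == Fraction(9, 10) # preimage {b, c, d}
assert sum(law.values()) == 1             # the law is a probability measure
```

The merge of `b` and `c` at the value `1` is exactly why the law has three atoms while the function has four inputs.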

Definition

Cumulative Distribution Function

For a real-valued random variable $X$, the cumulative distribution function is

$$F_X(t)=\mathbb P(X\le t)=P_X((-\infty,t]).$$

The CDF is one way to encode the pushforward law on the real line.
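For a discrete law, encoding the pushforward on the line means accumulating mass into a right-continuous step function. A minimal sketch for the fair-die law:

```python
# Law of a fair die face: P_X({k}) = 1/6 for k = 1..6.
law = {k: 1 / 6 for k in range(1, 7)}

def F(t):
    """F_X(t) = P(X <= t): total mass at values <= t."""
    return sum(p for x, p in law.items() if x <= t)

assert F(0) == 0.0                       # no mass below the support
assert abs(F(3.5) - 0.5) < 1e-12         # half the faces are <= 3.5
assert abs(F(6) - 1.0) < 1e-12           # all mass captured at the top
assert F(2) == F(2.9)                    # flat between atoms: a step function
```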

A random variable can be discrete, continuous, mixed, or neither in the elementary density/mass-function sense. The law $P_X$ always exists for a measurable real-valued random variable. A probability mass function or density exists only under extra structure.

To construct a real-valued random variable carefully, keep the objects in this order:

  1. Choose the measurable outcome space $(\Omega,\mathcal F)$.
  2. Define a numerical map $X:\Omega\to\mathbb R$.
  3. Check that $X^{-1}(B)\in\mathcal F$ for every Borel set $B$.
  4. Push $\mathbb P$ through $X$ to get the law $P_X$.

On a finite space with $\mathcal F=2^\Omega$, the measurability check is automatic. In measure theory, it is not a decorative condition: it is what makes probabilities such as $\mathbb P(X\le t)$ legal.
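The measurability check in step 3 can be made concrete on a finite space whose sigma-algebra is coarser than the power set. A Python sketch (the helpers `sigma_from_partition` and `is_measurable` are ad hoc names for this illustration), showing that some maps then fail to be random variables:

```python
from itertools import chain, combinations

Omega = frozenset({1, 2, 3, 4, 5, 6})

def sigma_from_partition(blocks):
    """Sigma-algebra generated by a partition of a finite Omega: all unions of blocks."""
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset(chain.from_iterable(combo)))
    return events

# Coarse information: we can only tell odd from even.
F = sigma_from_partition([frozenset({1, 3, 5}), frozenset({2, 4, 6})])

def is_measurable(X, F):
    """On finite Omega it suffices that every value's preimage is an event in F."""
    return all(frozenset(w for w in Omega if X(w) == v) in F
               for v in {X(w) for w in Omega})

parity = lambda w: w % 2   # decidable from odd/even: measurable w.r.t. F
face = lambda w: w         # needs detail F does not contain: NOT measurable

assert is_measurable(parity, F)
assert not is_measurable(face, F)
```

Checking singleton-value preimages is enough here because any other preimage is a finite union of them, and `F` is closed under unions by construction.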

Pushforward Theorem

Theorem

A Random Variable Induces a Probability Distribution

Statement

If $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ is measurable, then

$$P_X(B)=\mathbb P(X^{-1}(B))$$

defines a probability measure on $\mathcal B(\mathbb R)$.

Intuition

To measure a set of output values $B$, pull it back through $X$ to the set of outcomes that land in $B$. The probability of that preimage is the probability assigned to $B$ under the distribution of $X$.

Proof Sketch

Measurability gives $X^{-1}(B)\in\mathcal F$, so $P_X(B)$ is defined. Non-negativity follows from non-negativity of $\mathbb P$. Normalization holds because $X^{-1}(\mathbb R)=\Omega$, so $P_X(\mathbb R)=1$. For pairwise disjoint Borel sets $B_1,B_2,\ldots$, their preimages are pairwise disjoint events and $X^{-1}\!\left(\bigcup_n B_n\right)=\bigcup_n X^{-1}(B_n)$. Countable additivity of $\mathbb P$ gives countable additivity of $P_X$.

Why It Matters

This is why you can often forget the original sample space and reason only with the distribution of $X$. Expectations, quantiles, densities, and tail bounds are all statements about the pushforward law.

Failure Mode

The distribution does not remember everything about the original probability space. Two different random variables on different sample spaces can have the same law. Distributional equality does not mean the variables are the same function or even live on the same $\Omega$.

Generated Information

Definition

Sigma-Algebra Generated by a Random Variable

The sigma-algebra generated by $X$ is

$$\sigma(X)=\{X^{-1}(B):B\in\mathcal B(\mathbb R)\}.$$

It is the information revealed by observing $X$. An event belongs to $\sigma(X)$ exactly when it can be decided from the value of $X$.

This is the clean way to say "what the variable tells you." If $Y=g(X)$ for some measurable function $g$, then $\sigma(Y)\subseteq\sigma(X)$: observing $X$ tells you at least as much as observing $Y$.
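The inclusion $\sigma(Y)\subseteq\sigma(X)$ can be verified by brute force on a finite space. A Python sketch (the helper `generated_sigma` is an ad hoc name), with $X$ the die face and $Y$ its parity:

```python
from itertools import combinations

Omega = [1, 2, 3, 4, 5, 6]

def generated_sigma(X):
    """sigma(X) on a finite Omega: all preimages X^{-1}(B), B ranging over
    subsets of the range of X."""
    values = sorted(set(X(w) for w in Omega))
    events = set()
    for r in range(len(values) + 1):
        for B in combinations(values, r):
            events.add(frozenset(w for w in Omega if X(w) in B))
    return events

X = lambda w: w          # the face itself
Y = lambda w: X(w) % 2   # a function of X: its parity

sigma_X = generated_sigma(X)
sigma_Y = generated_sigma(Y)

# Observing X tells you at least as much as observing Y = g(X).
assert sigma_Y <= sigma_X       # subset test: sigma(Y) is contained in sigma(X)
assert len(sigma_X) == 2 ** 6   # X is injective, so sigma(X) is the full power set
assert len(sigma_Y) == 4        # empty set, odds, evens, Omega
```

The four events in `sigma_Y` are exactly the questions parity can answer; every one of them is also answerable from the face value.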

Theorem

Generated Sigma-Algebra Is the Smallest Information Making X Measurable

Statement

The sigma-algebra $\sigma(X)$ is the smallest sub-sigma-algebra $\mathcal G\subseteq\mathcal F$ such that $X$ is measurable as a map from $(\Omega,\mathcal G)$ to $(\mathbb R,\mathcal B(\mathbb R))$.

Intuition

To observe $X$, you must be able to answer every question of the form "did $X$ land in this Borel set?" The answers to those questions are exactly the preimages $X^{-1}(B)$.

Proof Sketch

Every set $X^{-1}(B)$ is needed for $X$ to be $\mathcal G$-measurable, so any sigma-algebra that makes $X$ measurable must contain all such preimages. The collection of those preimages is already closed under complements and countable unions because preimages preserve set operations. Therefore it is the smallest such sigma-algebra.

Why It Matters

This is the bridge from random variables to information. Conditional expectation, filtrations, sufficient statistics, and data leakage all ask which sigma-algebra is available at decision time.

Failure Mode

Do not read $\sigma(X)$ as standard deviation. Here $\sigma$ means sigma-algebra: the collection of events whose truth can be decided from observing $X$.

Expectation

Definition

Expectation

For an integrable random variable $X$,

$$\mathbb E[X]=\int_\Omega X(\omega)\,d\mathbb P(\omega)=\int_{\mathbb R} x\,dP_X(x).$$

The first integral views $X$ as a function on outcomes. The second views it through its distribution. They agree by the change-of-variables formula for pushforward measures.

For a finite sample space, this becomes the weighted average

$$\mathbb E[X]=\sum_{\omega\in\Omega}X(\omega)\,\mathbb P(\{\omega\}).$$

For a continuous density $f_X$, it becomes

$$\mathbb E[X]=\int_{-\infty}^{\infty}x\,f_X(x)\,dx.$$

The formula changes with representation; the object does not. This is why the reference page on common probability distributions can work directly with Bernoulli, Gaussian, and triangular laws without re-describing the original sample space every time.
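On the finite-space example from earlier, the two integrals specialize to two sums that must agree. A quick Python check:

```python
from collections import defaultdict
from fractions import Fraction

# Reuse the earlier finite example: probabilities on four outcomes and the map X.
P = {"a": Fraction(1, 10), "b": Fraction(2, 10),
     "c": Fraction(3, 10), "d": Fraction(4, 10)}
X = {"a": 0, "b": 1, "c": 1, "d": 3}

# Expectation over outcomes: sum over omega of X(omega) * P({omega}).
E_outcomes = sum(X[w] * P[w] for w in P)

# Expectation over the pushforward law: sum over values of x * P_X({x}).
law = defaultdict(Fraction)
for w, p in P.items():
    law[X[w]] += p
E_law = sum(x * p for x, p in law.items())

# Change of variables for pushforward measures, in finite form.
assert E_outcomes == E_law == Fraction(17, 10)
```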

Theorem

Expectation Depends on the Pushforward Law

Statement

If $g:\mathbb R\to\mathbb R$ is measurable and $g(X)$ is integrable, then

$$\mathbb E[g(X)]=\int_\Omega g(X(\omega))\,d\mathbb P(\omega)=\int_{\mathbb R} g(x)\,dP_X(x).$$

Intuition

To average a function of $X$, you do not need the original outcome labels. You only need how much probability lands at each output value.

Proof Sketch

First prove the identity for indicator functions $g=\mathbf 1_B$; it is exactly the definition of $P_X(B)$. Extend to nonnegative simple functions by linearity, to nonnegative measurable functions by monotone convergence, and to integrable signed functions by positive and negative parts.

Why It Matters

This theorem justifies computing means and moments from a distribution table, density, CDF, or simulation histogram. It is the rigorous reason the Probability Mechanics Lab can work with the output distribution after the map has been built.

Failure Mode

The formula requires integrability. A random variable can be measurable while $\mathbb E[X]$ or $\mathbb E[g(X)]$ is undefined or infinite.
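The theorem is also what makes simulation histograms trustworthy: averaging $g$ over repeated draws estimates the same number the law gives exactly. A sketch with the die and $g(x)=(x-3.5)^2$ (the tolerance on the Monte Carlo estimate is a loose bound for this seed and sample size):

```python
import random

random.seed(42)

g = lambda x: (x - 3.5) ** 2

# Exact: E[g(X)] computed from the pushforward law alone, P_X({k}) = 1/6.
exact = sum(g(k) * (1 / 6) for k in range(1, 7))   # = 35/12

# Monte Carlo: average g over repeated draws of the outcome.
n = 200_000
mc = sum(g(random.randint(1, 6)) for _ in range(n)) / n

assert abs(exact - 35 / 12) < 1e-12
assert abs(mc - exact) < 0.05   # agreement up to sampling noise
```

The simulation never sees the law directly; it only draws outcomes and applies the rule, yet it converges to the integral against $P_X$.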

Equality Notions

There are three common equalities, and they are not interchangeable:

  • $X=Y$ means the two functions agree at every outcome.
  • $X=Y$ almost surely means $\mathbb P(X=Y)=1$.
  • $X\overset{d}{=}Y$ means $P_X=P_Y$, so the variables have the same law.

Almost-sure equality is enough for most probabilistic calculations because probability-zero exceptions do not change integrals. Equality in distribution is weaker: two variables can have the same law while living on different sample spaces or being independent copies on the same space.

Random Variables as Right Triangles

If $X$ has finite second moment, it lives in the Hilbert space $L^2(\Omega,\mathcal F,\mathbb P)$ with inner product

$$\langle U,V\rangle=\mathbb E[UV].$$

After centering, $X-\mathbb E[X]$ has squared length

$$\|X-\mathbb E[X]\|_2^2=\operatorname{Var}(X).$$

That is the bridge to right triangles. Conditional expectation is an orthogonal projection onto a smaller information space. The Pythagorean theorem becomes the law of total variance.

Theorem

Law of Total Variance as a Right Triangle

Statement

Let $\mathcal G\subseteq\mathcal F$ and let $X\in L^2$. Then

$$X-\mathbb E[X]=\big(\mathbb E[X\mid\mathcal G]-\mathbb E[X]\big)+\big(X-\mathbb E[X\mid\mathcal G]\big),$$

and the two terms on the right are orthogonal in $L^2$. Therefore

$$\operatorname{Var}(X)=\operatorname{Var}(\mathbb E[X\mid\mathcal G])+\mathbb E[\operatorname{Var}(X\mid\mathcal G)].$$

Intuition

The first leg is variation explained by the information $\mathcal G$. The second leg is residual variation left after that information is used. They meet at a right angle because a projection error is orthogonal to the projection space.

Proof Sketch

Conditional expectation $\mathbb E[X\mid\mathcal G]$ is the $L^2$ projection of $X$ onto the closed subspace of $\mathcal G$-measurable random variables. Projection geometry gives orthogonality between the projected component and the residual. Taking squared $L^2$ norms gives the Pythagorean identity; the two squared norms are exactly the two variance terms.

Why It Matters

This is the geometric reason behind ANOVA, bias-variance decompositions, random effects models, and the idea that a feature explains part of the variation in a target.

Failure Mode

This geometric statement needs $X\in L^2$. Heavy-tailed random variables without finite variance can still be random variables, but variance is no longer a finite squared length.
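The decomposition can be checked exactly on a finite space. A minimal numeric sketch: a fair die with $\mathcal G$ generated by the odd/even partition, so $\mathbb E[X\mid\mathcal G]$ is the block average on each partition cell:

```python
from fractions import Fraction

# Fair die; condition on the parity information G = sigma(parity).
Omega = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)
X = {w: Fraction(w) for w in Omega}

blocks = [[1, 3, 5], [2, 4, 6]]   # the partition generating G

def E(f):
    """Expectation of a function given as a dict over Omega."""
    return sum(f[w] * p for w in Omega)

mean = E(X)   # 7/2

# E[X | G]: constant on each block, equal to the block average of X.
cond_mean = {}
for block in blocks:
    avg = sum(X[w] for w in block) / len(block)
    for w in block:
        cond_mean[w] = avg

explained = E({w: (cond_mean[w] - mean) ** 2 for w in Omega})  # Var(E[X|G])
residual = E({w: (X[w] - cond_mean[w]) ** 2 for w in Omega})   # E[Var(X|G)]
total = E({w: (X[w] - mean) ** 2 for w in Omega})              # Var(X)

# Pythagoras: 35/12 = 1/4 + 8/3, exactly.
assert total == explained + residual
```

Parity explains only $1/4$ of the total variance $35/12$; the rest is residual spread within each parity class.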

Common Confusions

Watch Out

A random variable is not random in the programming sense

In mathematics, $X$ is a fixed function. Randomness enters because the input $\omega$ is sampled according to $\mathbb P$. A simulation may resample an outcome, but the random variable itself is the rule mapping outcomes to numbers.

Watch Out

The distribution is not the random variable

Two random variables can have the same distribution while being different functions. For example, on two independent coin flips, $X$ can read the first coin and $Y$ can read the second coin. Both are Bernoulli with the same parameter, but $X$ and $Y$ are different random variables.

Watch Out

Probability zero does not mean impossible

If $X$ is continuous, then $\mathbb P(X=x)=0$ for every fixed $x$, but $X$ still takes some value with probability one. A point can have zero probability without being excluded from the sample space.

Exercises

ExerciseCore

Problem

Let $\Omega=\{a,b,c\}$ with probabilities $1/4,1/4,1/2$. Define $X(a)=0$, $X(b)=2$, and $X(c)=2$. Find the distribution of $X$.

ExerciseCore

Problem

Suppose $X$ and $Y$ satisfy $X\overset{d}{=}Y$. Must $X=Y$ almost surely?

ExerciseAdvanced

Problem

Let $X$ be a finite-variance random variable and let $\mathcal G$ be a sub-sigma-algebra. Prove that $\mathbb E[(X-\mathbb E[X\mid\mathcal G])Z]=0$ for every bounded $\mathcal G$-measurable random variable $Z$.

ExerciseAdvanced

Problem

Let $X$ be a nonnegative random variable and $g$ be nonnegative and measurable. Sketch why $\mathbb E[g(X)]=\int_{\mathbb R} g\,dP_X$ follows from the identity for indicator functions.

References

Canonical:

  • Billingsley, Probability and Measure (1995), Chapters 1-5
  • Durrett, Probability: Theory and Examples (2019), Chapters 1-2
  • Williams, Probability with Martingales (1991), Chapters 1-5
  • Pollard, A User's Guide to Measure Theoretic Probability (2002), Chapters 1-3
  • Kallenberg, Foundations of Modern Probability (2021), Chapters 1-2

For intuition and applications:

  • Blitzstein and Hwang, Introduction to Probability (2019), Chapters 1-4
  • Grimmett and Stirzaker, Probability and Random Processes (2020), Chapters 1-3
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6

Last reviewed: April 22, 2026
