
Foundations

Random Variables

Random variables as measurement rules, their distributions, expectation and variance, and the rigorous measurable-map definition used in probability theory.


Why This Matters

If you are new to probability, read a random variable as a measurement whose value is unknown before the experiment happens. Roll a die and record the face. Flip ten coins and record the number of heads. Draw a dataset and record the validation accuracy of a model trained on it. Each recorded quantity is a random variable.

That beginner sentence is useful, but it hides one important detail: the random variable is the rule, not the value you happened to observe. The die roll can land on $4$, but the random variable is the rule "return the face showing on the die." The next roll may produce a different value; the rule is the same.

This matters in ML because almost every performance number is a random variable: training loss depends on the sampled dataset, validation accuracy depends on the split, a stochastic gradient depends on the minibatch, and a model output can depend on noisy inputs. Once that is clear, expectation means "average this quantity over repeated draws," and variance means "how much this quantity changes across draws."

Quick Version, No Measure Theory Yet

You can understand most first-course probability with this four-part picture:

Object | Plain meaning | Example
Experiment | The random thing that happens | Roll a die
Outcome | One realized result | The die shows $4$
Random variable | A rule that turns an outcome into a number | $X =$ die face
Distribution | The probabilities of the possible output values | $\mathbb P(X=k)=1/6$ for $k=1,\ldots,6$

The notation $X$ does two jobs. Before the experiment, $X$ is an unknown quantity with a distribution. After the experiment, you observe one value such as $X=4$. Most confusion comes from mixing up those two viewpoints.

Example

Three random variables from the same die roll

Roll one fair die.

  • $X =$ the face value, so $X\in\{1,2,3,4,5,6\}$.
  • $Y =$ whether the roll is even, coded as $1$ for even and $0$ for odd.
  • $Z=(X-3.5)^2$, the squared distance from the middle.

The same outcome feeds three different rules. If the die shows $4$, then $X=4$, $Y=1$, and $Z=0.25$. The rules are fixed before the roll; only the realized outcome changes.
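The three rules can be written as ordinary functions. A minimal Python sketch (the function names `X`, `Y`, `Z` simply mirror the notation above):

```python
import random

random.seed(0)

# The rules are fixed before the roll; only the sampled outcome changes.
def X(face):
    """The face value itself."""
    return face

def Y(face):
    """1 if the roll is even, 0 if odd."""
    return 1 if face % 2 == 0 else 0

def Z(face):
    """Squared distance from the middle of the die."""
    return (face - 3.5) ** 2

face = random.randint(1, 6)        # one realized outcome omega
print(face, X(face), Y(face), Z(face))

# The same outcome feeds all three rules; e.g. the outcome 4:
assert X(4) == 4 and Y(4) == 1 and Z(4) == 0.25
```

Resampling `face` changes the realized values but never the three functions, which is exactly the rule-versus-value distinction.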

For many applications, this is enough: define the quantity you care about, write down or estimate its distribution, then compute probabilities, expectations, variances, or tail bounds. The rigorous definition below exists to make that same idea work on infinite spaces, continuous variables, random vectors, stochastic processes, and conditional information.

Formal Definition

Now translate the friendly version into probability theory. The experiment is modeled by a probability space $(\Omega,\mathcal F,\mathbb P)$. The set $\Omega$ contains outcomes, $\mathcal F$ contains the events whose probabilities are defined, and $\mathbb P$ assigns those probabilities. A real-valued random variable is a number-producing rule on that space.

Definition

Random Variable

Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A real-valued random variable is a measurable function

$$X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R)),$$

meaning that for every Borel set $B\in\mathcal B(\mathbb R)$,

$$X^{-1}(B)=\{\omega\in\Omega:X(\omega)\in B\}\in\mathcal F.$$

The measurability condition says every numerical event about $X$ is an event whose probability is defined.

For example, the event $\{X\le t\}$ is shorthand for $\{\omega:X(\omega)\le t\}$. If $X$ were not measurable, this set might not belong to $\mathcal F$, so $\mathbb P(X\le t)$ would not be meaningful.

The probability measure is not needed to say that $X$ is measurable; it is needed to assign probabilities to the pulled-back events. More generally, a random element is a measurable map from $(\Omega,\mathcal F)$ into another measurable space $(S,\mathcal S)$. Real-valued random variables use $(S,\mathcal S)=(\mathbb R,\mathcal B(\mathbb R))$. Random vectors use $(\mathbb R^d,\mathcal B(\mathbb R^d))$.

Definition

Distribution or Law

The distribution or law of $X$ is the probability measure $P_X$ on $(\mathbb R,\mathcal B(\mathbb R))$ defined by

$$P_X(B)=\mathbb P(X\in B)=\mathbb P(X^{-1}(B)).$$

This is also called the pushforward measure of $\mathbb P$ through $X$, written $P_X=X_\#\mathbb P$.

The random variable and its distribution are different objects. The random variable is a function on outcomes. The distribution is the probability mass or density seen on the output line after applying that function.

Example

Finite sample space: the map and the law are different

Let $\Omega=\{a,b,c,d\}$, let $\mathcal F=2^\Omega$, and assign probabilities $1/10,\,2/10,\,3/10,\,4/10$. Define

$$X(a)=0,\qquad X(b)=1,\qquad X(c)=1,\qquad X(d)=3.$$

The function $X$ has four inputs. Its law has three output values:

  • $P_X(\{0\})=1/10$
  • $P_X(\{1\})=2/10+3/10=5/10$
  • $P_X(\{3\})=4/10$

For the output event $B=\{1,3\}$, the preimage is $X^{-1}(B)=\{b,c,d\}$, so $P_X(B)=9/10$. This is the core move: questions about values are answered by pulling them back to events.
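The example can be checked mechanically. A small Python sketch of the pushforward computation, using exact fractions to match the numbers above:

```python
from collections import defaultdict
from fractions import Fraction

# The finite example: four outcomes with explicit probabilities, and the map X.
P = {"a": Fraction(1, 10), "b": Fraction(2, 10),
     "c": Fraction(3, 10), "d": Fraction(4, 10)}
X = {"a": 0, "b": 1, "c": 1, "d": 3}

# Push P through X: accumulate the probability of every outcome landing on each value.
law = defaultdict(Fraction)
for omega, p in P.items():
    law[X[omega]] += p

def prob_in(B):
    """P_X(B) computed by pulling B back to the outcomes that land in it."""
    return sum(p for omega, p in P.items() if X[omega] in B)

assert law[1] == Fraction(5, 10)          # mass from b and c merges at the value 1
assert prob_in({1, 3}) == Fraction(9, 10) # preimage {b, c, d}
assert sum(law.values()) == 1             # the law is a probability measure
```

The merge of `b` and `c` at the value `1` is exactly why the law has three atoms while the function has four inputs.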

Definition

Cumulative Distribution Function

For a real-valued random variable $X$, the cumulative distribution function is

$$F_X(t)=\mathbb P(X\le t)=P_X((-\infty,t]).$$

The CDF is one way to encode the pushforward law on the real line.
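For a discrete law, encoding the pushforward on the line means accumulating mass into a right-continuous step function. A minimal sketch for the fair-die law:

```python
# Law of a fair die face: P_X({k}) = 1/6 for k = 1..6.
law = {k: 1 / 6 for k in range(1, 7)}

def F(t):
    """F_X(t) = P(X <= t): total mass at values <= t."""
    return sum(p for x, p in law.items() if x <= t)

assert F(0) == 0.0                       # no mass below the support
assert abs(F(3.5) - 0.5) < 1e-12         # half the faces are <= 3.5
assert abs(F(6) - 1.0) < 1e-12           # all mass captured at the top
assert F(2) == F(2.9)                    # flat between atoms: a step function
```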

A random variable can be discrete, continuous, mixed, or neither in the elementary density/mass-function sense. The law $P_X$ always exists for a measurable real-valued random variable. A probability mass function or density exists only under extra structure.

To construct a real-valued random variable carefully, keep the objects in this order:

  1. Choose the measurable outcome space $(\Omega,\mathcal F)$.
  2. Define a numerical map $X:\Omega\to\mathbb R$.
  3. Check that $X^{-1}(B)\in\mathcal F$ for every Borel set $B$.
  4. Push $\mathbb P$ through $X$ to get the law $P_X$.

On a finite space with $\mathcal F=2^\Omega$, the measurability check is automatic. In measure theory, it is not a decorative condition: it is what makes probabilities such as $\mathbb P(X\le t)$ legal.
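The measurability check in step 3 can be made concrete on a finite space whose sigma-algebra is coarser than the power set. A Python sketch (the helpers `sigma_from_partition` and `is_measurable` are ad hoc names for this illustration), showing that some maps then fail to be random variables:

```python
from itertools import chain, combinations

Omega = frozenset({1, 2, 3, 4, 5, 6})

def sigma_from_partition(blocks):
    """Sigma-algebra generated by a partition of a finite Omega: all unions of blocks."""
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset(chain.from_iterable(combo)))
    return events

# Coarse information: we can only tell odd from even.
F = sigma_from_partition([frozenset({1, 3, 5}), frozenset({2, 4, 6})])

def is_measurable(X, F):
    """On finite Omega it suffices that every value's preimage is an event in F."""
    return all(frozenset(w for w in Omega if X(w) == v) in F
               for v in {X(w) for w in Omega})

parity = lambda w: w % 2   # decidable from odd/even: measurable w.r.t. F
face = lambda w: w         # needs detail F does not contain: NOT measurable

assert is_measurable(parity, F)
assert not is_measurable(face, F)
```

Checking singleton-value preimages is enough here because any other preimage is a finite union of them, and `F` is closed under unions by construction.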

Pushforward Theorem

Theorem

A Random Variable Induces a Probability Distribution

Statement

If $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ is measurable, then

$$P_X(B)=\mathbb P(X^{-1}(B))$$

defines a probability measure on $\mathcal B(\mathbb R)$.

Intuition

To measure a set of output values $B$, pull it back through $X$ to the set of outcomes that land in $B$. The probability of that preimage is the probability assigned to $B$ under the distribution of $X$.

Proof Sketch

Measurability gives $X^{-1}(B)\in\mathcal F$, so $P_X(B)$ is defined. Non-negativity follows from non-negativity of $\mathbb P$. Normalization holds because $X^{-1}(\mathbb R)=\Omega$, so $P_X(\mathbb R)=1$. For pairwise disjoint Borel sets $B_1,B_2,\ldots$, their preimages are pairwise disjoint events and $X^{-1}\!\left(\bigcup_n B_n\right)=\bigcup_n X^{-1}(B_n)$. Countable additivity of $\mathbb P$ gives countable additivity of $P_X$.

Why It Matters

This is why you can often forget the original sample space and reason only with the distribution of $X$. Expectations, quantiles, densities, and tail bounds are all statements about the pushforward law.

Failure Mode

The distribution does not remember everything about the original probability space. Two different random variables on different sample spaces can have the same law. Distributional equality does not mean the variables are the same function or even live on the same $\Omega$.

Generated Information

Definition

Sigma-Algebra Generated by a Random Variable

The sigma-algebra generated by $X$ is

$$\sigma(X)=\{X^{-1}(B):B\in\mathcal B(\mathbb R)\}.$$

It is the information revealed by observing $X$. An event belongs to $\sigma(X)$ exactly when it can be decided from the value of $X$.

This is the clean way to say "what the variable tells you." If $Y=g(X)$ for some measurable function $g$, then $\sigma(Y)\subseteq\sigma(X)$: observing $X$ tells you at least as much as observing $Y$.
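The inclusion $\sigma(Y)\subseteq\sigma(X)$ can be verified by brute force on a finite space. A Python sketch (the helper `generated_sigma` is an ad hoc name), with $X$ the die face and $Y$ its parity:

```python
from itertools import combinations

Omega = [1, 2, 3, 4, 5, 6]

def generated_sigma(X):
    """sigma(X) on a finite Omega: all preimages X^{-1}(B), B ranging over
    subsets of the range of X."""
    values = sorted(set(X(w) for w in Omega))
    events = set()
    for r in range(len(values) + 1):
        for B in combinations(values, r):
            events.add(frozenset(w for w in Omega if X(w) in B))
    return events

X = lambda w: w          # the face itself
Y = lambda w: X(w) % 2   # a function of X: its parity

sigma_X = generated_sigma(X)
sigma_Y = generated_sigma(Y)

# Observing X tells you at least as much as observing Y = g(X).
assert sigma_Y <= sigma_X       # subset test: sigma(Y) is contained in sigma(X)
assert len(sigma_X) == 2 ** 6   # X is injective, so sigma(X) is the full power set
assert len(sigma_Y) == 4        # empty set, odds, evens, Omega
```

The four events in `sigma_Y` are exactly the questions parity can answer; every one of them is also answerable from the face value.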

Theorem

Generated Sigma-Algebra Is the Smallest Information Making X Measurable

Statement

The sigma-algebra $\sigma(X)$ is the smallest sub-sigma-algebra $\mathcal G\subseteq\mathcal F$ such that $X$ is measurable as a map from $(\Omega,\mathcal G)$ to $(\mathbb R,\mathcal B(\mathbb R))$.

Intuition

To observe $X$, you must be able to answer every question of the form "did $X$ land in this Borel set?" The answers to those questions are exactly the preimages $X^{-1}(B)$.

Proof Sketch

Every set $X^{-1}(B)$ is needed for $X$ to be $\mathcal G$-measurable, so any sigma-algebra that makes $X$ measurable must contain all such preimages. The collection of those preimages is already closed under complements and countable unions because preimages preserve set operations. Therefore it is the smallest such sigma-algebra.

Why It Matters

This is the bridge from random variables to information. Conditional expectation, filtrations, sufficient statistics, and data leakage all ask which sigma-algebra is available at decision time.

Failure Mode

Do not read $\sigma(X)$ as standard deviation. Here $\sigma$ means sigma-algebra: the collection of events whose truth can be decided from observing $X$.

Expectation

Definition

Expectation

For an integrable random variable $X$,

$$\mathbb E[X]=\int_\Omega X(\omega)\,d\mathbb P(\omega)=\int_{\mathbb R} x\,dP_X(x).$$

The first integral views $X$ as a function on outcomes. The second views it through its distribution. They agree by the change-of-variables formula for pushforward measures.

For a finite sample space, this becomes the weighted average

$$\mathbb E[X]=\sum_{\omega\in\Omega}X(\omega)\,\mathbb P(\{\omega\}).$$

For a continuous density $f_X$, it becomes

$$\mathbb E[X]=\int_{-\infty}^{\infty}x\,f_X(x)\,dx.$$

The formula changes with representation; the object does not. This is why the reference page on common probability distributions can work directly with Bernoulli, Gaussian, and triangular laws without re-describing the original sample space every time.
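On the finite-space example from earlier, the two integrals specialize to two sums that must agree. A quick Python check:

```python
from collections import defaultdict
from fractions import Fraction

# Reuse the earlier finite example: probabilities on four outcomes and the map X.
P = {"a": Fraction(1, 10), "b": Fraction(2, 10),
     "c": Fraction(3, 10), "d": Fraction(4, 10)}
X = {"a": 0, "b": 1, "c": 1, "d": 3}

# Expectation over outcomes: sum over omega of X(omega) * P({omega}).
E_outcomes = sum(X[w] * P[w] for w in P)

# Expectation over the pushforward law: sum over values of x * P_X({x}).
law = defaultdict(Fraction)
for w, p in P.items():
    law[X[w]] += p
E_law = sum(x * p for x, p in law.items())

# Change of variables for pushforward measures, in finite form.
assert E_outcomes == E_law == Fraction(17, 10)
```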

Theorem

Expectation Depends on the Pushforward Law

Statement

If $g:\mathbb R\to\mathbb R$ is measurable and $g(X)$ is integrable, then

$$\mathbb E[g(X)]=\int_\Omega g(X(\omega))\,d\mathbb P(\omega)=\int_{\mathbb R} g(x)\,dP_X(x).$$

Intuition

To average a function of $X$, you do not need the original outcome labels. You only need how much probability lands at each output value.

Proof Sketch

First prove the identity for indicator functions $g=\mathbf 1_B$; it is exactly the definition of $P_X(B)$. Extend to nonnegative simple functions by linearity, to nonnegative measurable functions by monotone convergence, and to integrable signed functions by positive and negative parts.

Why It Matters

This theorem justifies computing means and moments from a distribution table, density, CDF, or simulation histogram. It is the rigorous reason the Probability Mechanics Lab can work with the output distribution after the map has been built.

Failure Mode

The formula requires integrability. A random variable can be measurable while $\mathbb E[X]$ or $\mathbb E[g(X)]$ is undefined or infinite.
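The theorem is also what makes simulation histograms trustworthy: averaging $g$ over repeated draws estimates the same number the law gives exactly. A sketch with the die and $g(x)=(x-3.5)^2$ (the tolerance on the Monte Carlo estimate is a loose bound for this seed and sample size):

```python
import random

random.seed(42)

g = lambda x: (x - 3.5) ** 2

# Exact: E[g(X)] computed from the pushforward law alone, P_X({k}) = 1/6.
exact = sum(g(k) * (1 / 6) for k in range(1, 7))   # = 35/12

# Monte Carlo: average g over repeated draws of the outcome.
n = 200_000
mc = sum(g(random.randint(1, 6)) for _ in range(n)) / n

assert abs(exact - 35 / 12) < 1e-12
assert abs(mc - exact) < 0.05   # agreement up to sampling noise
```

The simulation never sees the law directly; it only draws outcomes and applies the rule, yet it converges to the integral against $P_X$.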

Equality Notions

There are three common equalities, and they are not interchangeable:

  • $X=Y$ means the two functions agree at every outcome.
  • $X=Y$ almost surely means $\mathbb P(X=Y)=1$.
  • $X\overset{d}{=}Y$ means $P_X=P_Y$, so the variables have the same law.

Almost-sure equality is enough for most probabilistic calculations because probability-zero exceptions do not change integrals. Equality in distribution is weaker: two variables can have the same law while living on different sample spaces or being independent copies on the same space.

Random Variables as Right Triangles

If $X$ has finite second moment, it lives in the Hilbert space $L^2(\Omega,\mathcal F,\mathbb P)$ with inner product

$$\langle U,V\rangle=\mathbb E[UV].$$

After centering, $X-\mathbb E[X]$ has squared length

$$\|X-\mathbb E[X]\|_2^2=\operatorname{Var}(X).$$

That is the bridge to right triangles. Conditional expectation is an orthogonal projection onto a smaller information space. The Pythagorean theorem becomes the law of total variance.

Theorem

Law of Total Variance as a Right Triangle

Statement

Let $\mathcal G\subseteq\mathcal F$ and let $X\in L^2$. Then

$$X-\mathbb E[X]=\big(\mathbb E[X\mid\mathcal G]-\mathbb E[X]\big)+\big(X-\mathbb E[X\mid\mathcal G]\big),$$

and the two terms on the right are orthogonal in $L^2$. Therefore

$$\operatorname{Var}(X)=\operatorname{Var}(\mathbb E[X\mid\mathcal G])+\mathbb E[\operatorname{Var}(X\mid\mathcal G)].$$

Intuition

The first leg is variation explained by the information $\mathcal G$. The second leg is residual variation left after that information is used. They meet at a right angle because a projection error is orthogonal to the projection space.

Proof Sketch

Conditional expectation $\mathbb E[X\mid\mathcal G]$ is the $L^2$ projection of $X$ onto the closed subspace of $\mathcal G$-measurable random variables. Projection geometry gives orthogonality between the projected component and the residual. Taking squared $L^2$ norms gives the Pythagorean identity; the two squared norms are exactly the two variance terms.

Why It Matters

This is the geometric reason behind ANOVA, bias-variance decompositions, random effects models, and the idea that a feature explains part of the variation in a target.

Failure Mode

This geometric statement needs $X\in L^2$. Heavy-tailed random variables without finite variance can still be random variables, but variance is no longer a finite squared length.
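The decomposition can be checked exactly on a finite space. A minimal numeric sketch: a fair die with $\mathcal G$ generated by the odd/even partition, so $\mathbb E[X\mid\mathcal G]$ is the block average on each partition cell:

```python
from fractions import Fraction

# Fair die; condition on the parity information G = sigma(parity).
Omega = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)
X = {w: Fraction(w) for w in Omega}

blocks = [[1, 3, 5], [2, 4, 6]]   # the partition generating G

def E(f):
    """Expectation of a function given as a dict over Omega."""
    return sum(f[w] * p for w in Omega)

mean = E(X)   # 7/2

# E[X | G]: constant on each block, equal to the block average of X.
cond_mean = {}
for block in blocks:
    avg = sum(X[w] for w in block) / len(block)
    for w in block:
        cond_mean[w] = avg

explained = E({w: (cond_mean[w] - mean) ** 2 for w in Omega})  # Var(E[X|G])
residual = E({w: (X[w] - cond_mean[w]) ** 2 for w in Omega})   # E[Var(X|G)]
total = E({w: (X[w] - mean) ** 2 for w in Omega})              # Var(X)

# Pythagoras: 35/12 = 1/4 + 8/3, exactly.
assert total == explained + residual
```

Parity explains only $1/4$ of the total variance $35/12$; the rest is residual spread within each parity class.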

Common Confusions

Watch Out

A random variable is not random in the programming sense

In mathematics, $X$ is a fixed function. Randomness enters because the input $\omega$ is sampled according to $\mathbb P$. A simulation may resample an outcome, but the random variable itself is the rule mapping outcomes to numbers.

Watch Out

The distribution is not the random variable

Two random variables can have the same distribution while being different functions. For example, on two independent coin flips, $X$ can read the first coin and $Y$ can read the second coin. Both are Bernoulli with the same parameter, but $X$ and $Y$ are different random variables.

Watch Out

Probability zero does not mean impossible

If $X$ is continuous, then $\mathbb P(X=x)=0$ for every fixed $x$, but $X$ still takes some value with probability one. A point can have zero probability without being excluded from the sample space.

Exercises

ExerciseCore

Problem

Let $\Omega=\{a,b,c\}$ with probabilities $1/4,1/4,1/2$. Define $X(a)=0$, $X(b)=2$, and $X(c)=2$. Find the distribution of $X$.

ExerciseCore

Problem

Suppose $X$ and $Y$ satisfy $X\overset{d}{=}Y$. Must $X=Y$ almost surely?

ExerciseAdvanced

Problem

Let $X$ be a finite-variance random variable and let $\mathcal G$ be a sub-sigma-algebra. Prove that $\mathbb E[(X-\mathbb E[X\mid\mathcal G])Z]=0$ for every bounded $\mathcal G$-measurable random variable $Z$.

ExerciseAdvanced

Problem

Let $X$ be a nonnegative random variable and $g$ be nonnegative and measurable. Sketch why $\mathbb E[g(X)]=\int_{\mathbb R} g\,dP_X$ follows from the identity for indicator functions.

References

Canonical:

  • Billingsley, Probability and Measure (1995), Chapters 1-5
  • Durrett, Probability: Theory and Examples (2019), Chapters 1-2
  • Williams, Probability with Martingales (1991), Chapters 1-5
  • Pollard, A User's Guide to Measure Theoretic Probability (2002), Chapters 1-3
  • Kallenberg, Foundations of Modern Probability (2021), Chapters 1-2

For intuition and applications:

  • Blitzstein and Hwang, Introduction to Probability (2019), Chapters 1-4
  • Grimmett and Stirzaker, Probability and Random Processes (2020), Chapters 1-3
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6

Last reviewed: April 22, 2026
