Foundations
Random Variables
Random variables as measurement rules, their distributions, expectation and variance, and the rigorous measurable-map definition used in probability theory.
Why This Matters
If you are new to probability, read a random variable as a measurement whose value is unknown before the experiment happens. Roll a die and record the face. Flip ten coins and record the number of heads. Draw a dataset and record the validation accuracy of a model trained on it. Each recorded quantity is a random variable.
That beginner sentence is useful, but it hides one important detail: the random variable is the rule, not the value you happened to observe. The die roll can land on $4$, but the random variable is the rule "return the face showing on the die." The next roll may produce a different value; the rule is the same.
This matters in ML because almost every performance number is a random variable: training loss depends on the sampled dataset, validation accuracy depends on the split, a stochastic gradient depends on the minibatch, and a model output can depend on noisy inputs. Once that is clear, expectation means "average this quantity over repeated draws," and variance means "how much this quantity changes across draws."
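As a quick sanity check on "average over repeated draws," the sketch below estimates the expectation and variance of a die face by simulation. It is a minimal Python illustration; the function names are ours, not part of any library.

```python
import random

def sample_mean_and_variance(draw, n, seed=0):
    """Estimate E[X] and Var(X) by averaging n independent draws of X."""
    rng = random.Random(seed)
    values = [draw(rng) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

# X = face of a fair die: the rule is fixed, only the realized value varies.
die = lambda rng: rng.randint(1, 6)

mean, var = sample_mean_and_variance(die, 100_000)
print(mean, var)  # close to E[X] = 3.5 and Var(X) = 35/12 ≈ 2.9167
```

Rerunning with a different seed changes the estimates slightly but not the targets: expectation and variance are properties of the rule's distribution, not of any single run.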
Quick Version, No Measure Theory Yet
You can understand most first-course probability with this four-part picture:
| Object | Plain meaning | Example |
|---|---|---|
| Experiment | The random thing that happens | Roll a die |
| Outcome | One realized result | The die shows $4$ |
| Random variable | A rule that turns an outcome into a number | $X(\omega) = $ the die face |
| Distribution | The probabilities of the possible output values | $P(X = k) = 1/6$ for $k = 1, \dots, 6$ |
The notation $X$ does two jobs. Before the experiment, $X$ is an unknown quantity with a distribution. After the experiment, you observe one value such as $x = 4$. Most confusion comes from mixing up those two viewpoints.
Three random variables from the same die roll
Roll one fair die.
- $X$ = the face value, so $X(\omega) \in \{1, 2, 3, 4, 5, 6\}$.
- $Y$ = whether the roll is even, coded as $1$ for even and $0$ for odd.
- $Z = (X - 3.5)^2$, the squared distance from the middle.
The same outcome feeds three different rules. If the die shows $4$, then $X = 4$, $Y = 1$, and $Z = 0.25$. The rules are fixed before the roll; only the realized outcome changes.
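The three rules can be written down directly as functions. This minimal Python sketch applies all three to one sampled outcome:

```python
import random

# Three different measurement rules applied to the same die outcome.
X = lambda w: w                        # face value
Y = lambda w: 1 if w % 2 == 0 else 0   # even indicator
Z = lambda w: (w - 3.5) ** 2           # squared distance from the middle

rng = random.Random(42)
w = rng.randint(1, 6)   # one realized outcome feeds all three rules
print(w, X(w), Y(w), Z(w))
```

The rules `X`, `Y`, `Z` never change between runs; only the sampled `w` does, which is exactly the point of the section above.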
For many applications, this is enough: define the quantity you care about, write down or estimate its distribution, then compute probabilities, expectations, variances, or tail bounds. The rigorous definition below exists to make that same idea work on infinite spaces, continuous variables, random vectors, stochastic processes, and conditional information.
Formal Definition
Now translate the friendly version into probability theory. The experiment is modeled by a probability space $(\Omega, \mathcal{F}, P)$. The set $\Omega$ contains outcomes, $\mathcal{F}$ contains the events whose probabilities are defined, and $P$ assigns those probabilities. A real-valued random variable is a number-producing rule on that space.
Random Variable
Let $(\Omega, \mathcal{F}, P)$ be a probability space. A real-valued random variable is a measurable function
$$X : \Omega \to \mathbb{R},$$
meaning that for every Borel set $B \subseteq \mathbb{R}$,
$$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}.$$
The measurability condition says every numerical event about $X$ is an event whose probability is defined.
For example, the event $\{X \le t\}$ is shorthand for $\{\omega \in \Omega : X(\omega) \le t\}$. If $X$ were not measurable, this set might not belong to $\mathcal{F}$, so $P(X \le t)$ would not be meaningful.
The probability measure $P$ is not needed to say that $X$ is measurable; it is needed to assign probabilities to the pulled-back events. More generally, a random element is a measurable map from $(\Omega, \mathcal{F})$ into another measurable space $(E, \mathcal{E})$. Real-valued random variables use $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Random vectors use $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$.
Distribution or Law
The distribution or law of $X$ is the probability measure $P_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined by
$$P_X(B) = P(X^{-1}(B)) = P(X \in B), \qquad B \in \mathcal{B}(\mathbb{R}).$$
This is also called the pushforward measure of $P$ through $X$, written $P_X = P \circ X^{-1}$.
The random variable and its distribution are different objects. The random variable is a function on outcomes. The distribution is the probability mass or density seen on the output line after applying that function.
Finite sample space: the map and the law are different
Let $\Omega = \{a, b, c, d\}$, let $\mathcal{F} = 2^\Omega$, and assign probabilities $P(\{a\}) = 0.1$, $P(\{b\}) = 0.4$, $P(\{c\}) = 0.3$, $P(\{d\}) = 0.2$. Define
$$X(a) = 0, \quad X(b) = 0, \quad X(c) = 1, \quad X(d) = 2.$$
The function $X$ has four inputs. Its law has three output values:
$$P_X(\{0\}) = 0.5, \qquad P_X(\{1\}) = 0.3, \qquad P_X(\{2\}) = 0.2.$$
For the output event $\{0\}$, the preimage is $X^{-1}(\{0\}) = \{a, b\}$, so $P(X = 0) = 0.1 + 0.4 = 0.5$. This is the core move: questions about values are answered by pulling them back to events.
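The pulling-back computation can be done mechanically. This Python sketch uses a hypothetical four-outcome space with probabilities $0.1, 0.4, 0.3, 0.2$ (matching the example above, but any finite assignment works) and groups outcome probabilities by output value:

```python
from collections import defaultdict

# Hypothetical finite probability space and map (illustrative numbers).
P = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}
X = {"a": 0, "b": 0, "c": 1, "d": 2}

def pushforward(P, X):
    """Law of X: group outcome probabilities by output value."""
    law = defaultdict(float)
    for omega, p in P.items():
        law[X[omega]] += p
    return dict(law)

law = pushforward(P, X)
print(law)  # {0: 0.5, 1: 0.3, 2: 0.2}

# The core move: P(X = 0) is P of the preimage {a, b}.
preimage = {w for w in P if X[w] == 0}
print(sum(P[w] for w in preimage) == law[0])
```

Four inputs, three output values: the map and its law are clearly different objects in code as well.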
Cumulative Distribution Function
For a real-valued random variable $X$, the cumulative distribution function is
$$F_X(t) = P(X \le t) = P_X\big((-\infty, t]\big), \qquad t \in \mathbb{R}.$$
The CDF is one way to encode the pushforward law on the real line.
A random variable can be discrete, continuous, mixed, or neither in the elementary density/mass-function sense. The law always exists for a measurable real-valued random variable. A probability mass function or density exists only under extra structure.
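For a concrete discrete case, the CDF of a fair die can be computed directly from its mass function. A small Python sketch:

```python
def die_cdf(t):
    """CDF of a fair die: F(t) = P(X <= t) = (# faces <= t) / 6."""
    return sum(1 for k in range(1, 7) if k <= t) / 6

print(die_cdf(3.5))  # 0.5
print(die_cdf(0))    # 0.0, below the support
print(die_cdf(6))    # 1.0
# The jump at each face k has size P(X = k) = 1/6.
print(die_cdf(3) - die_cdf(2.999))  # ≈ 1/6
```

The step shape is the discrete signature: each probability mass appears as a jump, which is exactly how the CDF encodes the law.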
To construct a real-valued random variable carefully, keep the objects in this order:
- Choose the measurable outcome space $(\Omega, \mathcal{F})$ and the probability measure $P$.
- Define a numerical map $X : \Omega \to \mathbb{R}$.
- Check that $X^{-1}(B) \in \mathcal{F}$ for every Borel set $B$.
- Push $P$ through $X$ to get the law $P_X = P \circ X^{-1}$.
On a finite space with $\mathcal{F} = 2^\Omega$, the measurability check is automatic. In measure theory, it is not a decorative condition: it is what makes probabilities such as $P(X \in B)$ legal.
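The four-step order can be followed literally on a finite space. A Python sketch, using exact fractions to avoid rounding (the variable names are ours):

```python
from fractions import Fraction

# Step 1: outcome space; finite, so F = 2^Omega and P is a table.
Omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in Omega}

# Step 2: numerical map X (squared distance from the middle).
X = lambda w: (w - Fraction(7, 2)) ** 2

# Step 3: measurability check is automatic on a finite space with the
# full power set, so nothing to verify here.

# Step 4: push P through X to get the law P_X.
law = {}
for w in Omega:
    law[X(w)] = law.get(X(w), Fraction(0)) + P[w]
print(law)  # three squared distances, each with probability 1/3
```

Note that step 4 is where $P$ first enters; steps 1 to 3 only involve the measurable structure, mirroring the remark above that measurability does not depend on $P$.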
Pushforward Theorem
A Random Variable Induces a Probability Distribution
Statement
If $X : (\Omega, \mathcal{F}, P) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable, then
$$P_X(B) = P\big(X^{-1}(B)\big), \qquad B \in \mathcal{B}(\mathbb{R}),$$
defines a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
Intuition
To measure a set of output values $B$, pull it back through $X$ to the set of outcomes that land in $B$. The probability of that preimage is the probability assigned to $B$ under the distribution of $X$.
Proof Sketch
Measurability gives $X^{-1}(B) \in \mathcal{F}$, so $P(X^{-1}(B))$ is defined. Non-negativity follows from non-negativity of $P$. Normalization holds because $X^{-1}(\mathbb{R}) = \Omega$, so $P_X(\mathbb{R}) = P(\Omega) = 1$. For pairwise disjoint Borel sets $B_1, B_2, \dots$, their preimages are pairwise disjoint events and $X^{-1}\big(\bigcup_i B_i\big) = \bigcup_i X^{-1}(B_i)$. Countable additivity of $P$ gives countable additivity of $P_X$.
Why It Matters
This is why you can often forget the original sample space and reason only with the distribution of $X$. Expectations, quantiles, densities, and tail bounds are all statements about the pushforward law.
Failure Mode
The distribution does not remember everything about the original probability space. Two different random variables on different sample spaces can have the same law. Distributional equality does not mean the variables are the same function or even live on the same $\Omega$.
Generated Information
Sigma-Algebra Generated by a Random Variable
The sigma-algebra generated by $X$ is
$$\sigma(X) = \big\{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\big\}.$$
It is the information revealed by observing $X$. An event belongs to $\sigma(X)$ exactly when it can be decided from the value of $X$.
This is the clean way to say "what the variable tells you." If $Y = f(X)$ for some measurable function $f$, then $\sigma(Y) \subseteq \sigma(X)$: observing $X$ tells you at least as much as observing $Y$.
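On a finite space, $\sigma(X)$ can be enumerated outright: it consists of all unions of level sets $\{X = v\}$. This Python sketch (illustrative, not any standard library API) checks the containment $\sigma(Y) \subseteq \sigma(X)$ for the die with $Y$ the parity of $X$:

```python
from itertools import chain, combinations

Omega = frozenset(range(1, 7))

def generated_events(rule):
    """sigma(X) on a finite space: all unions of level sets {X = v}."""
    levels = {}
    for w in Omega:
        levels.setdefault(rule(w), set()).add(w)
    blocks = [frozenset(s) for s in levels.values()]
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset(chain.from_iterable(combo)))
    return events

X = lambda w: w        # face value
Y = lambda w: w % 2    # parity: Y = f(X) for a measurable f

sigma_X = generated_events(X)
sigma_Y = generated_events(Y)
print(sigma_Y <= sigma_X)          # True: X decides every event about Y
print(len(sigma_X), len(sigma_Y))  # 64 and 4
```

The size gap is the information gap: $X$ distinguishes all six faces (so $2^6 = 64$ events), while $Y$ only distinguishes even from odd (so $4$ events).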
Generated Sigma-Algebra Is the Smallest Information Making X Measurable
Statement
The sigma-algebra $\sigma(X)$ is the smallest sub-sigma-algebra $\mathcal{G} \subseteq \mathcal{F}$ such that $X$ is measurable as a map from $(\Omega, \mathcal{G})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
Intuition
To observe $X$, you must be able to answer every question of the form "did $X$ land in this Borel set?" The answers to those questions are exactly the preimages $X^{-1}(B)$.
Proof Sketch
Every set $X^{-1}(B)$ is needed for $X$ to be $\mathcal{G}$-measurable, so any sigma-algebra that makes $X$ measurable must contain all such preimages. The collection of those preimages is already closed under complements and countable unions because preimages preserve set operations, so it is itself a sigma-algebra. Therefore it is the smallest one making $X$ measurable.
Why It Matters
This is the bridge from random variables to information. Conditional expectation, filtrations, sufficient statistics, and data leakage all ask which sigma-algebra is available at decision time.
Failure Mode
Do not read $\sigma(X)$ as standard deviation. Here $\sigma$ means sigma-algebra: the collection of events whose truth can be decided from observing $X$.
Expectation
Expectation
For an integrable random variable $X$,
$$\mathbb{E}[X] = \int_\Omega X(\omega)\, dP(\omega) = \int_{\mathbb{R}} x\, dP_X(x).$$
The first integral views $X$ as a function on outcomes. The second views it through its distribution. They agree by the change-of-variables formula for pushforward measures.
For a finite sample space, this becomes the weighted average
$$\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega)\, P(\{\omega\}).$$
For a continuous density $f_X$, it becomes
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx.$$
The formula changes with representation; the object does not. This is why the reference page on common probability distributions can work directly with Bernoulli, Gaussian, and triangular laws without re-describing the original sample space every time.
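The two representations can be checked against each other numerically. A Python sketch for the fair die, computing $\mathbb{E}[X]$ once over outcomes and once over the law:

```python
from fractions import Fraction

Omega = range(1, 7)
P = {w: Fraction(1, 6) for w in Omega}
X = lambda w: w

# View 1: integrate over outcomes.
E_outcomes = sum(X(w) * P[w] for w in Omega)

# View 2: integrate over the pushforward law on the output line.
law = {}
for w in Omega:
    law[X(w)] = law.get(X(w), Fraction(0)) + P[w]
E_law = sum(x * p for x, p in law.items())

print(E_outcomes, E_law)  # both 7/2: the two representations agree
```

For an injective map like this one the two sums reorder the same terms; for a non-injective map the law-side sum merges outcomes that share an output value, which is exactly the change-of-variables step.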
Expectation Depends on the Pushforward Law
Statement
If $X$ is measurable and $g(X)$ is integrable, then
$$\mathbb{E}[g(X)] = \int_\Omega g(X(\omega))\, dP(\omega) = \int_{\mathbb{R}} g(x)\, dP_X(x).$$
Intuition
To average a function of $X$, you do not need the original outcome labels. You only need how much probability lands at each output value.
Proof Sketch
First prove the identity for indicator functions $g = \mathbf{1}_B$; it is exactly the definition of $P_X(B)$. Extend to nonnegative simple functions by linearity, to nonnegative measurable functions by monotone convergence, and to integrable signed functions by positive and negative parts.
Why It Matters
This theorem justifies computing means and moments from a distribution table, density, CDF, or simulation histogram. It is the rigorous reason the Probability Mechanics Lab can work with the output distribution after the map has been built.
Failure Mode
The formula requires integrability. A random variable can be measurable while $\mathbb{E}[g(X)^+]$ or $\mathbb{E}[g(X)^-]$ is infinite, leaving $\mathbb{E}[g(X)]$ undefined or infinite.
Equality Notions
There are three common equalities, and they are not interchangeable:
- $X = Y$ means the two functions agree at every outcome.
- $X = Y$ almost surely means $P(X = Y) = 1$.
- $X \stackrel{d}{=} Y$ means $P_X = P_Y$, so the variables have the same law.
Almost-sure equality is enough for most probabilistic calculations because probability-zero exceptions do not change integrals. Equality in distribution is weaker: two variables can have the same law while living on different sample spaces or being independent copies on the same space.
Random Variables as Right Triangles
If $X$ has finite second moment, it lives in the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ with inner product
$$\langle X, Y \rangle = \mathbb{E}[XY].$$
After centering, $X - \mathbb{E}[X]$ has squared length
$$\|X - \mathbb{E}[X]\|^2 = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathrm{Var}(X).$$
That is the bridge to right triangles. Conditional expectation is an orthogonal projection onto a smaller information space. The Pythagorean theorem becomes the law of total variance.
Law of Total Variance as a Right Triangle
Statement
Let $X \in L^2(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-sigma-algebra. Then
$$X - \mathbb{E}[X] = \big(\mathbb{E}[X \mid \mathcal{G}] - \mathbb{E}[X]\big) + \big(X - \mathbb{E}[X \mid \mathcal{G}]\big),$$
and the two terms on the right are orthogonal in $L^2$. Therefore
$$\mathrm{Var}(X) = \mathrm{Var}\big(\mathbb{E}[X \mid \mathcal{G}]\big) + \mathbb{E}\big[\mathrm{Var}(X \mid \mathcal{G})\big].$$
Intuition
The first leg is variation explained by the information $\mathcal{G}$. The second leg is residual variation left after that information is used. They meet at a right angle because a projection error is orthogonal to the projection space.
Proof Sketch
Conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the projection of $X$ onto the closed subspace of $\mathcal{G}$-measurable random variables in $L^2$. Projection geometry gives orthogonality between the projected component and the residual. Taking squared norms gives the Pythagorean identity; the two squared norms are exactly the two variance terms.
Why It Matters
This is the geometric reason behind ANOVA, bias-variance decompositions, random effects models, and the idea that a feature explains part of the variation in a target.
Failure Mode
This geometric statement needs $\mathbb{E}[X^2] < \infty$. Heavy-tailed random variables without finite variance can still be random variables, but variance is no longer a finite squared length.
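The identity can be verified exactly on a small example. This Python sketch takes two fair coin flips, lets $X$ be the number of heads, and conditions on the first flip (our hypothetical choice of $\mathcal{G}$), using exact fractions throughout:

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips; X = number of heads; G = info from the first flip.
outcomes = list(product([0, 1], repeat=2))
P = {w: Fraction(1, 4) for w in outcomes}
X = lambda w: w[0] + w[1]

def mean(f, ws):
    """Expectation of f over the outcomes ws, renormalized (conditional mean)."""
    total = sum(P[w] for w in ws)
    return sum(f(w) * P[w] for w in ws) / total

EX = mean(X, outcomes)
Var_X = mean(lambda w: X(w) ** 2, outcomes) - EX ** 2

explained = Fraction(0)  # Var(E[X | G]): variation between the G-blocks
residual = Fraction(0)   # E[Var(X | G)]: variation left within each block
for first in (0, 1):
    block = [w for w in outcomes if w[0] == first]
    p_block = sum(P[w] for w in block)
    m = mean(X, block)                                  # E[X | first flip]
    v = mean(lambda w: X(w) ** 2, block) - m ** 2       # Var(X | first flip)
    explained += p_block * (m - EX) ** 2
    residual += p_block * v

print(Var_X, explained + residual)  # both 1/2: the legs add up
```

Here each leg contributes $1/4$: knowing the first flip explains half the variance of the head count, and the second flip supplies the rest.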
Common Confusions
A random variable is not random in the programming sense
In mathematics, $X$ is a fixed function. Randomness enters because the input $\omega$ is sampled according to $P$. A simulation may resample an outcome, but the random variable itself is the rule mapping outcomes to numbers.
The distribution is not the random variable
Two random variables can have the same distribution while being different functions. For example, on two independent coin flips, $X$ can read the first coin and $Y$ can read the second coin. Both are Bernoulli with the same parameter, but $X$ and $Y$ are different random variables.
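That two-coin example is easy to check by enumeration. A Python sketch (the helper names are ours):

```python
from fractions import Fraction
from itertools import product

# Two independent fair coin flips; outcome w = (first, second).
outcomes = list(product([0, 1], repeat=2))
P = {w: Fraction(1, 4) for w in outcomes}
X = lambda w: w[0]   # reads the first coin
Y = lambda w: w[1]   # reads the second coin

def law(Z):
    """Pushforward of P through Z, as a dict from values to probabilities."""
    values = {Z(w) for w in outcomes}
    return {v: sum(P[w] for w in outcomes if Z(w) == v) for v in values}

print(law(X) == law(Y))                     # True: same Bernoulli(1/2) law
print(all(X(w) == Y(w) for w in outcomes))  # False: different functions
```

Same law, different rules: equality in distribution says nothing about agreeing outcome by outcome.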
Probability zero does not mean impossible
If $X$ is continuous, then $P(X = x) = 0$ for every fixed $x$, but $X$ still takes some value with probability one. A point can have zero probability without being excluded from the sample space.
Exercises
Problem
Let $\Omega = \{a, b, c, d\}$ with probabilities $0.1, 0.2, 0.3, 0.4$. Define $X(a) = X(b) = 1$, $X(c) = 2$, and $X(d) = 3$. Find the distribution of $X$.
Problem
Suppose $X$ and $Y$ satisfy $X \stackrel{d}{=} Y$. Must $X = Y$ almost surely?
Problem
Let $X$ be a finite-variance random variable and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-sigma-algebra. Prove that $\mathbb{E}\big[(X - \mathbb{E}[X \mid \mathcal{G}])\, Z\big] = 0$ for every bounded $\mathcal{G}$-measurable random variable $Z$.
Problem
Let $X$ be a nonnegative random variable and $g$ be nonnegative and measurable. Sketch why $\mathbb{E}[g(X)] = \int_{\mathbb{R}} g(x)\, dP_X(x)$ follows from the identity for indicator functions.
References
Canonical:
- Billingsley, Probability and Measure (1995), Chapters 1-5
- Durrett, Probability: Theory and Examples (2019), Chapters 1-2
- Williams, Probability with Martingales (1991), Chapters 1-5
- Pollard, A User's Guide to Measure Theoretic Probability (2002), Chapters 1-3
- Kallenberg, Foundations of Modern Probability (2021), Chapters 1-2
For intuition and applications:
- Blitzstein and Hwang, Introduction to Probability (2019), Chapters 1-4
- Grimmett and Stirzaker, Probability and Random Processes (2020), Chapters 1-3
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
Last reviewed: April 22, 2026
Prerequisites
Foundations this topic depends on.
- Kolmogorov Probability Axioms (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)