

Kolmogorov Probability Axioms

The three axioms (non-negativity, normalization, countable additivity) that every probability claim on this site implicitly invokes. Sample space, event sigma-algebra, probability measure, and the immediate consequences.


Why This Matters

Every probabilistic statement on this site, from a single-parameter Bernoulli likelihood to the convergence guarantee of stochastic gradient descent, rests on three axioms written down by Kolmogorov in 1933. The axioms do not say what probability means; they say what any consistent assignment of probabilities must satisfy. Frequentist long-run frequencies, Bayesian degrees of belief, and classical equally-likely-outcomes interpretations all produce probabilities obeying the same axioms. The interpretation is philosophical; the axioms are mathematical.

This page fixes the notation that makes every later result unambiguous. When measure-theoretic probability writes "let $(\Omega, \mathcal{F}, P)$ be a probability space," this page is what that phrase means.

The Three Objects

Probability requires three objects, fixed before any random variable is introduced.

Definition

Sample Space

The sample space $\Omega$ is a non-empty set whose elements $\omega \in \Omega$ are called outcomes. An outcome represents a complete specification of one possible result of the random experiment. The set $\Omega$ is the entire space of "what could happen."

Definition

Event Sigma-Algebra

A collection $\mathcal{F}$ of subsets of $\Omega$ is a sigma-algebra (or event space) if:

  1. $\Omega \in \mathcal{F}$,
  2. $A \in \mathcal{F} \implies A^c \in \mathcal{F}$ (closed under complements),
  3. $A_1, A_2, \ldots \in \mathcal{F} \implies \bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable unions).

Elements of $\mathcal{F}$ are called events. Closure under complements and countable unions automatically gives closure under countable intersections, set differences, and limits.

Definition

Probability Measure

A probability measure is a function $P : \mathcal{F} \to [0, 1]$ satisfying the three Kolmogorov axioms below. The triple $(\Omega, \mathcal{F}, P)$ is a probability space.

The reason events live in a sigma-algebra rather than $2^\Omega$ is that not every subset of an uncountable $\Omega$ can be assigned a probability consistently. The Vitali set on $[0, 1]$ has no Lebesgue measure; trying to define $P$ on it gives a contradiction. The sigma-algebra restricts attention to a collection of subsets on which $P$ can be defined coherently.
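On a finite sample space none of these subtleties arise: the full power set serves as the sigma-algebra, and a probability measure is determined by non-negative outcome weights summing to 1. A minimal sketch (the die example and names are illustrative, not from the page):

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})    # sample space: one roll of a die
weights = {w: 1 / 6 for w in omega}      # uniform outcome weights

def power_set(s):
    """All subsets of s: on a finite space we can take F = 2^Omega."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def P(event):
    """Probability measure: add up the weights of the outcomes in the event."""
    return sum(weights[w] for w in event)

F = power_set(omega)
print(len(F))        # 64 events: |2^Omega| = 2^6
print(P(omega))      # normalization: sums to 1 (up to float rounding)
```

On an uncountable space such as $[0, 1]$ this construction is exactly what fails, which is why the sigma-algebra has to be chosen smaller than the power set.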

The Three Axioms

Theorem

Kolmogorov Axioms of Probability

Statement

$P : \mathcal{F} \to \mathbb{R}$ is a probability measure if and only if it satisfies:

  1. Non-negativity. $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalization. $P(\Omega) = 1$.
  3. Countable additivity. For any countable collection of pairwise disjoint events $A_1, A_2, \ldots \in \mathcal{F}$ (so $A_i \cap A_j = \emptyset$ for $i \neq j$), $$P\!\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty P(A_n).$$

Intuition

Axiom 1 rules out negative probability. Axiom 2 fixes the total mass at 1 (otherwise we'd be doing measure theory, not probability theory). Axiom 3 says probability behaves like a mass: putting probability on a countable disjoint union is the same as adding the masses on each piece. The choice of countable (not just finite) additivity is what gives probability its analytic strength: it forces continuity properties used in every limit theorem.
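On a finite space the axioms can be checked exhaustively. A minimal sketch (the weights are illustrative, not from the page; finite additivity stands in for countable additivity, which only has extra content on infinite spaces):

```python
from itertools import combinations

omega = ["a", "b", "c"]
weights = {"a": 0.5, "b": 0.3, "c": 0.2}   # non-negative, sums to 1

events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]

def P(A):
    return sum(weights[w] for w in A)

# Axiom 1: non-negativity.
assert all(P(A) >= 0 for A in events)
# Axiom 2: normalization.
assert abs(P(frozenset(omega)) - 1.0) < 1e-12
# Axiom 3, finite form: additivity over disjoint events.
for A in events:
    for B in events:
        if not (A & B):
            assert abs(P(A | B) - (P(A) + P(B))) < 1e-12
print("all axiom checks passed")
```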

Proof Sketch

This is a definition disguised as a theorem; there is nothing to prove. The content is in the consequences below, all of which follow from these three axioms by elementary set manipulation.

Why It Matters

Every result in probability and statistics derives from these three axioms plus the structure of the chosen $(\Omega, \mathcal{F})$. The axioms are deliberately weak: they make no claim about how to assign probabilities to specific events, only about consistency requirements any such assignment must meet. This is what allows Bayesians, frequentists, and decision theorists to share a mathematical foundation while disagreeing on interpretation.

Failure Mode

A function satisfying only finite additivity (sums for finite disjoint unions) is a finitely additive probability, not a (countably additive) probability measure. Finitely additive probabilities exist on any algebra, but they fail the continuity properties below and break the dominated convergence theorem. Real-valued probability theory uses countable additivity because the analytic payoff (limit theorems, Lebesgue integration of expectations) is enormous. The cost is that not every subset of an uncountable $\Omega$ is an event.

Immediate Consequences

The next four properties follow directly from the three axioms. Every later page uses them without proof.

Probability of the empty set: $P(\emptyset) = 0$. Proof. Apply axiom 3 to $\Omega = \Omega \cup \emptyset \cup \emptyset \cup \cdots$ to get $1 = 1 + P(\emptyset) + P(\emptyset) + \cdots$, forcing $P(\emptyset) = 0$.

Finite additivity: $P(A \cup B) = P(A) + P(B)$ when $A \cap B = \emptyset$. Proof. Apply countable additivity to $A_1 = A$, $A_2 = B$, $A_n = \emptyset$ for $n \geq 3$, using $P(\emptyset) = 0$.

Complement rule: $P(A^c) = 1 - P(A)$. Proof. $A$ and $A^c$ are disjoint with $A \cup A^c = \Omega$, so $P(A) + P(A^c) = P(\Omega) = 1$.

Monotonicity: $A \subseteq B \implies P(A) \leq P(B)$. Proof. Write $B = A \cup (B \setminus A)$ as a disjoint union, so $P(B) = P(A) + P(B \setminus A) \geq P(A)$ by axiom 1.

A useful corollary: $P(A) \in [0, 1]$ for every event $A$. The codomain $[0, 1]$ in the definition is forced by the axioms, not assumed.
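These consequences can be spot-checked numerically. A small sketch with illustrative weights (not from the page):

```python
omega = frozenset({"a", "b", "c", "d"})
weights = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def P(A):
    """Probability measure on the finite space: sum of outcome weights."""
    return sum(weights[w] for w in A)

A = frozenset({"a", "b"})
B = frozenset({"a", "b", "c"})                  # A is a subset of B

print(P(frozenset()))                           # empty set: 0
print(abs((1 - P(A)) - P(omega - A)) < 1e-12)   # complement rule: True
print(P(A) <= P(B))                             # monotonicity: True
```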

Inclusion-Exclusion

For finite unions of overlapping events, additivity needs a correction.

Theorem

Inclusion-Exclusion Principle

Statement

For events $A_1, \ldots, A_n \in \mathcal{F}$,

$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^n (-1)^{k+1} \sum_{1 \leq i_1 < \cdots < i_k \leq n} P\!\left(A_{i_1} \cap \cdots \cap A_{i_k}\right).$$

For $n = 2$: $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$. For $n = 3$: $P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1 \cap A_2) - P(A_1 \cap A_3) - P(A_2 \cap A_3) + P(A_1 \cap A_2 \cap A_3)$.
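The $n = 3$ case can be verified by direct enumeration. A sketch with illustrative events on a fair die (the events are arbitrary choices, not from the page):

```python
outcomes = set(range(1, 7))               # fair die, P({w}) = 1/6
A1 = {2, 4, 6}                            # even
A2 = {1, 2, 3}                            # at most 3
A3 = {3, 6}                               # divisible by 3

def P(E):
    return len(E) / 6

lhs = P(A1 | A2 | A3)
rhs = (P(A1) + P(A2) + P(A3)
       - P(A1 & A2) - P(A1 & A3) - P(A2 & A3)
       + P(A1 & A2 & A3))
print(abs(lhs - rhs) < 1e-12)             # True: both sides equal 5/6
```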

Intuition

Adding $P(A_1) + P(A_2)$ double-counts the overlap, so subtract $P(A_1 \cap A_2)$. With three sets, subtracting the three pairwise overlaps removes the triple overlap once too often, so add it back. The alternating sign pattern generalizes this bookkeeping to any finite $n$.

Proof Sketch

Induction on $n$. Base case $n = 2$ follows by writing $A_1 \cup A_2 = A_1 \cup (A_2 \setminus A_1)$ as a disjoint union and using $P(A_2 \setminus A_1) = P(A_2) - P(A_1 \cap A_2)$. Inductive step: apply the $n = 2$ case to $A_1 \cup \cdots \cup A_{n-1}$ and $A_n$, distribute intersections, and collect terms.

Why It Matters

Inclusion-exclusion is the workhorse for computing union probabilities when you can compute intersections. It appears in derangement counts, the union bound (a one-term truncation), and inclusion-exclusion bounds in combinatorial probability. The Bonferroni inequalities are obtained by truncating the alternating sum after an even or odd number of terms, yielding lower or upper bounds.
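The Bonferroni pattern can be seen numerically: truncating the alternating sum after an odd number of terms gives an upper bound, after an even number a lower bound. A sketch with illustrative overlapping events on a uniform 20-point space (not from the page):

```python
from itertools import combinations

n_points = 20
events = [set(range(i, i + 8)) for i in (0, 4, 8, 12)]  # overlapping blocks

def P(E):
    return len(E) / n_points

exact = P(set().union(*events))
partials, running = [], 0.0
for k in range(1, len(events) + 1):
    # k-th inclusion-exclusion term: sum over all k-wise intersections
    term = sum(P(set.intersection(*(events[i] for i in idx)))
               for idx in combinations(range(len(events)), k))
    running += (-1) ** (k + 1) * term
    partials.append(running)
    side = "upper" if k % 2 == 1 else "lower"
    print(f"k={k}: partial sum {running:.3f} ({side} bound on {exact:.3f})")
```

The $k = 1$ truncation is exactly the union bound mentioned above.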

Failure Mode

The number of terms grows as $2^n - 1$. For large $n$, computing every intersection probability is infeasible, and inclusion-exclusion becomes a theoretical tool rather than a computational one. The union bound $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ is the cheap one-sided alternative used throughout learning theory.

Continuity of Probability

Countable additivity is equivalent (given finite additivity and non-negativity) to a continuity property: probabilities respect monotone limits of events.

Theorem

Continuity of Probability Measures

Statement

$P$ is countably additive (and hence a probability measure) if and only if both of the following hold:

Continuity from below. For any increasing sequence $A_1 \subseteq A_2 \subseteq \cdots$ with $A_n \in \mathcal{F}$,

$$P\!\left(\bigcup_{n=1}^\infty A_n\right) = \lim_{n \to \infty} P(A_n).$$

Continuity from above. For any decreasing sequence $B_1 \supseteq B_2 \supseteq \cdots$ with $B_n \in \mathcal{F}$,

$$P\!\left(\bigcap_{n=1}^\infty B_n\right) = \lim_{n \to \infty} P(B_n).$$

Intuition

"Increasing union" means the events grow to fill out their limit; the probabilities should grow to fill out the limit's probability. Without continuity, a sequence of events could grow to include "more and more" of $\Omega$ while their probabilities stayed pinned below the union's probability, which would break every limit argument in probability.

Proof Sketch

Decompose the increasing union as a countable disjoint union: $A_\infty = A_1 \sqcup (A_2 \setminus A_1) \sqcup (A_3 \setminus A_2) \sqcup \cdots$. Apply countable additivity: $P(A_\infty) = P(A_1) + \sum_n P(A_{n+1} \setminus A_n)$. The partial sums telescope to $P(A_n)$, so the limit equals $P(A_\infty)$. Continuity from above follows by passing to complements.
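Continuity from below can be watched numerically. A sketch with an illustrative measure (not from the page): put $P(\{k\}) = 2^{-k}$ on the positive integers, and let $A_n = \{1, \ldots, n\}$ increase to all of $\mathbb{N}$; then $P(A_n) = 1 - 2^{-n}$ climbs to $P(\mathbb{N}) = 1$.

```python
def P_segment(n):
    """P(A_n) for A_n = {1, ..., n} under the measure P({k}) = 2^{-k}."""
    return sum(2.0 ** (-k) for k in range(1, n + 1))

for n in (1, 2, 5, 10, 30):
    print(n, P_segment(n))   # 0.5, 0.75, 0.96875, ... -> 1
```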

Why It Matters

This is what countable additivity buys you. Every "$P(\text{eventually}) = \lim P(A_n)$" argument, every interchange of limit and probability, every dominated convergence application for indicator functions, sits on this continuity. The Borel-Cantelli lemmas and the modes of convergence of random variables both rely on it.

Failure Mode

Continuity from above requires the events to be decreasing and at least one of them to have finite measure. For probability measures this is automatic (everything has measure at most 1), but for general measures (like Lebesgue measure on $\mathbb{R}$) the assumption is necessary. Example: the sets $B_n = [n, \infty)$ decrease to $\emptyset$ but each has infinite Lebesgue measure.

Why Countable, Not Finite, Additivity

Finite additivity is the version most people guess when first writing down probability axioms. Why does the standard formulation insist on countable additivity?

Three reasons:

  1. Limit theorems. Without continuity from below, the law of large numbers cannot be stated as a single claim about a sequence of averages. The proofs of the law of large numbers and the central limit theorem manipulate countable unions of events (e.g. "the average is eventually within $\varepsilon$"), which only countable additivity can handle.

  2. Lebesgue integration. The expected value $\mathbb{E}[X] = \int X \, dP$ is the Lebesgue integral against $P$. Lebesgue's monotone and dominated convergence theorems require countable additivity. Without them, you cannot interchange $\lim$ and $\mathbb{E}$, which is required in nearly every consistency proof.

  3. Probability of unions of rare events. Countable additivity implies countable subadditivity, $P(\bigcup_n A_n) \leq \sum_n P(A_n)$: countably many rare events whose probabilities have a small sum still have a rare union. This is the engine of the union bounds used throughout learning theory.

The price is the existence of non-measurable sets. On uncountable $\Omega$, not every subset is in $\mathcal{F}$. For all of probability and statistics, this is a fair trade.

Common Confusions

Watch Out

Probability zero is not the same as impossible

For a continuous random variable $X$ with density $f_X$, $P(X = x) = 0$ for every fixed $x$, yet $X$ takes some value with probability 1. "Probability zero" means the event has measure zero, not that it cannot happen. Symmetrically, "probability one" (almost sure) does not mean "always," only that the exceptional set has measure zero.
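The point can be illustrated with a continuous sampler (a sketch; the target value and sample size are arbitrary illustrative choices): every draw from Uniform(0, 1) lands on some point, yet the empirical frequency of hitting any fixed point is zero.

```python
import random

random.seed(0)
target = 0.5
draws = (random.random() for _ in range(10 ** 6))
hits = sum(1 for x in draws if x == target)
# Each fixed double-precision value has probability about 2**-53 per draw,
# so even a million draws are overwhelmingly likely to produce zero hits.
print(hits)
```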

Watch Out

Sigma-algebras are not optional bookkeeping

Many introductory treatments hide the sigma-algebra to keep notation light, writing $P(\text{anything})$ as if every subset were an event. This works on discrete or finite $\Omega$, where you can take $\mathcal{F} = 2^\Omega$. On $\mathbb{R}$ or $\mathbb{R}^d$, the standard well-behaved choice is the Borel sigma-algebra generated by the open sets, which excludes pathological sets like the Vitali construction. Pretending $\mathcal{F} = 2^\Omega$ on $\mathbb{R}$ is what causes Banach-Tarski-style paradoxes when you try to define a uniform probability on $[0, 1]$.

Watch Out

The axioms do not pick an interpretation

The axioms tell you what arithmetic probabilities must obey. They do not tell you whether $P(\text{coin lands heads}) = 0.5$ means a long-run frequency, a betting rate, or a degree of belief. Frequentists, Bayesians, and subjectivists all use the same Kolmogorov axioms; they differ on what the numbers refer to. The mathematics is consistent across interpretations because the axioms are interpretation-free.

Summary

  • A probability space is a triple $(\Omega, \mathcal{F}, P)$: a sample space, an event sigma-algebra, and a probability measure.
  • The three axioms are non-negativity, normalization, and countable additivity.
  • Immediate consequences: $P(\emptyset) = 0$, the complement rule, monotonicity, finite additivity.
  • Inclusion-exclusion handles finite unions of overlapping events; the union bound is its one-sided cheap relative.
  • Countable additivity is equivalent to continuity of probability for monotone sequences of events; this continuity is what makes limit theorems possible.
  • The axioms are silent on interpretation: frequentist, Bayesian, and classical accounts of probability all satisfy them.

Exercises

Exercise (Core)

Problem

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $A, B \in \mathcal{F}$. Prove that $P(A \cup B) \leq P(A) + P(B)$ (the two-event union bound), with equality if and only if $A \cap B$ has probability zero.

Exercise (Advanced)

Problem

Construct a finitely additive probability on $\mathbb{N}$ that is not countably additive, by assigning $P(\{n\}) = 0$ for every singleton $n$ but $P(\mathbb{N}) = 1$. (Such a $P$ exists, by appeal to the Hahn-Banach theorem or a free ultrafilter on $\mathbb{N}$.) Then explain which Kolmogorov axiom this violates and why it cannot be a probability measure in the standard sense.

References

Original:

  • Kolmogorov, "Grundbegriffe der Wahrscheinlichkeitsrechnung" (Springer, 1933); English translation "Foundations of the Theory of Probability" (Chelsea, 1956), Chapter 1

Standard graduate texts:

  • Billingsley, "Probability and Measure" (3rd edition, Wiley, 1995), Sections 2-3
  • Durrett, "Probability: Theory and Examples" (5th edition, Cambridge, 2019), Section 1.1
  • Williams, "Probability with Martingales" (Cambridge, 1991), Chapter 1
  • Resnick, "A Probability Path" (Birkhauser, 1999), Chapters 1-2

Real analysis perspective:

  • Folland, "Real Analysis: Modern Techniques and Their Applications" (2nd edition, Wiley, 1999), Chapter 1
  • Rudin, "Real and Complex Analysis" (3rd edition, McGraw-Hill, 1987), Chapter 1

Last reviewed: April 18, 2026
