Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.Disclaimer

Foundations

Sets, Functions, and Relations

The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.

CoreTier 1Stable~40 min

Why This Matters

Every mathematical object in ML theory is built on sets and functions. A hypothesis class is a set. A loss function is a function. Feature maps, kernels, probability measures: all defined in terms of sets, functions, and relations. Without precise language for these, no theorem statement is unambiguous. This page assumes familiarity with basic logic and proof techniques.

Core Definitions

Definition

Set

A set is a collection of distinct objects. We write xAx \in A to mean xx is an element of AA. The empty set is \emptyset. Standard constructions: union ABA \cup B, intersection ABA \cap B, complement AcA^c, difference ABA \setminus B.

Definition

Cartesian Product

The Cartesian product of sets AA and BB is the set of all ordered pairs:

A×B={(a,b):aA,bB}A \times B = \{(a, b) : a \in A, \, b \in B\}

More generally, A1××AnA_1 \times \cdots \times A_n is the set of nn-tuples. The space Rd\mathbb{R}^d is R××R\mathbb{R} \times \cdots \times \mathbb{R} (dd times).

Definition

Function

A function f:ABf: A \to B assigns to each element aAa \in A exactly one element f(a)Bf(a) \in B. The set AA is the domain, BB is the codomain. The image of SAS \subseteq A is f(S)={f(a):aS}f(S) = \{f(a) : a \in S\}. The preimage of TBT \subseteq B is f1(T)={aA:f(a)T}f^{-1}(T) = \{a \in A : f(a) \in T\}.

Definition

Injective, Surjective, Bijective

A function f:ABf: A \to B is:

  • Injective (one-to-one): f(a1)=f(a2)    a1=a2f(a_1) = f(a_2) \implies a_1 = a_2
  • Surjective (onto): for every bBb \in B, there exists aAa \in A with f(a)=bf(a) = b
  • Bijective: both injective and surjective. A bijection has an inverse f1:BAf^{-1}: B \to A.
Definition

Equivalence Relation

A relation \sim on a set AA is an equivalence relation if it is:

  1. Reflexive: aaa \sim a for all aAa \in A
  2. Symmetric: ab    baa \sim b \implies b \sim a
  3. Transitive: aba \sim b and bc    acb \sim c \implies a \sim c

The equivalence class of aa is [a]={bA:ba}[a] = \{b \in A : b \sim a\}.

Definition

Quotient Set

The quotient set A/A / {\sim} is the set of all equivalence classes {[a]:aA}\{[a] : a \in A\}. The equivalence classes partition AA into disjoint, exhaustive subsets.

Power Sets and Cardinality

Definition

Power Set

The power set of AA, written P(A)\mathcal{P}(A) or 2A2^A, is the set of all subsets of AA:

P(A)={S:SA}\mathcal{P}(A) = \{S : S \subseteq A\}

If A=n|A| = n, then P(A)=2n|\mathcal{P}(A)| = 2^n. The power set always includes \emptyset and AA itself. Power sets appear in measure theory when defining sigma-algebras: a sigma-algebra on AA is a subset of P(A)\mathcal{P}(A) closed under complements and countable unions.

Definition

Cardinality

Two sets AA and BB have the same cardinality (written A=B|A| = |B|) if there exists a bijection f:ABf: A \to B. For finite sets, cardinality is the count of elements. For infinite sets, cardinality distinguishes countable sets (bijectable with N\mathbb{N}) from uncountable ones (like R\mathbb{R}). See cardinality and countability for a full treatment.

Function Composition and Inverses

Definition

Function Composition

Given f:ABf: A \to B and g:BCg: B \to C, the composition gf:ACg \circ f: A \to C is defined by (gf)(a)=g(f(a))(g \circ f)(a) = g(f(a)). Composition is associative: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f. It is not commutative in general.

Key facts about composition and injectivity/surjectivity:

  • If gfg \circ f is injective, then ff is injective (but gg need not be).
  • If gfg \circ f is surjective, then gg is surjective (but ff need not be).
  • If ff and gg are both bijective, then gfg \circ f is bijective and (gf)1=f1g1(g \circ f)^{-1} = f^{-1} \circ g^{-1}.

These facts are used constantly in linear algebra when reasoning about compositions of linear maps and matrix products.

Definition

Left and Right Inverses

A function f:ABf: A \to B has a left inverse g:BAg: B \to A if gf=idAg \circ f = \text{id}_A, and a right inverse h:BAh: B \to A if fh=idBf \circ h = \text{id}_B. A function has a left inverse if and only if it is injective (assuming the axiom of choice for the converse). A function has a right inverse if and only if it is surjective (requires the axiom of choice).

Relations and Partial Orders

Definition

Relation

A relation from AA to BB is a subset RA×BR \subseteq A \times B. We write aRba \mathrel{R} b to mean (a,b)R(a, b) \in R. A function f:ABf: A \to B is a special case: a relation where each aAa \in A appears in exactly one pair.

Definition

Partial Order

A relation \leq on a set AA is a partial order if it is:

  1. Reflexive: aaa \leq a for all aAa \in A
  2. Antisymmetric: aba \leq b and bab \leq a implies a=ba = b
  3. Transitive: aba \leq b and bcb \leq c implies aca \leq c

A set with a partial order is called a poset (partially ordered set). A total order is a partial order where every pair is comparable: for all a,ba, b, either aba \leq b or bab \leq a.

Partial orders appear throughout ML theory. The subset relation \subseteq on P(A)\mathcal{P}(A) is a partial order. The divisibility relation on N\mathbb{N} is a partial order. Lattices (posets where every pair has a least upper bound and greatest lower bound) appear in formal concept analysis and information theory.

Canonical Examples

Example

Injective, surjective, and bijective functions

Consider f:RRf: \mathbb{R} \to \mathbb{R} defined by f(x)=x2f(x) = x^2.

  • Not injective: f(2)=f(2)=4f(2) = f(-2) = 4.
  • Not surjective: 1-1 has no preimage.

Restrict the domain: g:[0,)[0,)g: [0, \infty) \to [0, \infty) defined by g(x)=x2g(x) = x^2. Now gg is bijective. The inverse is g1(y)=yg^{-1}(y) = \sqrt{y}.

This pattern (restricting domain and codomain to make a function bijective) is used when defining the inverse of activation functions and when constructing normalizing flows in generative models.

Example

Equivalence relations in ML

Define a relation on classifiers: h1h2h_1 \sim h_2 if h1(x)=h2(x)h_1(x) = h_2(x) for all xx in the training set. This is an equivalence relation (reflexive, symmetric, transitive). The equivalence classes partition the hypothesis class into groups of classifiers that agree on training data but may differ on unseen data. The VC dimension counts how many distinct equivalence classes exist when varying the training set.

Example

Cartesian products in feature spaces

If a model takes both an image feature vector xR512x \in \mathbb{R}^{512} and a text feature vector yR768y \in \mathbb{R}^{768}, the combined feature space is R512×R768=R1280\mathbb{R}^{512} \times \mathbb{R}^{768} = \mathbb{R}^{1280}. The Cartesian product formalizes feature concatenation. More generally, a dataset of nn examples with dd features lives in (Rd)n(\mathbb{R}^d)^n, the nn-fold Cartesian product.

Main Theorems

Theorem

Cantor's Theorem

Statement

There is no surjection from AA onto its power set P(A)\mathcal{P}(A). In particular, A<P(A)|A| < |\mathcal{P}(A)| for every set AA.

Intuition

No matter how large a set is, its collection of subsets is strictly larger. This is why R\mathbb{R} is uncountable: R=P(N)|\mathbb{R}| = |\mathcal{P}(\mathbb{N})| and Cantor's theorem gives N<P(N)|\mathbb{N}| < |\mathcal{P}(\mathbb{N})|.

Proof Sketch

Suppose f:AP(A)f: A \to \mathcal{P}(A) is surjective. Define D={aA:af(a)}D = \{a \in A : a \notin f(a)\}. Since ff is surjective, D=f(d)D = f(d) for some dd. Then dD    df(d)=Dd \in D \iff d \notin f(d) = D, a contradiction.

Why It Matters

Cantor's theorem establishes that infinite sets come in different sizes. This underpins the distinction between countable and uncountable sets, which matters when defining probability measures and hypothesis classes.

Failure Mode

The proof is constructive and has no hidden assumptions. It works for all sets, finite or infinite. The only subtlety: it requires the axiom schema of specification from ZFC to form the set DD.

Theorem

Schroeder-Bernstein Theorem

Statement

If there exist injections f:ABf: A \to B and g:BAg: B \to A, then there exists a bijection h:ABh: A \to B. Equivalently, AB|A| \leq |B| and BA|B| \leq |A| implies A=B|A| = |B|.

Intuition

If each set can be embedded into the other, they have the same size. This seems obvious for finite sets but is nontrivial for infinite sets. The proof constructs the bijection explicitly by partitioning AA into elements that "originate" from AA (use ff) and elements that "originate" from BB (use g1g^{-1}).

Proof Sketch

Define chains by iterating gfg \circ f starting from elements of Ag(B)A \setminus g(B). An element aAa \in A is "A-originated" if it lies in such a chain and "B-originated" otherwise. Define h(a)=f(a)h(a) = f(a) if aa is A-originated and h(a)=g1(a)h(a) = g^{-1}(a) if aa is B-originated (here g1g^{-1} is well-defined on B-originated elements because they are in the image of gg). Verify hh is a bijection.

Why It Matters

This theorem is the standard tool for proving two infinite sets have equal cardinality. Instead of constructing a bijection directly (often hard), you construct two injections (often easier). It is used in the proof that R=P(N)|\mathbb{R}| = |\mathcal{P}(\mathbb{N})| and in arguments about the cardinality of hypothesis classes.

Failure Mode

The theorem does not extend to partial orders in general. Having ABA \leq B and BAB \leq A does not imply A=BA = B for arbitrary preorders. The result is specific to cardinality comparisons. Note also that the proof does not require the axiom of choice.

Common Confusions

Watch Out

Preimage is not the inverse function

The notation f1(T)f^{-1}(T) for preimage does not require ff to be invertible. Every function has preimages of sets. Only bijections have inverse functions.

Watch Out

Surjective onto the codomain, not the image

Every function is surjective onto its image. Surjectivity is a statement about the codomain. f:RRf: \mathbb{R} \to \mathbb{R} defined by f(x)=x2f(x) = x^2 is not surjective because 1-1 has no preimage, even though ff maps onto [0,)[0, \infty).

Watch Out

Equivalence relations are not partial orders

Both equivalence relations and partial orders are reflexive and transitive. The difference is the second axiom: equivalence relations are symmetric (ab    baa \sim b \implies b \sim a), while partial orders are antisymmetric (aba \leq b and bab \leq a implies a=ba = b). These are mutually exclusive except for the identity relation. In probability, "same distribution" is an equivalence relation; "stochastically dominates" is a partial order.

Watch Out

A function is not its graph

Formally, a function f:ABf: A \to B is often defined as its graph {(a,f(a)):aA}A×B\{(a, f(a)) : a \in A\} \subseteq A \times B together with the requirement that each aa appears in exactly one pair. But in practice, specifying domain and codomain matters. The function f:RRf: \mathbb{R} \to \mathbb{R} given by f(x)=x2f(x) = x^2 and the function g:R[0,)g: \mathbb{R} \to [0, \infty) given by g(x)=x2g(x) = x^2 have the same graph but different properties: gg is surjective, ff is not.

Exercises

ExerciseCore

Problem

Let f:ABf: A \to B and g:BCg: B \to C. Prove that if gfg \circ f is injective, then ff is injective.

ExerciseAdvanced

Problem

Let \sim be an equivalence relation on AA and let π:AA/\pi: A \to A/{\sim} be the canonical projection π(a)=[a]\pi(a) = [a]. Prove that π\pi is surjective. Under what condition is π\pi also injective?

ExerciseCore

Problem

Let A={1,2,3}A = \{1, 2, 3\}. List all elements of P(A)\mathcal{P}(A). Verify that P(A)=2A|\mathcal{P}(A)| = 2^{|A|}.

ExerciseCore

Problem

Define f:ZNf: \mathbb{Z} \to \mathbb{N} by f(n)=2nf(n) = 2n if n0n \geq 0 and f(n)=2n1f(n) = -2n - 1 if n<0n < 0. Prove that ff is a bijection, establishing Z=N|\mathbb{Z}| = |\mathbb{N}|.

ExerciseAdvanced

Problem

Let f:ABf: A \to B be a function. Prove that for any subsets S,TAS, T \subseteq A:

  1. f(ST)=f(S)f(T)f(S \cup T) = f(S) \cup f(T)
  2. f(ST)f(S)f(T)f(S \cap T) \subseteq f(S) \cap f(T)
  3. Give an example showing that equality need not hold in (2). Under what condition on ff does equality hold?

References

Canonical:

  • Halmos, Naive Set Theory (1960), Chapters 1-8. The standard concise introduction.
  • Enderton, Elements of Set Theory (1977), Chapters 1-3. More detailed than Halmos.
  • Munkres, Topology (2000), Chapter 1. Sets, functions, and relations at the level needed for analysis.

For ML context:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Appendix A
  • Axler, Linear Algebra Done Right (2024), Chapter 1A. Uses set-theoretic language for vector spaces.
  • Billingsley, Probability and Measure (1995), Chapter 1. Sets and sigma-algebras for measure theory.

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.

Builds on This

Next Topics