Sets, Functions, and Relations

Every mathematical object in ML theory is built on sets and functions. A hypothesis class is a set. A loss function is a function. Feature maps, kernels, probability measures: all defined in terms of sets, functions, and relations. Without precise language for these, no theorem statement is unambiguous. This page assumes familiarity with basic logic and proof techniques.

The membership symbol $\in$ appears in nearly every line below; the subset symbol $\subseteq$ follows close behind. A bijection is the cleanest kind of function and the workhorse of cardinality arguments.

Core Definitions

Definition

Set $A, B, S$

A set is a collection of distinct objects. We write $x \in A$ to mean $x$ is an element of $A$ . The empty set is $\emptyset$ . Standard constructions: union $A \cup B$ , intersection $A \cap B$ , complement $A^c$ , difference $A \setminus B$ .
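For finite sets, these constructions can be tried directly with Python's built-in `set` type. The sketch below is only an illustration; the ambient universe `U` is an assumption introduced here, since a complement is only defined relative to some surrounding set.

```python
# Finite sets; U is an assumed ambient universe so that A^c makes sense.
U = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B          # A ∪ B
intersection = A & B   # A ∩ B
difference = A - B     # A \ B
complement = U - A     # A^c, taken relative to U

print(union, intersection, difference, complement)
print(3 in A)          # membership test: 3 ∈ A
print(set() <= A)      # the empty set is a subset of every set
```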

Definition

Cartesian Product $A \times B$

The Cartesian product of sets $A$ and $B$ is the set of all ordered pairs:

$A \times B = \{(a, b) : a \in A, \, b \in B\}$

More generally, $A_1 \times \cdots \times A_n$ is the set of $n$ -tuples. The space $\mathbb{R}^d$ is $\mathbb{R} \times \cdots \times \mathbb{R}$ ( $d$ times).
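A small sketch of the same construction for finite sets, using `itertools.product` from the Python standard library. The particular sets `A` and `B` are made up for illustration.

```python
from itertools import product

A = {1, 2}
B = {"x", "y", "z"}

# A × B as a set of ordered pairs (Python tuples).
AxB = set(product(A, B))

# The n-fold product of a set with itself: here {0, 1}^3.
cube = set(product({0, 1}, repeat=3))

# For finite sets, |A × B| = |A| * |B| and |{0,1}^3| = 2^3.
print(len(AxB), len(cube))
```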

Definition

Function $f : A \to B$

A function $f: A \to B$ assigns to each element $a \in A$ exactly one element $f(a) \in B$ . The set $A$ is the domain, $B$ is the codomain. The image of $S \subseteq A$ is $f(S) = \{f(a) : a \in S\}$ . The preimage of $T \subseteq B$ is $f^{-1}(T) = \{a \in A : f(a) \in T\}$ .
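Image and preimage can be computed by brute force when the domain is finite. The helper names `image` and `preimage` and the example domain are hypothetical, chosen only for this sketch.

```python
def image(f, S):
    """f(S) = {f(a) : a in S}."""
    return {f(a) for a in S}

def preimage(f, domain, T):
    """f^{-1}(T) = {a in domain : f(a) in T}; the domain must be given explicitly."""
    return {a for a in domain if f(a) in T}

def square(x):
    return x * x

A = set(range(-3, 4))  # assumed finite domain {-3, ..., 3}

img = image(square, {-2, 1, 2})     # two inputs collapse to the same output
pre = preimage(square, A, {4, 9})   # preimage of a set, no inverse function needed
empty = preimage(square, A, {5})    # a preimage may be empty
print(img, pre, empty)
```

Note that `preimage` never inverts `square`; it only filters the domain, which is why it is defined for every function.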

Definition

Injective, Surjective, Bijective

A function $f: A \to B$ is:

Injective (one-to-one): $f(a_1) = f(a_2) \implies a_1 = a_2$
Surjective (onto): for every $b \in B$ , there exists $a \in A$ with $f(a) = b$
Bijective: both injective and surjective. A bijection has an inverse $f^{-1}: B \to A$ .
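For finite domains and codomains, the three properties can be checked exhaustively. This is a minimal sketch; the function names and the three example maps are assumptions made for illustration, and `is_surjective` assumes `f` really maps `A` into `B`.

```python
def is_injective(f, A):
    """No two elements of A share an output."""
    values = [f(a) for a in A]
    return len(values) == len(set(values))

def is_surjective(f, A, B):
    """Every element of B is hit (assumes f maps A into B)."""
    return {f(a) for a in A} == set(B)

def is_bijective(f, A, B):
    return is_injective(f, A) and is_surjective(f, A, B)

def double(a):
    return 2 * a

def square(a):
    return a * a

def shift(a):
    return (a + 1) % 3

inj_only = (is_injective(double, [0, 1, 2]), is_surjective(double, [0, 1, 2], range(5)))
surj_only = (is_injective(square, [-1, 0, 1]), is_surjective(square, [-1, 0, 1], [0, 1]))
both = is_bijective(shift, [0, 1, 2], [0, 1, 2])
print(inj_only, surj_only, both)
```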

Definition

Equivalence Relation $\sim$

A relation $\sim$ on a set $A$ is an equivalence relation if and only if it is:

Reflexive: $a \sim a$ for all $a \in A$
Symmetric: $a \sim b \implies b \sim a$
Transitive: $a \sim b$ and $b \sim c \implies a \sim c$

The equivalence class of $a$ is $[a] = \{b \in A : b \sim a\}$ .

Definition

Quotient Set $A / \sim$

The quotient set $A / {\sim}$ is the set of all equivalence classes $\{[a] : a \in A\}$ . The equivalence classes partition $A$ into disjoint, exhaustive subsets.
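As a concrete finite case, the sketch below builds the quotient of $\{0, \dots, 9\}$ by congruence modulo 3. The helper `quotient` is hypothetical and assumes its `related` argument really is an equivalence relation; it does not verify this.

```python
def quotient(A, related):
    """Return A/~ as a set of frozensets, assuming `related` is an equivalence relation."""
    classes = []
    for a in A:
        for cls in classes:
            # Comparing with one representative suffices by symmetry and transitivity.
            if related(a, next(iter(cls))):
                cls.add(a)
                break
        else:
            classes.append({a})  # a starts a new equivalence class
    return {frozenset(cls) for cls in classes}

def congruent_mod_3(a, b):
    return (a - b) % 3 == 0

A = set(range(10))
Q = quotient(A, congruent_mod_3)

# The classes are pairwise disjoint and cover A, so their sizes sum to |A|.
print(len(Q), sum(len(cls) for cls in Q))
```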

Power Sets and Cardinality

Definition

Power Set $\mathcal{P}(A)$

The power set of $A$ , written $\mathcal{P}(A)$ or $2^A$ , is the set of all subsets of $A$ :

$\mathcal{P}(A) = \{S : S \subseteq A\}$

If $|A| = n$, then $|\mathcal{P}(A)| = 2^n$. The power set always includes $\emptyset$ and $A$ itself. Power sets appear in measure theory when defining sigma-algebras: a sigma-algebra on $A$ is a subset of $\mathcal{P}(A)$ that contains $A$ and is closed under complements and countable unions.
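The count $2^n$ can be checked on a small example. The sketch below enumerates subsets by size with `itertools.combinations`; the helper name `power_set` and the three-letter set are assumptions for illustration.

```python
from itertools import chain, combinations

def power_set(A):
    """All subsets of a finite set A, returned as frozensets."""
    items = list(A)
    by_size = (combinations(items, k) for k in range(len(items) + 1))
    return {frozenset(s) for s in chain.from_iterable(by_size)}

A = {"a", "b", "c"}
P = power_set(A)

print(len(P) == 2 ** len(A))                 # |P(A)| = 2^|A|
print(frozenset() in P, frozenset(A) in P)   # contains both ∅ and A
```

Subsets are stored as `frozenset` because ordinary Python sets are unhashable and cannot be elements of another set.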

Definition

Cardinality $|A|$

Two sets $A$ and $B$ have the same cardinality (written $|A| = |B|$) if and only if there exists a bijection $f: A \to B$. For finite sets, cardinality is the count of elements. For infinite sets, cardinality distinguishes countably infinite sets (those in bijection with $\mathbb{N}$) from uncountable ones (like $\mathbb{R}$). See cardinality and countability for a full treatment.

Function Composition and Inverses

Definition

Function Composition $g \circ f$

Given $f: A \to B$ and $g: B \to C$ , the composition $g \circ f: A \to C$ is defined by $(g \circ f)(a) = g(f(a))$ . Composition is associative: $h \circ (g \circ f) = (h \circ g) \circ f$ . It is not commutative in general.

Key facts about composition and injectivity/surjectivity:

If $g \circ f$ is injective, then $f$ is injective (but $g$ need not be).
If $g \circ f$ is surjective, then $g$ is surjective (but $f$ need not be).
If $f$ and $g$ are both bijective, then $g \circ f$ is bijective and $(g \circ f)^{-1} = f^{-1} \circ g^{-1}$ .

These facts are used constantly in linear algebra when reasoning about compositions of linear maps and matrix products.
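The "need not be" clauses in the first two facts can be seen in one small example. In this sketch the maps `f` and `g` are made up: `f` includes $\{0, 1\}$ into $\{0, 1, 2\}$ and `g` collapses 1 and 2, so that $g \circ f$ is the identity on $\{0, 1\}$.

```python
def compose(g, f):
    """Return the composition g ∘ f."""
    return lambda a: g(f(a))

A, B, C = [0, 1], [0, 1, 2], [0, 1]

def f(a):
    return a            # inclusion A -> B: injective, misses 2

def g(b):
    return min(b, 1)    # B -> C: surjective, sends 1 and 2 to the same value

gf = compose(g, f)

gf_values = [gf(a) for a in A]
f_values = [f(a) for a in A]
g_values = [g(b) for b in B]

gf_bijective = len(set(gf_values)) == len(A) and set(gf_values) == set(C)
f_surjective = set(f_values) == set(B)
g_injective = len(set(g_values)) == len(B)
print(gf_bijective, f_surjective, g_injective)
```

Here $g \circ f$ is bijective even though neither `f` nor `g` is.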

Definition

Left and Right Inverses

A left inverse of $f: A \to B$ is a function $g: B \to A$ with $g \circ f = \text{id}_A$; a right inverse is a function $h: B \to A$ with $f \circ h = \text{id}_B$. If $A$ is nonempty, $f$ has a left inverse if and only if it is injective, and no form of the axiom of choice is needed. A function has a right inverse if and only if it is surjective; the direction from surjective to right inverse requires the axiom of choice in general, because $h$ must pick one preimage for every $b \in B$.
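For finite sets both kinds of inverse can be built explicitly. The sketch below uses hypothetical helpers `left_inverse` and `right_inverse`; the first assumes `f` is injective on a nonempty `A`, the second assumes `f` maps `A` onto the set of values it is later queried on. Neither assumption is checked.

```python
def left_inverse(f, A, default):
    """Return g with g(f(a)) = a, assuming f is injective on A.
    Points outside the image of f are sent to `default`, any fixed element of A."""
    table = {f(a): a for a in A}
    return lambda b: table.get(b, default)

def right_inverse(f, A):
    """Return h with f(h(b)) = b for every b in f(A).
    Keeps the first preimage found for each value; other choices work equally well."""
    table = {}
    for a in A:
        table.setdefault(f(a), a)
    return lambda b: table[b]

def double(a):
    return 2 * a        # injective map {0, 1, 2} -> {0, ..., 5}

def square(a):
    return a * a        # surjective map {-2, ..., 2} -> {0, 1, 4}

g = left_inverse(double, [0, 1, 2], default=0)
h = right_inverse(square, [-2, -1, 0, 1, 2])

left_ok = all(g(double(a)) == a for a in [0, 1, 2])
right_ok = all(square(h(b)) == b for b in [0, 1, 4])
print(left_ok, right_ok)
```

The right inverse is not unique: `h(4)` could be either `-2` or `2`, which is exactly the kind of choice the infinite case needs an axiom for.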

Relations and Partial Orders

Definition

Relation $R \subseteq A \times B$

A relation from $A$ to $B$ is a subset $R \subseteq A \times B$. We write $a \mathrel{R} b$ to mean $(a, b) \in R$. A function $f: A \to B$ is a special case: a relation in which each $a \in A$ appears as the first component of exactly one pair.

Definition

Partial Order $\leq$

A relation $\leq$ on a set $A$ is a partial order if and only if it is:

Reflexive: $a \leq a$ for all $a \in A$
Antisymmetric: $a \leq b$ and $b \leq a$ implies $a = b$
Transitive: $a \leq b$ and $b \leq c$ implies $a \leq c$

A set with a partial order is called a poset (partially ordered set). A total order is a partial order where every pair is comparable: for all $a, b$ , either $a \leq b$ or $b \leq a$ .

Partial orders appear throughout ML theory. The subset relation $\subseteq$ on $\mathcal{P}(A)$ is a partial order. The divisibility relation on $\mathbb{N}$ is a partial order. Lattices (posets where every pair has a least upper bound and greatest lower bound) appear in order-theoretic constructions throughout algebra, logic, and information theory.
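The three axioms can be verified by brute force on a finite set. The following sketch checks divisibility on $\{1, \dots, 12\}$; the helper names are assumptions, and the check is cubic in the size of the set, so it is only meant for small examples.

```python
from itertools import product

def is_partial_order(A, leq):
    """Check reflexivity, antisymmetry, and transitivity on a finite set A."""
    reflexive = all(leq(a, a) for a in A)
    antisymmetric = all(a == b for a, b in product(A, repeat=2)
                        if leq(a, b) and leq(b, a))
    transitive = all(leq(a, c) for a, b, c in product(A, repeat=3)
                     if leq(a, b) and leq(b, c))
    return reflexive and antisymmetric and transitive

def is_total(A, leq):
    """Every pair is comparable."""
    return all(leq(a, b) or leq(b, a) for a, b in product(A, repeat=2))

def divides(a, b):
    return b % a == 0

def usual_leq(a, b):
    return a <= b

A = range(1, 13)
print(is_partial_order(A, divides), is_total(A, divides))      # 2 and 3 are incomparable
print(is_partial_order(A, usual_leq), is_total(A, usual_leq))
```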

Canonical Examples

Example

Injective, surjective, and bijective functions

Consider $f: \mathbb{R} \to \mathbb{R}$ defined by $f(x) = x^2$ .

Not injective: $f(2) = f(-2) = 4$ .
Not surjective: $-1$ has no preimage.

Restrict the domain: $g: [0, \infty) \to [0, \infty)$ defined by $g(x) = x^2$ . Now $g$ is bijective. The inverse is $g^{-1}(y) = \sqrt{y}$ .

This pattern (restricting domain and codomain to make a function bijective) is used when defining the inverse of activation functions and when constructing normalizing flows in generative models.

Example

Equivalence relations in ML

Define a relation on classifiers: $h_1 \sim h_2$ if $h_1(x) = h_2(x)$ for all $x$ in the training set. This is an equivalence relation (reflexive, symmetric, transitive). The equivalence classes partition the hypothesis class into groups of classifiers that agree on training data but may differ on unseen data. For binary classifiers, the growth function records the maximum number of distinct equivalence classes over training sets of size $n$, and the VC dimension is the largest $n$ for which that maximum equals $2^n$.
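To make the partition concrete, the sketch below groups a handful of one-dimensional threshold classifiers by the labels they assign to a three-point training set. The classifier family, thresholds, and training points are all invented for illustration.

```python
from collections import defaultdict

train_x = [1.0, 2.0, 3.0]                         # hypothetical training inputs
thresholds = [0.0, 0.5, 1.5, 1.7, 2.5, 3.5, 4.0]  # one classifier per threshold

def predict(t, x):
    """Threshold classifier h_t(x) = 1 if x >= t, else 0."""
    return int(x >= t)

# Two classifiers are equivalent exactly when they produce the same labeling.
classes = defaultdict(list)
for t in thresholds:
    labeling = tuple(predict(t, x) for x in train_x)
    classes[labeling].append(t)

for labeling, members in classes.items():
    print(labeling, members)
```

Seven classifiers fall into four equivalence classes here, well short of the $2^3 = 8$ possible labelings of three points.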

Example

Cartesian products in feature spaces

If a model takes both an image feature vector $x \in \mathbb{R}^{512}$ and a text feature vector $y \in \mathbb{R}^{768}$, the combined feature space is $\mathbb{R}^{512} \times \mathbb{R}^{768}$, which is identified with $\mathbb{R}^{1280}$ by concatenating coordinates. The Cartesian product formalizes feature concatenation. More generally, a dataset of $n$ examples with $d$ features lives in $(\mathbb{R}^d)^n$, the $n$-fold Cartesian product.

Main Theorems

Theorem

Cantor's Theorem

Statement

There is no surjection from $A$ onto its power set $\mathcal{P}(A)$ . In particular, $|A| < |\mathcal{P}(A)|$ for every set $A$ .

Intuition

No matter how large a set is, its collection of subsets is strictly larger. This is why $\mathbb{R}$ is uncountable: $|\mathbb{R}| = |\mathcal{P}(\mathbb{N})|$ and Cantor's theorem gives $|\mathbb{N}| < |\mathcal{P}(\mathbb{N})|$ .

Proof Sketch

Suppose $f: A \to \mathcal{P}(A)$ is surjective. Define $D = \{a \in A : a \notin f(a)\}$ . Since $f$ is surjective, $D = f(d)$ for some $d$ . Then $d \in D \iff d \notin f(d) = D$ , a contradiction.
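The diagonal construction can be checked exhaustively on a tiny set. The sketch below enumerates every function $f: A \to \mathcal{P}(A)$ for a three-element $A$ (there are $8^3 = 512$) and confirms that the diagonal set $D$ is never in the image. This is a sanity check of the argument in the finite case, not a proof.

```python
from itertools import product

A = [0, 1, 2]

# All 8 subsets of A, as frozensets, built from membership masks.
subsets = [frozenset(a for a, keep in zip(A, mask) if keep)
           for mask in product([False, True], repeat=len(A))]

checked = 0
for choice in product(subsets, repeat=len(A)):
    f = dict(zip(A, choice))                          # f(a) = choice for a
    D = frozenset(a for a in A if a not in f[a])      # the diagonal set
    assert D not in choice                            # D is missed by f
    checked += 1

print(checked)
```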

Why It Matters

Cantor's theorem establishes that infinite sets come in different sizes. This underpins the distinction between countable and uncountable sets, which matters when defining probability measures and hypothesis classes.

Failure Mode

The proof is constructive and has no hidden assumptions. It works for all sets, finite or infinite. The only subtlety: it requires the axiom schema of specification from ZFC to form the set $D$ .

Theorem

Schroeder-Bernstein Theorem

Statement

If there exist injections $f: A \to B$ and $g: B \to A$ , then there exists a bijection $h: A \to B$ . Equivalently, $|A| \leq |B|$ and $|B| \leq |A|$ implies $|A| = |B|$ .

Intuition

If each set can be embedded into the other, they have the same size. This seems obvious for finite sets but is nontrivial for infinite sets. The proof constructs the bijection explicitly by partitioning $A$ into elements that "originate" from $A$ (use $f$ ) and elements that "originate" from $B$ (use $g^{-1}$ ).

Proof Sketch

Define chains by iterating $g \circ f$ starting from elements of $A \setminus g(B)$ . An element $a \in A$ is "A-originated" if it lies in such a chain and "B-originated" otherwise. Define $h(a) = f(a)$ if $a$ is A-originated and $h(a) = g^{-1}(a)$ if $a$ is B-originated (here $g^{-1}$ is well-defined on B-originated elements because they are in the image of $g$ ). Verify $h$ is a bijection.
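The construction can be traced on a simple example. The sketch below takes $A = B = \mathbb{N}$ with the injections $f(n) = n + 1$ and $g(n) = n + 1$, both chosen here only for illustration. An element is classified by tracing it backwards through $g^{-1}$ and $f^{-1}$ alternately; in this example every trace terminates, which is not true in general (elements with non-terminating traces fall under "B-originated").

```python
def f(a):
    return a + 1                        # injection A -> B

def g(b):
    return b + 1                        # injection B -> A

def f_inv(b):
    return b - 1 if b >= 1 else None    # None means b is not in f(A)

def g_inv(a):
    return a - 1 if a >= 1 else None    # None means a is not in g(B)

def is_A_originated(a):
    """Trace a backwards; A-originated iff the trace stops in A minus g(B)."""
    while True:
        b = g_inv(a)
        if b is None:
            return True      # reached an element of A outside g(B)
        a = f_inv(b)
        if a is None:
            return False     # reached an element of B outside f(A)

def h(a):
    """The bijection from the proof sketch: f on A-originated points, g^{-1} elsewhere."""
    return f(a) if is_A_originated(a) else g_inv(a)

print([h(a) for a in range(10)])
```

In this example the A-originated elements are the even numbers, and `h` swaps each even number with its successor.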

Why It Matters

This theorem is the standard tool for proving two infinite sets have equal cardinality. Instead of constructing a bijection directly (often hard), you construct two injections (often easier). It is used in the proof that $|\mathbb{R}| = |\mathcal{P}(\mathbb{N})|$ and in arguments about the cardinality of hypothesis classes.

Failure Mode

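The theorem does not extend to arbitrary preorders: $A \leq B$ and $B \leq A$ need not imply that $A$ and $B$ are equal, or even isomorphic, when $\leq$ means "embeds into" for structured objects. For example, the intervals $(0, 1)$ and $[0, 1]$ each embed continuously into the other but are not homeomorphic. The result is specific to cardinality comparisons. Note also that the proof does not require the axiom of choice.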

Common Confusions

Watch Out

Preimage is not the inverse function

The notation $f^{-1}(T)$ for preimage does not require $f$ to be invertible. Every function has preimages of sets. Only bijections have inverse functions.

Watch Out

Surjective onto the codomain, not the image

Every function is surjective onto its image. Surjectivity is a statement about the codomain. $f: \mathbb{R} \to \mathbb{R}$ defined by $f(x) = x^2$ is not surjective because $-1$ has no preimage, even though $f$ maps onto $[0, \infty)$ .

Watch Out

Equivalence relations are not partial orders

Both equivalence relations and partial orders are reflexive and transitive. The difference is the second axiom: equivalence relations are symmetric ( $a \sim b \implies b \sim a$ ), while partial orders are antisymmetric ( $a \leq b$ and $b \leq a$ implies $a = b$ ). These are mutually exclusive except for the identity relation. In probability, "same distribution" is an equivalence relation; "stochastically dominates" is a partial order.

Watch Out

A function is not its graph

Formally, a function $f: A \to B$ is often defined as its graph $\{(a, f(a)) : a \in A\} \subseteq A \times B$ together with the requirement that each $a$ appears in exactly one pair. But in practice, specifying domain and codomain matters. The function $f: \mathbb{R} \to \mathbb{R}$ given by $f(x) = x^2$ and the function $g: \mathbb{R} \to [0, \infty)$ given by $g(x) = x^2$ have the same graph but different properties: $g$ is surjective, $f$ is not.

Exercises

Exercise (Core)

Problem

Let $f: A \to B$ and $g: B \to C$ . Prove that if $g \circ f$ is injective, then $f$ is injective.

Exercise (Advanced)

Problem

Let $\sim$ be an equivalence relation on $A$ and let $\pi: A \to A/{\sim}$ be the canonical projection $\pi(a) = [a]$ . Prove that $\pi$ is surjective. Under what condition is $\pi$ also injective?

Exercise (Core)

Problem

Let $A = \{1, 2, 3\}$ . List all elements of $\mathcal{P}(A)$ . Verify that $|\mathcal{P}(A)| = 2^{|A|}$ .

Exercise (Core)

Problem

Take $\mathbb{N} = \{0, 1, 2, \dots\}$. Define $f: \mathbb{Z} \to \mathbb{N}$ by $f(n) = 2n$ if $n \geq 0$ and $f(n) = -2n - 1$ if $n < 0$. Prove that $f$ is a bijection, establishing $|\mathbb{Z}| = |\mathbb{N}|$.

Exercise (Advanced)

Problem

Let $f: A \to B$ be a function. Prove that for any subsets $S, T \subseteq A$ :

(1) $f(S \cup T) = f(S) \cup f(T)$
(2) $f(S \cap T) \subseteq f(S) \cap f(T)$

Give an example showing that equality need not hold in (2). Under what condition on $f$ does equality hold?

References

Canonical:

Halmos, Naive Set Theory (1960), Chapters 1-8. The standard concise introduction.
Enderton, Elements of Set Theory (1977), Chapters 1-3. More detailed than Halmos.
Munkres, Topology (2000), Chapter 1. Sets, functions, and relations at the level needed for analysis.

For ML context:

Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Appendix A.
Axler, Linear Algebra Done Right (2024), Chapter 1A. Uses set-theoretic language for vector spaces.
Billingsley, Probability and Measure (1995), Chapter 1. Sets and sigma-algebras for measure theory.

Last reviewed: April 14, 2026
