Relational Algebra

Sneiderman, Robby

Foundations

Relational Algebra

The mathematical foundation of SQL and relational databases. Selection, projection, join, set operations, and Codd's theorem connecting algebra to relational calculus.

CoreTier 2StableReference~40 min

Prerequisites

Sets Functions and Relations Basic Logic and Proof Techniques

Prereq Map

Learning position

Read this page in the graph.

foundations | layer 1 | tier 2. This page has 2 direct prerequisites and 0 published dependents.

Open Atlas Prerequisites Leads to

What next

Take the diagnostic

No published continuation is declared yet, so the diagnostic is the clean next route.

Evidence badge

Claim status

This page has no public Lean mapping yet. Use the evidence page to inspect how claim status labels work.

Show the backing system

AtlasOpen the full prerequisite graph and run grounding traces.EvidenceInspect source support, claim labels, and public trust status.LeanReview the checked declaration list, scopes, and axiom profile.

Why This Matters

Every SQL query you write is compiled into a relational algebra expression before execution. The query optimizer rewrites that expression using algebraic identities (pushing selections below joins, reordering joins by estimated cardinality) to find an efficient execution plan. Understanding relational algebra is understanding what the database engine actually computes.

For ML practitioners, relational algebra matters for two reasons. First, feature engineering pipelines are sequences of relational operations: filtering rows, selecting columns, joining tables, grouping and aggregating. Knowing the algebra clarifies what these operations do and how to compose them efficiently. Second, the theoretical result connecting relational algebra to relational calculus (Codd's theorem) is a clean example of an equivalence between procedural and declarative specifications, a pattern that recurs in programming language theory and logic.

Formal Setup

Definition

Relation $R \subseteq D_{1} \times D_{2} \times \dots \times D_{n}$

A relation $R$ with schema $(A_1: D_1, A_2: D_2, \ldots, A_n: D_n)$ is a finite subset of the Cartesian product $D_1 \times D_2 \times \cdots \times D_n$ . Each $A_i$ is an attribute (column name) and $D_i$ is its domain (the set of permitted values). Each element $t \in R$ is a tuple (row). The arity of $R$ is $n$ (number of attributes). The cardinality $|R|$ is the number of tuples.

Two properties distinguish relations from arbitrary tables: (1) tuples are unordered (a relation is a set, not a list), and (2) all tuples are distinct (no duplicate rows). Real SQL databases relax both properties (rows have order, duplicates are allowed), which is why SQL operates on multisets (bags), not sets.

Definition

Relation Schema and Instance

A relation schema $S = (A_1: D_1, \ldots, A_n: D_n)$ specifies attribute names and domains. A relation instance $R$ over schema $S$ is a finite set of tuples conforming to $S$ . A database schema is a collection of relation schemas. A database instance is a collection of relation instances, one for each schema.

The Five Fundamental Operations

Codd (1970) showed that five operations suffice to express all queries computable in the relational model. Every derived operation (join, intersection, division) can be written in terms of these five.

Definition

Selection $σ_{θ} (R)$

The selection operator filters tuples by a predicate:

$\sigma_\theta(R) = \{t \in R : \theta(t) = \text{true}\}$

where $\theta$ is a Boolean formula over the attributes of $R$ , using comparisons ( $=$ , $\neq$ , $<$ , $>$ , $\leq$ , $\geq$ ) and logical connectives ( $\land$ , $\lor$ , $\lnot$ ).

SQL equivalent: WHERE clause. Example: $\sigma_{\text{age} > 30}(\text{Employees})$ returns all employees older than 30.

Definition

Projection $π_{A_{1}, \dots, A_{k}} (R)$

The projection operator selects a subset of attributes and removes duplicates:

$\pi_{A_1, \ldots, A_k}(R) = \{(t[A_1], \ldots, t[A_k]) : t \in R\}$

where $t[A_i]$ denotes the value of attribute $A_i$ in tuple $t$ . Since the result is a set, duplicate tuples are eliminated.

SQL equivalent: SELECT DISTINCT A1, ..., Ak. Example: $\pi_{\text{name, dept}}(\text{Employees})$ returns the set of (name, department) pairs.

Definition

Union $R \cup S$

The union of two union-compatible relations (same schema) is:

$R \cup S = \{t : t \in R \text{ or } t \in S\}$

Union-compatibility requires that $R$ and $S$ have the same number of attributes with matching domains. SQL equivalent: UNION.

Definition

Set Difference $R - S$

The set difference of two union-compatible relations:

$R - S = \{t : t \in R \text{ and } t \notin S\}$

SQL equivalent: EXCEPT. This is the only fundamental operation that is not monotone: adding tuples to $S$ can remove tuples from the result.

Definition

Cartesian Product $R \times S$

The Cartesian product concatenates every tuple in $R$ with every tuple in $S$ :

$R \times S = \{(t_R, t_S) : t_R \in R, t_S \in S\}$

If $R$ has $n$ attributes and $|R|$ tuples, and $S$ has $m$ attributes and $|S|$ tuples, then $R \times S$ has $n + m$ attributes and $|R| \times |S|$ tuples. SQL equivalent: CROSS JOIN or listing multiple tables in FROM without a join condition.

Derived Operations

The following operations are defined in terms of the five fundamentals but are used so frequently that they have dedicated notation.

Definition

Natural Join $R ⋈ S$

The natural join combines tuples from $R$ and $S$ that agree on all common attributes:

$R \bowtie S = \pi_L(\sigma_{R.A_1 = S.A_1 \land \cdots \land R.A_k = S.A_k}(R \times S))$

where $A_1, \ldots, A_k$ are the attributes common to both schemas, and $L$ is the union of all attributes (without duplicating the common ones).

SQL equivalent: NATURAL JOIN or JOIN ... ON R.A = S.A.

Definition

Theta Join $R ⋈_{θ} S$

The theta join is a Cartesian product followed by selection:

$R \bowtie_\theta S = \sigma_\theta(R \times S)$

When $\theta$ uses only equality comparisons, this is an equijoin. The natural join is an equijoin on all shared attributes, followed by duplicate attribute removal.

Definition

Intersection $R \cap S$

$R \cap S = R - (R - S)$

Equivalently, $R \cap S = \{t : t \in R \text{ and } t \in S\}$ . SQL equivalent: INTERSECT.

Definition

Division $R \div S$

The division $R \div S$ returns tuples in the projection of $R$ that are associated with every tuple in $S$ . If $R$ has schema $(A, B)$ and $S$ has schema $(B)$ :

$R \div S = \{t[A] : t \in R\} - \pi_A((\{t[A] : t \in R\} \times S) - R)$

Division answers "for all" queries: "find suppliers who supply ALL parts in $S$ ." SQL has no direct DIVIDE BY operator; this requires a double negation pattern with NOT EXISTS.

Algebraic Identities and Query Optimization

The query optimizer uses algebraic equivalences to rewrite expressions into more efficient forms. The most important identities:

Selection pushdown: $\sigma_{\theta_1 \land \theta_2}(R) = \sigma_{\theta_1}(\sigma_{\theta_2}(R))$ $\sigma_\theta(R \bowtie S) = \sigma_\theta(R) \bowtie S \quad \text{(when } \theta \text{ involves only attributes of } R\text{)}$

Pushing selection below a join is the single most important query optimization. If $|R| = 10^6$ and the selection keeps 1% of rows, computing the join on $10^4$ rows instead of $10^6$ is a 100x speedup.

Projection pushdown: $\pi_L(R \bowtie S) = \pi_L(\pi_{L_1}(R) \bowtie \pi_{L_2}(S))$

where $L_1$ and $L_2$ include the join attributes plus the attributes in $L$ from each relation.

Join commutativity and associativity: $R \bowtie S = S \bowtie R$ $(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$

Join reordering is a combinatorial optimization problem. For $n$ tables, there are $\frac{(2(n-1))!}{(n-1)!}$ possible join orderings (Catalan numbers). The optimizer uses dynamic programming (System R algorithm) or heuristics to find a good order.

Codd's Theorem

Theorem

Codd's Theorem: Equivalence of Relational Algebra and Relational Calculus

Statement

The relational algebra and the domain-independent relational calculus have exactly the same expressive power. For every relational algebra expression $E$ , there exists a domain-independent relational calculus formula $\phi$ such that $E$ and $\phi$ define the same query (produce the same result on every database instance). Conversely, for every domain-independent relational calculus formula $\phi$ , there exists a relational algebra expression $E$ defining the same query.

Intuition

Relational algebra is procedural: it specifies a sequence of operations to compute the result. Relational calculus is declarative: it specifies a logical condition the result must satisfy. Codd's theorem says these two approaches are equivalent. Any query you can describe ("give me all employees who work in a department with budget over 1 million USD") can be computed by a sequence of selections, projections, joins, unions, and differences, and vice versa.

This is why SQL works: SQL's SELECT ... FROM ... WHERE syntax is based on relational calculus (declarative), but the database can compile it into relational algebra (procedural) without losing or gaining expressive power.

Proof Sketch

Algebra to Calculus (easier direction): Induction on the structure of algebra expressions. Each fundamental operation has a direct calculus translation:

$\sigma_\theta(R)$ becomes $\{t : R(t) \land \theta(t)\}$
$\pi_{A_1, \ldots, A_k}(R)$ becomes $\{(x_1, \ldots, x_k) : \exists y_1, \ldots, y_m \, R(x_1, \ldots, x_k, y_1, \ldots, y_m)\}$
$R \cup S$ becomes $\{t : R(t) \lor S(t)\}$
$R - S$ becomes $\{t : R(t) \land \lnot S(t)\}$
$R \times S$ becomes $\{(t_R, t_S) : R(t_R) \land S(t_S)\}$

Calculus to Algebra (harder direction): Induction on formula structure, translating each logical connective and quantifier into algebraic operations. The existential quantifier $\exists$ maps to projection. Universal quantification $\forall x \, \phi(x)$ is rewritten as $\lnot \exists x \, \lnot \phi(x)$ , which maps to set difference. The critical restriction to domain-independent formulas ensures the result is always finite.

Why It Matters

Codd's theorem is the theoretical foundation of the entire relational database industry. It guarantees that SQL (based on calculus) and the query execution engine (based on algebra) are equivalent. The optimizer can freely rewrite between algebraic forms because the algebra is closed under equivalence-preserving transformations. Without this theorem, there is no guarantee that optimized queries return the same results as the original.

The theorem also defines the boundary of relational expressiveness. Queries that require recursion (transitive closure, reachability in graphs) are outside relational algebra and relational calculus. This motivated extensions like Datalog and SQL's recursive CTEs.

Failure Mode

The equivalence requires domain independence: calculus formulas must produce finite results regardless of the domain of interpretation. The formula $\{x : \lnot R(x)\}$ is domain-dependent because its result depends on what values exist outside $R$ . If the domain is all integers, the result is infinite. Domain independence is undecidable in general, so practical systems restrict the syntax (safe-range calculus) to guarantee it.

The equivalence also assumes set semantics (no duplicates). SQL operates on bags (multisets), which have different algebraic laws. Bag relational algebra is strictly more expressive than set relational algebra for certain aggregate queries.

report a correction →

Beyond Relational Algebra

Datalog and recursion. Relational algebra cannot express recursive queries. The query "find all nodes reachable from node $s$ in a graph" requires transitive closure, which needs unbounded iteration. Datalog extends relational calculus with recursive rules. SQL added recursive common table expressions (WITH RECURSIVE) to handle this.

Aggregation. The standard five operations do not include aggregation ( $\text{SUM}$ , $\text{COUNT}$ , $\text{AVG}$ ). Aggregation is added as a separate operator $\gamma$ (gamma) in extended relational algebra: $\gamma_{G, \text{agg}(A)}(R)$ groups by attributes $G$ and applies aggregate function agg to attribute $A$ .

Relevance to ML pipelines. Feature engineering for tabular ML is a sequence of relational operations. Joining a customer table with a transactions table, filtering by date, grouping by customer and aggregating transaction counts: this is pure relational algebra. DataFrame libraries (pandas, Polars, Spark DataFrames) implement relational algebra over bags, with aggregation and window functions. Understanding the algebra helps reason about correctness (does this join produce duplicates?) and efficiency (should this filter go before or after the join?).

Common Confusions

Watch Out

Natural join is not Cartesian product

The Cartesian product $R \times S$ pairs every tuple in $R$ with every tuple in $S$ . The natural join $R \bowtie S$ only keeps pairs that agree on shared attributes. If $R$ and $S$ share no attributes, the natural join degenerates to the Cartesian product. If they share all attributes, the natural join degenerates to intersection. The common case is between these extremes.

Watch Out

Projection removes duplicates (in set semantics)

In relational algebra (set semantics), $\pi$ always eliminates duplicate tuples. In SQL, SELECT does not remove duplicates; you need SELECT DISTINCT. This is because SQL uses bag semantics, where duplicates carry information (they represent multiplicity). The mismatch between set algebra and bag SQL is a common source of bugs.

Watch Out

Division is not the inverse of Cartesian product

$R \div S$ does not undo $R \times S$ . Given $R \subseteq A \times B$ and $S \subseteq B$ , the division $R \div S$ returns the set of $a$ values that are paired with ALL $b$ values in $S$ . It answers universal quantification queries, not inverse-product queries.

Summary

A relation is a finite set of tuples over a fixed schema
Five fundamental operations (selection, projection, union, set difference, Cartesian product) generate all relational queries
Join, intersection, and division are derived from the five fundamentals
Selection pushdown below joins is the single query optimization with the largest practical effect on runtime
Codd's theorem: relational algebra = domain-independent relational calculus in expressive power
Relational algebra cannot express recursion or aggregation; extensions are needed
SQL operates on bags (multisets), not sets; the algebraic laws differ

Exercises

ExerciseCore

Problem

Given relations $R(A, B) = \{(1, 2), (1, 3), (2, 3)\}$ and $S(B, C) = \{(2, x), (3, y)\}$ , compute: (a) $R \bowtie S$ (natural join on $B$ ), (b) $\pi_A(R \bowtie S)$ , and (c) $R \div \pi_B(S)$ .

ExerciseCore

Problem

Write a relational algebra expression for the SQL query:

SELECT DISTINCT E.name
FROM Employees E, Departments D
WHERE E.dept_id = D.dept_id AND D.budget > 1000000

ExerciseAdvanced

Problem

Prove that intersection can be expressed using only union and set difference. Then prove that the five fundamental operations are not all independent: show that intersection can be derived from the other four, but set difference cannot be derived from selection, projection, union, Cartesian product, and intersection.

References

Canonical:

Codd, "A Relational Model of Data for Large Shared Data Banks" (1970), the founding paper of relational databases
Abiteboul, Hull, Vianu, Foundations of Databases (1995), Chapters 4-5, rigorous treatment of relational algebra and calculus
Ullman, Principles of Database and Knowledge-Base Systems, Vol. 1 (1988), Chapters 3-5

Textbooks:

Ramakrishnan & Gehrke, Database Management Systems (3rd ed., 2003), Chapters 4-5
Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book (2nd ed., 2008), Chapters 2, 5
Silberschatz, Korth, Sudarshan, Database System Concepts (7th ed., 2019), Chapters 2, 6

Query Optimization:

Selinger et al., "Access Path Selection in a Relational Database Management System" (1979), the System R optimizer paper

Next Topics

Building on relational foundations:

Sets, functions, and relations: the set-theoretic foundations that relational algebra rests on
Basic logic and proof techniques: the logical formalism behind relational calculus

Last reviewed: April 15, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Full prerequisite chain All derived topics

Required prerequisites

2

Sets, Functions, and Relationslayer 0A · tier 1
Basic Logic and Proof Techniqueslayer 0A · tier 2

Derived topics

0

No published topic currently declares this as a prerequisite.