Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.Disclaimer

Foundations

Relational Algebra

The mathematical foundation of SQL and relational databases. Selection, projection, join, set operations, and Codd's theorem connecting algebra to relational calculus.

CoreTier 2Stable~40 min
0

Why This Matters

Every SQL query you write is compiled into a relational algebra expression before execution. The query optimizer rewrites that expression using algebraic identities (pushing selections below joins, reordering joins by estimated cardinality) to find an efficient execution plan. Understanding relational algebra is understanding what the database engine actually computes.

For ML practitioners, relational algebra matters for two reasons. First, feature engineering pipelines are sequences of relational operations: filtering rows, selecting columns, joining tables, grouping and aggregating. Knowing the algebra clarifies what these operations do and how to compose them efficiently. Second, the theoretical result connecting relational algebra to relational calculus (Codd's theorem) is a clean example of an equivalence between procedural and declarative specifications, a pattern that recurs in programming language theory and logic.

Formal Setup

Definition

Relation

A relation RR with schema (A1:D1,A2:D2,,An:Dn)(A_1: D_1, A_2: D_2, \ldots, A_n: D_n) is a finite subset of the Cartesian product D1×D2××DnD_1 \times D_2 \times \cdots \times D_n. Each AiA_i is an attribute (column name) and DiD_i is its domain (the set of permitted values). Each element tRt \in R is a tuple (row). The arity of RR is nn (number of attributes). The cardinality R|R| is the number of tuples.

Two properties distinguish relations from arbitrary tables: (1) tuples are unordered (a relation is a set, not a list), and (2) all tuples are distinct (no duplicate rows). Real SQL databases relax both properties (rows have order, duplicates are allowed), which is why SQL operates on multisets (bags), not sets.

Definition

Relation Schema and Instance

A relation schema S=(A1:D1,,An:Dn)S = (A_1: D_1, \ldots, A_n: D_n) specifies attribute names and domains. A relation instance RR over schema SS is a finite set of tuples conforming to SS. A database schema is a collection of relation schemas. A database instance is a collection of relation instances, one for each schema.

The Five Fundamental Operations

Codd (1970) showed that five operations suffice to express all queries computable in the relational model. Every derived operation (join, intersection, division) can be written in terms of these five.

Definition

Selection

The selection operator filters tuples by a predicate:

σθ(R)={tR:θ(t)=true}\sigma_\theta(R) = \{t \in R : \theta(t) = \text{true}\}

where θ\theta is a Boolean formula over the attributes of RR, using comparisons (==, \neq, <<, >>, \leq, \geq) and logical connectives (\land, \lor, ¬\lnot).

SQL equivalent: WHERE clause. Example: σage>30(Employees)\sigma_{\text{age} > 30}(\text{Employees}) returns all employees older than 30.

Definition

Projection

The projection operator selects a subset of attributes and removes duplicates:

πA1,,Ak(R)={(t[A1],,t[Ak]):tR}\pi_{A_1, \ldots, A_k}(R) = \{(t[A_1], \ldots, t[A_k]) : t \in R\}

where t[Ai]t[A_i] denotes the value of attribute AiA_i in tuple tt. Since the result is a set, duplicate tuples are eliminated.

SQL equivalent: SELECT DISTINCT A1, ..., Ak. Example: πname, dept(Employees)\pi_{\text{name, dept}}(\text{Employees}) returns the set of (name, department) pairs.

Definition

Union

The union of two union-compatible relations (same schema) is:

RS={t:tR or tS}R \cup S = \{t : t \in R \text{ or } t \in S\}

Union-compatibility requires that RR and SS have the same number of attributes with matching domains. SQL equivalent: UNION.

Definition

Set Difference

The set difference of two union-compatible relations:

RS={t:tR and tS}R - S = \{t : t \in R \text{ and } t \notin S\}

SQL equivalent: EXCEPT. This is the only fundamental operation that is not monotone: adding tuples to SS can remove tuples from the result.

Definition

Cartesian Product

The Cartesian product concatenates every tuple in RR with every tuple in SS:

R×S={(tR,tS):tRR,tSS}R \times S = \{(t_R, t_S) : t_R \in R, t_S \in S\}

If RR has nn attributes and R|R| tuples, and SS has mm attributes and S|S| tuples, then R×SR \times S has n+mn + m attributes and R×S|R| \times |S| tuples. SQL equivalent: CROSS JOIN or listing multiple tables in FROM without a join condition.

Derived Operations

The following operations are defined in terms of the five fundamentals but are used so frequently that they have dedicated notation.

Definition

Natural Join

The natural join combines tuples from RR and SS that agree on all common attributes:

RS=πL(σR.A1=S.A1R.Ak=S.Ak(R×S))R \bowtie S = \pi_L(\sigma_{R.A_1 = S.A_1 \land \cdots \land R.A_k = S.A_k}(R \times S))

where A1,,AkA_1, \ldots, A_k are the attributes common to both schemas, and LL is the union of all attributes (without duplicating the common ones).

SQL equivalent: NATURAL JOIN or JOIN ... ON R.A = S.A.

Definition

Theta Join

The theta join is a Cartesian product followed by selection:

RθS=σθ(R×S)R \bowtie_\theta S = \sigma_\theta(R \times S)

When θ\theta uses only equality comparisons, this is an equijoin. The natural join is an equijoin on all shared attributes, followed by duplicate attribute removal.

Definition

Intersection

RS=R(RS)R \cap S = R - (R - S)

Equivalently, RS={t:tR and tS}R \cap S = \{t : t \in R \text{ and } t \in S\}. SQL equivalent: INTERSECT.

Definition

Division

The division R÷SR \div S returns tuples in the projection of RR that are associated with every tuple in SS. If RR has schema (A,B)(A, B) and SS has schema (B)(B):

R÷S={t[A]:tR}πA(({t[A]:tR}×S)R)R \div S = \{t[A] : t \in R\} - \pi_A((\{t[A] : t \in R\} \times S) - R)

Division answers "for all" queries: "find suppliers who supply ALL parts in SS." SQL has no direct DIVIDE BY operator; this requires a double negation pattern with NOT EXISTS.

Algebraic Identities and Query Optimization

The query optimizer uses algebraic equivalences to rewrite expressions into more efficient forms. The most important identities:

Selection pushdown: σθ1θ2(R)=σθ1(σθ2(R))\sigma_{\theta_1 \land \theta_2}(R) = \sigma_{\theta_1}(\sigma_{\theta_2}(R)) σθ(RS)=σθ(R)S(when θ involves only attributes of R)\sigma_\theta(R \bowtie S) = \sigma_\theta(R) \bowtie S \quad \text{(when } \theta \text{ involves only attributes of } R\text{)}

Pushing selection below a join is the single most important query optimization. If R=106|R| = 10^6 and the selection keeps 1% of rows, computing the join on 10410^4 rows instead of 10610^6 is a 100x speedup.

Projection pushdown: πL(RS)=πL(πL1(R)πL2(S))\pi_L(R \bowtie S) = \pi_L(\pi_{L_1}(R) \bowtie \pi_{L_2}(S))

where L1L_1 and L2L_2 include the join attributes plus the attributes in LL from each relation.

Join commutativity and associativity: RS=SRR \bowtie S = S \bowtie R (RS)T=R(ST)(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)

Join reordering is a combinatorial optimization problem. For nn tables, there are (2(n1))!(n1)!\frac{(2(n-1))!}{(n-1)!} possible join orderings (Catalan numbers). The optimizer uses dynamic programming (System R algorithm) or heuristics to find a good order.

Codd's Theorem

Theorem

Codd's Theorem: Equivalence of Relational Algebra and Relational Calculus

Statement

The relational algebra and the domain-independent relational calculus have exactly the same expressive power. For every relational algebra expression EE, there exists a domain-independent relational calculus formula ϕ\phi such that EE and ϕ\phi define the same query (produce the same result on every database instance). Conversely, for every domain-independent relational calculus formula ϕ\phi, there exists a relational algebra expression EE defining the same query.

Intuition

Relational algebra is procedural: it specifies a sequence of operations to compute the result. Relational calculus is declarative: it specifies a logical condition the result must satisfy. Codd's theorem says these two approaches are equivalent. Any query you can describe ("give me all employees who work in a department with budget over 1 million USD") can be computed by a sequence of selections, projections, joins, unions, and differences, and vice versa.

This is why SQL works: SQL's SELECT ... FROM ... WHERE syntax is based on relational calculus (declarative), but the database can compile it into relational algebra (procedural) without losing or gaining expressive power.

Proof Sketch

Algebra to Calculus (easier direction): Induction on the structure of algebra expressions. Each fundamental operation has a direct calculus translation:

  • σθ(R)\sigma_\theta(R) becomes {t:R(t)θ(t)}\{t : R(t) \land \theta(t)\}
  • πA1,,Ak(R)\pi_{A_1, \ldots, A_k}(R) becomes {(x1,,xk):y1,,ymR(x1,,xk,y1,,ym)}\{(x_1, \ldots, x_k) : \exists y_1, \ldots, y_m \, R(x_1, \ldots, x_k, y_1, \ldots, y_m)\}
  • RSR \cup S becomes {t:R(t)S(t)}\{t : R(t) \lor S(t)\}
  • RSR - S becomes {t:R(t)¬S(t)}\{t : R(t) \land \lnot S(t)\}
  • R×SR \times S becomes {(tR,tS):R(tR)S(tS)}\{(t_R, t_S) : R(t_R) \land S(t_S)\}

Calculus to Algebra (harder direction): Induction on formula structure, translating each logical connective and quantifier into algebraic operations. The existential quantifier \exists maps to projection. Universal quantification xϕ(x)\forall x \, \phi(x) is rewritten as ¬x¬ϕ(x)\lnot \exists x \, \lnot \phi(x), which maps to set difference. The critical restriction to domain-independent formulas ensures the result is always finite.

Why It Matters

Codd's theorem is the theoretical foundation of the entire relational database industry. It guarantees that SQL (based on calculus) and the query execution engine (based on algebra) are equivalent. The optimizer can freely rewrite between algebraic forms because the algebra is closed under equivalence-preserving transformations. Without this theorem, there is no guarantee that optimized queries return the same results as the original.

The theorem also defines the boundary of relational expressiveness. Queries that require recursion (transitive closure, reachability in graphs) are outside relational algebra and relational calculus. This motivated extensions like Datalog and SQL's recursive CTEs.

Failure Mode

The equivalence requires domain independence: calculus formulas must produce finite results regardless of the domain of interpretation. The formula {x:¬R(x)}\{x : \lnot R(x)\} is domain-dependent because its result depends on what values exist outside RR. If the domain is all integers, the result is infinite. Domain independence is undecidable in general, so practical systems restrict the syntax (safe-range calculus) to guarantee it.

The equivalence also assumes set semantics (no duplicates). SQL operates on bags (multisets), which have different algebraic laws. Bag relational algebra is strictly more expressive than set relational algebra for certain aggregate queries.

Beyond Relational Algebra

Datalog and recursion. Relational algebra cannot express recursive queries. The query "find all nodes reachable from node ss in a graph" requires transitive closure, which needs unbounded iteration. Datalog extends relational calculus with recursive rules. SQL added recursive common table expressions (WITH RECURSIVE) to handle this.

Aggregation. The standard five operations do not include aggregation (SUM\text{SUM}, COUNT\text{COUNT}, AVG\text{AVG}). Aggregation is added as a separate operator γ\gamma (gamma) in extended relational algebra: γG,agg(A)(R)\gamma_{G, \text{agg}(A)}(R) groups by attributes GG and applies aggregate function agg to attribute AA.

Relevance to ML pipelines. Feature engineering for tabular ML is a sequence of relational operations. Joining a customer table with a transactions table, filtering by date, grouping by customer and aggregating transaction counts: this is pure relational algebra. DataFrame libraries (pandas, Polars, Spark DataFrames) implement relational algebra over bags, with aggregation and window functions. Understanding the algebra helps reason about correctness (does this join produce duplicates?) and efficiency (should this filter go before or after the join?).

Common Confusions

Watch Out

Natural join is not Cartesian product

The Cartesian product R×SR \times S pairs every tuple in RR with every tuple in SS. The natural join RSR \bowtie S only keeps pairs that agree on shared attributes. If RR and SS share no attributes, the natural join degenerates to the Cartesian product. If they share all attributes, the natural join degenerates to intersection. The common case is between these extremes.

Watch Out

Projection removes duplicates (in set semantics)

In relational algebra (set semantics), π\pi always eliminates duplicate tuples. In SQL, SELECT does not remove duplicates; you need SELECT DISTINCT. This is because SQL uses bag semantics, where duplicates carry information (they represent multiplicity). The mismatch between set algebra and bag SQL is a common source of bugs.

Watch Out

Division is not the inverse of Cartesian product

R÷SR \div S does not undo R×SR \times S. Given RA×BR \subseteq A \times B and SBS \subseteq B, the division R÷SR \div S returns the set of aa values that are paired with ALL bb values in SS. It answers universal quantification queries, not inverse-product queries.

Key Takeaways

  • A relation is a finite set of tuples over a fixed schema
  • Five fundamental operations (selection, projection, union, set difference, Cartesian product) generate all relational queries
  • Join, intersection, and division are derived from the five fundamentals
  • Selection pushdown below joins is the single query optimization with the largest practical effect on runtime
  • Codd's theorem: relational algebra = domain-independent relational calculus in expressive power
  • Relational algebra cannot express recursion or aggregation; extensions are needed
  • SQL operates on bags (multisets), not sets; the algebraic laws differ

Exercises

ExerciseCore

Problem

Given relations R(A,B)={(1,2),(1,3),(2,3)}R(A, B) = \{(1, 2), (1, 3), (2, 3)\} and S(B,C)={(2,x),(3,y)}S(B, C) = \{(2, x), (3, y)\}, compute: (a) RSR \bowtie S (natural join on BB), (b) πA(RS)\pi_A(R \bowtie S), and (c) R÷πB(S)R \div \pi_B(S).

ExerciseCore

Problem

Write a relational algebra expression for the SQL query:

SELECT DISTINCT E.name
FROM Employees E, Departments D
WHERE E.dept_id = D.dept_id AND D.budget > 1000000
ExerciseAdvanced

Problem

Prove that intersection can be expressed using only union and set difference. Then prove that the five fundamental operations are not all independent: show that intersection can be derived from the other four, but set difference cannot be derived from selection, projection, union, Cartesian product, and intersection.

References

Canonical:

  • Codd, "A Relational Model of Data for Large Shared Data Banks" (1970), the founding paper of relational databases
  • Abiteboul, Hull, Vianu, Foundations of Databases (1995), Chapters 4-5, rigorous treatment of relational algebra and calculus
  • Ullman, Principles of Database and Knowledge-Base Systems, Vol. 1 (1988), Chapters 3-5

Textbooks:

  • Ramakrishnan & Gehrke, Database Management Systems (3rd ed., 2003), Chapters 4-5
  • Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book (2nd ed., 2008), Chapters 2, 5
  • Silberschatz, Korth, Sudarshan, Database System Concepts (7th ed., 2019), Chapters 2, 6

Query Optimization:

  • Selinger et al., "Access Path Selection in a Relational Database Management System" (1979), the System R optimizer paper

Next Topics

Building on relational foundations:

  • Sets, functions, and relations: the set-theoretic foundations that relational algebra rests on
  • Basic logic and proof techniques: the logical formalism behind relational calculus

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.