Foundations
Relational Algebra
The mathematical foundation of SQL and relational databases. Selection, projection, join, set operations, and Codd's theorem connecting algebra to relational calculus.
Why This Matters
Every SQL query you write is compiled into a relational algebra expression before execution. The query optimizer rewrites that expression using algebraic identities (pushing selections below joins, reordering joins by estimated cardinality) to find an efficient execution plan. Understanding relational algebra is understanding what the database engine actually computes.
For ML practitioners, relational algebra matters for two reasons. First, feature engineering pipelines are sequences of relational operations: filtering rows, selecting columns, joining tables, grouping and aggregating. Knowing the algebra clarifies what these operations do and how to compose them efficiently. Second, the theoretical result connecting relational algebra to relational calculus (Codd's theorem) is a clean example of an equivalence between procedural and declarative specifications, a pattern that recurs in programming language theory and logic.
Formal Setup
Relation
A relation with schema is a finite subset of the Cartesian product . Each is an attribute (column name) and is its domain (the set of permitted values). Each element is a tuple (row). The arity of is (number of attributes). The cardinality is the number of tuples.
Two properties distinguish relations from arbitrary tables: (1) tuples are unordered (a relation is a set, not a list), and (2) all tuples are distinct (no duplicate rows). Real SQL databases relax both properties (rows have order, duplicates are allowed), which is why SQL operates on multisets (bags), not sets.
Relation Schema and Instance
A relation schema specifies attribute names and domains. A relation instance over schema is a finite set of tuples conforming to . A database schema is a collection of relation schemas. A database instance is a collection of relation instances, one for each schema.
The Five Fundamental Operations
Codd (1970) showed that five operations suffice to express all queries computable in the relational model. Every derived operation (join, intersection, division) can be written in terms of these five.
Selection
The selection operator filters tuples by a predicate:
where is a Boolean formula over the attributes of , using comparisons (, , , , , ) and logical connectives (, , ).
SQL equivalent: WHERE clause. Example: returns all employees older than 30.
Projection
The projection operator selects a subset of attributes and removes duplicates:
where denotes the value of attribute in tuple . Since the result is a set, duplicate tuples are eliminated.
SQL equivalent: SELECT DISTINCT A1, ..., Ak. Example: returns the set of (name, department) pairs.
Union
The union of two union-compatible relations (same schema) is:
Union-compatibility requires that and have the same number of attributes with matching domains. SQL equivalent: UNION.
Set Difference
The set difference of two union-compatible relations:
SQL equivalent: EXCEPT. This is the only fundamental operation that is not monotone: adding tuples to can remove tuples from the result.
Cartesian Product
The Cartesian product concatenates every tuple in with every tuple in :
If has attributes and tuples, and has attributes and tuples, then has attributes and tuples. SQL equivalent: CROSS JOIN or listing multiple tables in FROM without a join condition.
Derived Operations
The following operations are defined in terms of the five fundamentals but are used so frequently that they have dedicated notation.
Natural Join
The natural join combines tuples from and that agree on all common attributes:
where are the attributes common to both schemas, and is the union of all attributes (without duplicating the common ones).
SQL equivalent: NATURAL JOIN or JOIN ... ON R.A = S.A.
Theta Join
The theta join is a Cartesian product followed by selection:
When uses only equality comparisons, this is an equijoin. The natural join is an equijoin on all shared attributes, followed by duplicate attribute removal.
Intersection
Equivalently, . SQL equivalent: INTERSECT.
Division
The division returns tuples in the projection of that are associated with every tuple in . If has schema and has schema :
Division answers "for all" queries: "find suppliers who supply ALL parts in ." SQL has no direct DIVIDE BY operator; this requires a double negation pattern with NOT EXISTS.
Algebraic Identities and Query Optimization
The query optimizer uses algebraic equivalences to rewrite expressions into more efficient forms. The most important identities:
Selection pushdown:
Pushing selection below a join is the single most important query optimization. If and the selection keeps 1% of rows, computing the join on rows instead of is a 100x speedup.
Projection pushdown:
where and include the join attributes plus the attributes in from each relation.
Join commutativity and associativity:
Join reordering is a combinatorial optimization problem. For tables, there are possible join orderings (Catalan numbers). The optimizer uses dynamic programming (System R algorithm) or heuristics to find a good order.
Codd's Theorem
Codd's Theorem: Equivalence of Relational Algebra and Relational Calculus
Statement
The relational algebra and the domain-independent relational calculus have exactly the same expressive power. For every relational algebra expression , there exists a domain-independent relational calculus formula such that and define the same query (produce the same result on every database instance). Conversely, for every domain-independent relational calculus formula , there exists a relational algebra expression defining the same query.
Intuition
Relational algebra is procedural: it specifies a sequence of operations to compute the result. Relational calculus is declarative: it specifies a logical condition the result must satisfy. Codd's theorem says these two approaches are equivalent. Any query you can describe ("give me all employees who work in a department with budget over 1 million USD") can be computed by a sequence of selections, projections, joins, unions, and differences, and vice versa.
This is why SQL works: SQL's SELECT ... FROM ... WHERE syntax is based on relational calculus (declarative), but the database can compile it into relational algebra (procedural) without losing or gaining expressive power.
Proof Sketch
Algebra to Calculus (easier direction): Induction on the structure of algebra expressions. Each fundamental operation has a direct calculus translation:
- becomes
- becomes
- becomes
- becomes
- becomes
Calculus to Algebra (harder direction): Induction on formula structure, translating each logical connective and quantifier into algebraic operations. The existential quantifier maps to projection. Universal quantification is rewritten as , which maps to set difference. The critical restriction to domain-independent formulas ensures the result is always finite.
Why It Matters
Codd's theorem is the theoretical foundation of the entire relational database industry. It guarantees that SQL (based on calculus) and the query execution engine (based on algebra) are equivalent. The optimizer can freely rewrite between algebraic forms because the algebra is closed under equivalence-preserving transformations. Without this theorem, there is no guarantee that optimized queries return the same results as the original.
The theorem also defines the boundary of relational expressiveness. Queries that require recursion (transitive closure, reachability in graphs) are outside relational algebra and relational calculus. This motivated extensions like Datalog and SQL's recursive CTEs.
Failure Mode
The equivalence requires domain independence: calculus formulas must produce finite results regardless of the domain of interpretation. The formula is domain-dependent because its result depends on what values exist outside . If the domain is all integers, the result is infinite. Domain independence is undecidable in general, so practical systems restrict the syntax (safe-range calculus) to guarantee it.
The equivalence also assumes set semantics (no duplicates). SQL operates on bags (multisets), which have different algebraic laws. Bag relational algebra is strictly more expressive than set relational algebra for certain aggregate queries.
Beyond Relational Algebra
Datalog and recursion. Relational algebra cannot express recursive queries. The query "find all nodes reachable from node in a graph" requires transitive closure, which needs unbounded iteration. Datalog extends relational calculus with recursive rules. SQL added recursive common table expressions (WITH RECURSIVE) to handle this.
Aggregation. The standard five operations do not include aggregation (, , ). Aggregation is added as a separate operator (gamma) in extended relational algebra: groups by attributes and applies aggregate function agg to attribute .
Relevance to ML pipelines. Feature engineering for tabular ML is a sequence of relational operations. Joining a customer table with a transactions table, filtering by date, grouping by customer and aggregating transaction counts: this is pure relational algebra. DataFrame libraries (pandas, Polars, Spark DataFrames) implement relational algebra over bags, with aggregation and window functions. Understanding the algebra helps reason about correctness (does this join produce duplicates?) and efficiency (should this filter go before or after the join?).
Common Confusions
Natural join is not Cartesian product
The Cartesian product pairs every tuple in with every tuple in . The natural join only keeps pairs that agree on shared attributes. If and share no attributes, the natural join degenerates to the Cartesian product. If they share all attributes, the natural join degenerates to intersection. The common case is between these extremes.
Projection removes duplicates (in set semantics)
In relational algebra (set semantics), always eliminates duplicate tuples. In SQL, SELECT does not remove duplicates; you need SELECT DISTINCT. This is because SQL uses bag semantics, where duplicates carry information (they represent multiplicity). The mismatch between set algebra and bag SQL is a common source of bugs.
Division is not the inverse of Cartesian product
does not undo . Given and , the division returns the set of values that are paired with ALL values in . It answers universal quantification queries, not inverse-product queries.
Key Takeaways
- A relation is a finite set of tuples over a fixed schema
- Five fundamental operations (selection, projection, union, set difference, Cartesian product) generate all relational queries
- Join, intersection, and division are derived from the five fundamentals
- Selection pushdown below joins is the single query optimization with the largest practical effect on runtime
- Codd's theorem: relational algebra = domain-independent relational calculus in expressive power
- Relational algebra cannot express recursion or aggregation; extensions are needed
- SQL operates on bags (multisets), not sets; the algebraic laws differ
Exercises
Problem
Given relations and , compute: (a) (natural join on ), (b) , and (c) .
Problem
Write a relational algebra expression for the SQL query:
SELECT DISTINCT E.name
FROM Employees E, Departments D
WHERE E.dept_id = D.dept_id AND D.budget > 1000000
Problem
Prove that intersection can be expressed using only union and set difference. Then prove that the five fundamental operations are not all independent: show that intersection can be derived from the other four, but set difference cannot be derived from selection, projection, union, Cartesian product, and intersection.
References
Canonical:
- Codd, "A Relational Model of Data for Large Shared Data Banks" (1970), the founding paper of relational databases
- Abiteboul, Hull, Vianu, Foundations of Databases (1995), Chapters 4-5, rigorous treatment of relational algebra and calculus
- Ullman, Principles of Database and Knowledge-Base Systems, Vol. 1 (1988), Chapters 3-5
Textbooks:
- Ramakrishnan & Gehrke, Database Management Systems (3rd ed., 2003), Chapters 4-5
- Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book (2nd ed., 2008), Chapters 2, 5
- Silberschatz, Korth, Sudarshan, Database System Concepts (7th ed., 2019), Chapters 2, 6
Query Optimization:
- Selinger et al., "Access Path Selection in a Relational Database Management System" (1979), the System R optimizer paper
Next Topics
Building on relational foundations:
- Sets, functions, and relations: the set-theoretic foundations that relational algebra rests on
- Basic logic and proof techniques: the logical formalism behind relational calculus
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Sets, Functions, and RelationsLayer 0A
- Basic Logic and Proof TechniquesLayer 0A