Broadcast Joins in Distributed Compute
Broadcast hash join vs sort-merge join in Spark, Snowflake, and BigQuery. When the dimension table is small enough to ship to every executor, broadcast wins; when it isn't, the driver OOMs.
What It Is
A broadcast join ships one side of a join — usually a small dimension table — to every executor in the cluster, where it is held in memory as a hash table. The large fact table is then scanned once and probed against the local copy. There is no shuffle of the large side. Compare this to a sort-merge join, which partitions both sides on the join key, sorts each partition, and merges. Sort-merge is the safe default when both inputs are large; broadcast is much faster when one side fits in memory but catastrophic when it does not.
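The build-then-probe mechanic can be sketched in plain Python (toy data, hypothetical function name; real engines run the probe in parallel, one broadcast copy per executor):

```python
def broadcast_hash_join(fact_rows, dim_rows, fact_key, dim_key):
    """Build a hash table from the small (broadcast) side, then
    stream the large side once and probe the table row by row."""
    # Build phase: the in-memory copy every executor would hold.
    build = {}
    for row in dim_rows:
        build.setdefault(row[dim_key], []).append(row)
    # Probe phase: one pass over the fact table, no shuffle of it.
    out = []
    for row in fact_rows:
        for match in build.get(row[fact_key], []):
            out.append({**row, **match})
    return out

orders = [{"order_id": 1, "product_id": "A"},
          {"order_id": 2, "product_id": "B"},
          {"order_id": 3, "product_id": "A"}]
products = [{"product_id": "A", "name": "anvil"},
            {"product_id": "B", "name": "rocket"}]

joined = broadcast_hash_join(orders, products, "product_id", "product_id")
```

Note that the fact table is never re-partitioned or sorted; the entire cost of coordination is the one-time build of the hash table from the small side.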
The mechanic is the same across engines. Spark calls it broadcast hash join and gates it on spark.sql.autoBroadcastJoinThreshold (default 10 MB; commonly tuned to 8-100 MB). Snowflake's optimizer chooses broadcast vs hash-redistribute automatically based on cardinality estimates. BigQuery does the same and labels it broadcast join in the query plan. Hive originated the pattern under the name map-side join or map join, where the small table was loaded into a distributed cache and joined inside each mapper without a reduce step.
Broadcast hash join cost scales as O(n + m) per executor, where n is the large-side row count and m is the broadcast payload size, plus O(k·m) in network to ship the small side to k executors. Sort-merge scales as O(n log n + m log m) in compute and pays a full shuffle of both sides. Broadcast wins decisively when m is small enough that the network and memory cost of k copies stays below the shuffle cost saved.
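A back-of-envelope comparison makes the crossover concrete (all cluster figures below are hypothetical illustrations, not benchmarks):

```python
# Broadcast vs sort-merge: rough network bytes moved.
executors = 50
fact_bytes = 500 * 10**9   # 500 GB fact table
dim_bytes = 10 * 10**6     # 10 MB dimension table

# Broadcast: ship one copy of the small side to every executor.
broadcast_network = executors * dim_bytes        # 500 MB total

# Sort-merge: shuffle both sides across the cluster.
shuffle_network = fact_bytes + dim_bytes         # ~500 GB total

print(broadcast_network / shuffle_network)       # a tiny fraction
```

The ratio flips as the dimension table grows: at 1 GB broadcast per executor, 50 executors already means 50 GB of broadcast traffic plus 50 resident in-memory copies, which is when the shuffle starts to look cheap.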
When You'd Use It
Use a broadcast join when joining a fact table against a dimension table, lookup table, or filter set that fits in driver and executor memory. Star-schema workloads on a data warehouse are the canonical case: orders joined to product, customer, or date dimensions. ML feature pipelines that look up entity attributes from a small registry table are the same pattern. If the small side comes from an aggregation that the optimizer already knows is bounded (a LIMIT, a GROUP BY that produces few rows, a constant table), broadcast is the right plan.
Avoid broadcasting when the small side is "small relative to the cluster" but still hundreds of megabytes, when the cluster has many executors so total broadcast network traffic is high, or when the optimizer's row-count estimate is wrong because statistics are stale. In Spark, the broadcast(small_df) function in the DataFrame API and the BROADCAST(small_table) (or its MAPJOIN alias) SQL hint force the choice when the optimizer guesses wrong.
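In Spark SQL the hint form looks like this (table and column names hypothetical):

```sql
-- Force a broadcast of the dimension side regardless of size estimates.
SELECT /*+ BROADCAST(p) */ o.order_id, p.name
FROM orders o
JOIN products p ON o.product_id = p.product_id;
```

The hint overrides the threshold check but not physics: if p genuinely does not fit in memory, forcing the broadcast trades a slow plan for a failed one.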
Notable Gotchas
Broadcasting a too-large table OOMs the driver, not the executor
The driver collects the small side before broadcasting it. A table that exceeds the driver heap (say a 50 GB DataFrame against a 4 GB driver) crashes the entire job during the collect step, before any worker starts probing. Stale Hive metastore statistics or a filter that never got pushed down often produce the mis-estimate. Setting spark.sql.autoBroadcastJoinThreshold = -1 disables automatic broadcast globally and serves as a kill switch.
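A minimal sketch of the kill switch in PySpark, assuming an existing SparkSession named spark:

```python
# Disable automatic broadcast joins entirely; the optimizer then
# falls back to sort-merge (or shuffled hash) for every join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```

Explicit broadcast hints still win over this setting, so it guards against bad automatic decisions, not against a hint someone hard-coded.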
Skewed join keys defeat broadcast and sort-merge alike
A single hot key (one customer with 90 percent of orders) overloads one executor regardless of join strategy. Spark 3 added Adaptive Query Execution (AQE) skew handling that splits hot partitions; Snowflake spills to disk; BigQuery dynamically repartitions. Salting the key (append a random suffix) is the manual workaround when AQE does not help.
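The salting workaround can be sketched in plain Python (toy data, hypothetical helper names): append a random salt 0..S-1 to the hot key on the large side, and replicate each small-side row S times so every salted variant still finds its match.

```python
import random

SALTS = 4  # number of salt buckets; a tuning knob

def salt_large_side(rows, key):
    """Append a random salt so one hot key spreads over SALTS partitions."""
    return [{**r, "salted_key": (r[key], random.randrange(SALTS))} for r in rows]

def explode_small_side(rows, key):
    """Replicate every row once per salt value so all variants match."""
    return [{**r, "salted_key": (r[key], s)} for r in rows for s in range(SALTS)]

orders = [{"order_id": i, "customer_id": "hot"} for i in range(8)]
customers = [{"customer_id": "hot", "segment": "enterprise"}]

left = salt_large_side(orders, "customer_id")
right = explode_small_side(customers, "customer_id")

# Join on the salted key; every order still finds its customer row,
# but the work now spreads across SALTS partitions instead of one.
lookup = {r["salted_key"]: r for r in right}
joined = [{**l, **lookup[l["salted_key"]]} for l in left]
```

The cost is the S-fold replication of the small side, which is why salting pairs naturally with broadcast: replicating an already-small table is cheap.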
Last reviewed: April 18, 2026