

Hadoop and Distributed Storage

Reference for HDFS, MapReduce, and the shift to Spark plus object storage. Why Hadoop receded, where it still appears, and the Cloudera/Hortonworks consolidation history.


What It Is

Hadoop is an open-source framework for distributed storage and batch computation. Doug Cutting and Mike Cafarella created it within the Nutch web-crawler project, and it became a standalone project in 2006 when Cutting joined Yahoo, which funded its early development. The design follows two Google papers: the Google File System (Ghemawat, Gobioff, Leung, 2003) and MapReduce (Dean, Ghemawat, 2004). The Hadoop project bundles two core components plus a resource manager:

  • HDFS (Hadoop Distributed File System): a write-once, read-many file system that splits files into 128 MB or 256 MB blocks, replicates each block (default 3x) across DataNodes, and tracks block locations on a central NameNode. Designed for sequential reads of large files on commodity disks.
  • MapReduce: a programming model that expresses a job as a map function (per-record transform) followed by a reduce function (per-key aggregation), with an automatic shuffle stage in between. The runtime schedules tasks on nodes that already hold the input blocks (data locality), retries failed tasks, and writes intermediate output to local disk.
  • YARN (Yet Another Resource Negotiator, added in Hadoop 2 in 2013): the cluster resource manager that lets non-MapReduce frameworks (Spark, Tez, Flink) share a Hadoop cluster.
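The map/shuffle/reduce dataflow described above can be sketched in-process. This is a toy single-machine illustration (the helper names are ours, not the Hadoop API); the real runtime distributes the map and reduce stages across tasks on many nodes and spills the shuffle to local disk:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Map: apply a per-record transform emitting (key, value) pairs.
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    # Shuffle: group values by key. On a real cluster this is the
    # network-heavy stage between map tasks and reduce tasks.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups, reduce_fn):
    # Reduce: per-key aggregation.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical example: word count.
lines = ["big data", "big cluster"]
mapped = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(mapped), lambda key, values: sum(values))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

Everything a MapReduce job does beyond this (data locality, task retry, intermediate persistence) exists to make these three stages survive node failures at scale.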

The ecosystem grew to include Hive (SQL on MapReduce), Pig (dataflow), HBase (wide-column store), ZooKeeper (coordination), and dozens of others. Commercial consolidation: Cloudera (founded 2008) and Hortonworks (founded 2011) merged in 2019; Cloudera went private in 2021. MapR was acquired by HPE in 2019. Vendor concentration tracked the decline of on-prem big-data deployments.

When You'd Use It

For greenfield projects in 2026, almost never. Two technological shifts displaced the original Hadoop stack:

Spark replaced MapReduce. Apache Spark (Zaharia et al., AMPLab, 2010) keeps intermediate results in memory across stages instead of writing to disk after every map and reduce. For iterative workloads (machine learning, graph algorithms, multi-stage SQL) this delivers 10-100x speedups. Spark also exposes a higher-level API (DataFrames, SQL, structured streaming) that is dramatically more pleasant than raw MapReduce. By 2018, most Hadoop clusters ran Spark on YARN rather than MapReduce.

Object storage replaced HDFS. Cloud object stores (Amazon S3, Google Cloud Storage, Azure Blob Storage) provide effectively unlimited capacity, 11-nines durability, and decouple storage from compute. HDFS requires colocating storage with compute, which makes elastic scaling difficult and wastes hardware. Modern table formats (Apache Iceberg, Delta Lake, Apache Hudi) provide ACID semantics on top of object storage, replacing the role HDFS-plus-Hive-metastore once played.

Where Hadoop still appears: regulated industries (banks, telcos, government) with multi-petabyte HDFS clusters built in 2014-2017 that are too expensive to migrate; latency-insensitive batch ETL written against Hive or MapReduce that still runs because rewrite cost exceeds cluster cost; some genomics and high-energy physics clusters with idiosyncratic file-access patterns.

For ML practitioners, the practical heuristic: read data from S3, GCS, or Azure Blob into a Spark, DuckDB, Polars, or Daft job; do not stand up a Hadoop cluster. If a project requires touching HDFS, treat it as legacy integration.

Notable Gotchas

Watch Out

HDFS small-files problem

Each HDFS file consumes a NameNode metadata entry of ~150 bytes regardless of file size. A directory of 10 million 1 KB files exhausts NameNode heap long before it stresses DataNode capacity. The classic mitigation is to pack small files into SequenceFile or Avro containers, or to move to object storage, which has no analogous per-file metadata limit.
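The arithmetic is easy to sketch. `namenode_heap_bytes` is a hypothetical helper, and the ~150-byte figure is the rule-of-thumb per-object cost cited above; exact sizes vary by Hadoop version:

```python
# Back-of-envelope NameNode heap estimate. Assumes ~150 bytes of heap
# per metadata object (one inode per file plus one entry per block),
# per the common rule of thumb; exact sizes vary by Hadoop version.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file costs one inode entry plus one entry per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million 1 KB files: ~3 GB of NameNode heap to track ~10 GB of
# actual data -- the metadata dwarfs the payload.
print(namenode_heap_bytes(10_000_000) / 1e9, "GB")  # 3.0 GB
```

The same 10 GB packed into a few hundred container files consumes kilobytes of NameNode heap, which is why the packing mitigation works.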

Watch Out

MapReduce on a single machine is almost always slower than Pandas

The fixed cost of a MapReduce job (JVM startup, YARN scheduling, intermediate disk writes) is seconds to tens of seconds. For datasets that fit on one machine, Pandas, Polars, or DuckDB will finish before Hadoop has scheduled the first task. The "big data" framing of the late 2000s assumed datasets larger than a single machine could hold; modern hardware moved that threshold up by two orders of magnitude.
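The fixed-cost argument can be made concrete with a toy model. All overhead and throughput numbers below are illustrative assumptions, not benchmarks:

```python
# Illustrative cost model, not a benchmark: every number is an assumption.
# A job's wall time = fixed scheduling/startup overhead + scan time.
def job_seconds(gigabytes, fixed_overhead_s, gb_per_s):
    return fixed_overhead_s + gigabytes / gb_per_s

# Single machine (e.g. DuckDB or Polars): tiny overhead, modest throughput.
local = lambda gb: job_seconds(gb, fixed_overhead_s=0.5, gb_per_s=1.0)
# 10-way cluster: ~20 s of JVM startup and YARN scheduling, 10x throughput.
cluster = lambda gb: job_seconds(gb, fixed_overhead_s=20.0, gb_per_s=10.0)

print(local(10), cluster(10))  # 10.5 21.0 -- single machine wins at 10 GB

# Crossover where the cluster's overhead amortizes: local(x) == cluster(x).
crossover_gb = (20.0 - 0.5) / (1.0 - 1.0 / 10.0)
print(f"cluster pays off past ~{crossover_gb:.1f} GB")
```

Under these assumptions the cluster only wins past roughly 20 GB, and every order-of-magnitude improvement in single-node throughput pushes that crossover further out, which is the quantitative form of the paragraph above.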



Last reviewed: April 18, 2026