Infrastructure
Hadoop and Distributed Storage
Reference for HDFS, MapReduce, and the shift to Spark plus object storage. Why Hadoop receded, where it still appears, and the Cloudera/Hortonworks consolidation history.
What It Is
Hadoop is an open-source framework for distributed storage and batch computation, started at Yahoo in 2006 by Doug Cutting and Mike Cafarella, based on two Google papers: the Google File System (Ghemawat, Gobioff, Leung, 2003) and MapReduce (Dean, Ghemawat, 2004). The Hadoop project bundles two core components plus a resource manager:
- HDFS (Hadoop Distributed File System): a write-once, read-many file system that splits files into 128 MB or 256 MB blocks, replicates each block (default 3x) across DataNodes, and tracks block locations on a central NameNode. Designed for sequential reads of large files on commodity disks.
- MapReduce: a programming model that expresses a job as a
mapfunction (per-record transform) followed by areducefunction (per-key aggregation), with an automatic shuffle stage in between. The runtime schedules tasks on nodes that already hold the input blocks (data locality), retries failed tasks, and writes intermediate output to local disk. - YARN (Yet Another Resource Negotiator, added in Hadoop 2 in 2013): the cluster resource manager that lets non-MapReduce frameworks (Spark, Tez, Flink) share a Hadoop cluster.
The ecosystem grew to include Hive (SQL on MapReduce), Pig (dataflow), HBase (wide-column store), ZooKeeper (coordination), and dozens of others. Commercial consolidation: Cloudera (founded 2008) and Hortonworks (founded 2011) merged in 2019; Cloudera went private in 2021. MapR was acquired by HPE in 2019. Vendor concentration tracked the decline of on-prem big-data deployments.
When You'd Use It
For greenfield projects in 2026, almost never. Two technological shifts displaced the original Hadoop stack:
Spark replaced MapReduce. Apache Spark (Zaharia et al., AMPLab, 2010) keeps intermediate results in memory across stages instead of writing to disk after every map and reduce. For iterative workloads (machine learning, graph algorithms, multi-stage SQL) this delivers 10-100x speedups. Spark also exposes a higher-level API (DataFrames, SQL, structured streaming) that is dramatically more pleasant than raw MapReduce. By 2018, most Hadoop clusters ran Spark on YARN rather than MapReduce.
Object storage replaced HDFS. Cloud object stores (Amazon S3, Google Cloud Storage, Azure Blob Storage) provide effectively unlimited capacity, 11-nines durability, and decouple storage from compute. HDFS requires colocating storage with compute, which makes elastic scaling difficult and wastes hardware. Modern table formats (Apache Iceberg, Delta Lake, Apache Hudi) provide ACID semantics on top of object storage, replacing the role HDFS-plus-Hive-metastore once played.
Where Hadoop still appears: regulated industries (banks, telcos, government) with multi-petabyte HDFS clusters built in 2014-2017 that are too expensive to migrate; latency-insensitive batch ETL written against Hive or MapReduce that still runs because rewrite cost exceeds cluster cost; some genomics and HEP physics clusters with idiosyncratic file-access patterns.
For ML practitioners, the practical heuristic: read data from S3 / GCS / Azure Blob into a Spark or DuckDB or Polars or Daft job, do not stand up a Hadoop cluster. If a project requires touching HDFS, treat it as legacy integration.
HDFS Block Replication
HDFS stores a file by splitting it into large blocks and replicating each block across several DataNodes. The NameNode records where blocks live; compute frameworks use those locations to schedule work near the data.
Data Locality Principle
Statement
When data lives on the same machines that execute jobs, moving compute to the data can be cheaper than moving data across the network.
Intuition
Classic Hadoop was designed for commodity clusters where cross-rack bandwidth was precious. Scheduling a mapper on a node that already held the input block avoided a network copy.
Failure Mode
In cloud object-storage architectures, compute and storage are deliberately separated. The locality argument weakens, and table formats plus fast networks replace HDFS placement as the main abstraction.
Problem
Why did Spark on object storage become a more common default than MapReduce on HDFS for new analytics workloads?
Notable Gotchas
HDFS small-files problem
Each HDFS file consumes a NameNode metadata entry of ~150 bytes regardless of file size. A directory of 10 million 1 KB files exhausts NameNode heap long before it stresses DataNode capacity. The classic mitigation is to pack small files into SequenceFile or Avro containers, or just use object storage which has no analogous limit.
MapReduce on a single machine is almost always slower than Pandas
The fixed cost of a MapReduce job (JVM startup, YARN scheduling, intermediate disk writes) is seconds to tens of seconds. For datasets that fit on one machine, Pandas, Polars, or DuckDB will finish before Hadoop has scheduled the first task. The "big data" framing of the late 2000s assumed datasets larger than a single machine could hold; modern hardware moved that threshold up by two orders of magnitude.
References
- White, T. "Hadoop: The Definitive Guide," 4th ed., O'Reilly, 2015. Chapters 3 (HDFS) and 7 (MapReduce internals).
- Dean, J. and Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004.
- Ghemawat, S., Gobioff, H., Leung, S. "The Google File System," SOSP 2003.
- Zaharia, M. et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI 2012.
- Apache Hadoop Documentation (https://hadoop.apache.org/docs/stable/).
- Apache Iceberg Documentation (https://iceberg.apache.org/docs/latest/), for the modern object-storage table format that replaces the HDFS-plus-Hive pattern.
Related Topics
Last reviewed: April 18, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
0No direct prerequisites are declared; this is treated as an entry point.
Derived topics
0No published topic currently declares this as a prerequisite.