

Hadoop and Distributed Storage

Reference for HDFS, MapReduce, and the shift to Spark plus object storage. Why Hadoop receded, where it still appears, and the Cloudera/Hortonworks consolidation history.


What It Is

Hadoop is an open-source framework for distributed storage and batch computation. Doug Cutting and Mike Cafarella created it within the Nutch web-crawler project, and it became a standalone project in 2006 when Cutting joined Yahoo, which funded its early development. The design follows two Google papers: the Google File System (Ghemawat, Gobioff, Leung, 2003) and MapReduce (Dean, Ghemawat, 2004). The Hadoop project bundles two core components plus a resource manager:

  • HDFS (Hadoop Distributed File System): a write-once, read-many file system that splits files into 128 MB or 256 MB blocks, replicates each block (default 3x) across DataNodes, and tracks block locations on a central NameNode. Designed for sequential reads of large files on commodity disks.
  • MapReduce: a programming model that expresses a job as a map function (per-record transform) followed by a reduce function (per-key aggregation), with an automatic shuffle stage in between. The runtime schedules tasks on nodes that already hold the input blocks (data locality), retries failed tasks, and writes intermediate output to local disk.
  • YARN (Yet Another Resource Negotiator, added in Hadoop 2 in 2013): the cluster resource manager that lets non-MapReduce frameworks (Spark, Tez, Flink) share a Hadoop cluster.
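The map/shuffle/reduce dataflow described above can be sketched in-process. This is a toy single-machine illustration (the helper names are ours, not the Hadoop API); the real runtime distributes the map and reduce stages across tasks on many nodes and spills the shuffle to local disk:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Map: apply a per-record transform emitting (key, value) pairs.
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    # Shuffle: group values by key. On a real cluster this is the
    # network-heavy stage between map tasks and reduce tasks.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups, reduce_fn):
    # Reduce: per-key aggregation.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical example: word count.
lines = ["big data", "big cluster"]
mapped = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(mapped), lambda key, values: sum(values))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

Everything a MapReduce job does beyond this (data locality, task retry, intermediate persistence) exists to make these three stages survive node failures at scale.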

The ecosystem grew to include Hive (SQL on MapReduce), Pig (dataflow), HBase (wide-column store), ZooKeeper (coordination), and dozens of others. Commercial consolidation: Cloudera (founded 2008) and Hortonworks (founded 2011) merged in 2019; Cloudera went private in 2021. MapR was acquired by HPE in 2019. Vendor concentration tracked the decline of on-prem big-data deployments.

When You'd Use It

For greenfield projects in 2026, almost never. Two technological shifts displaced the original Hadoop stack:

Spark replaced MapReduce. Apache Spark (Zaharia et al., AMPLab, 2010) keeps intermediate results in memory across stages instead of writing to disk after every map and reduce. For iterative workloads (machine learning, graph algorithms, multi-stage SQL) this delivers 10-100x speedups. Spark also exposes a higher-level API (DataFrames, SQL, structured streaming) that is dramatically more pleasant than raw MapReduce. By 2018, most Hadoop clusters ran Spark on YARN rather than MapReduce.

Object storage replaced HDFS. Cloud object stores (Amazon S3, Google Cloud Storage, Azure Blob Storage) provide effectively unlimited capacity, 11-nines durability, and decouple storage from compute. HDFS requires colocating storage with compute, which makes elastic scaling difficult and wastes hardware. Modern table formats (Apache Iceberg, Delta Lake, Apache Hudi) provide ACID semantics on top of object storage, replacing the role HDFS-plus-Hive-metastore once played.

Where Hadoop still appears: regulated industries (banks, telcos, government) with multi-petabyte HDFS clusters built in 2014-2017 that are too expensive to migrate; latency-insensitive batch ETL written against Hive or MapReduce that still runs because rewrite cost exceeds cluster cost; some genomics and high-energy physics clusters with idiosyncratic file-access patterns.

For ML practitioners, the practical heuristic: read data from S3, GCS, or Azure Blob into a Spark, DuckDB, Polars, or Daft job; do not stand up a Hadoop cluster. If a project requires touching HDFS, treat it as legacy integration.

Notable Gotchas

Watch Out

HDFS small-files problem

Each HDFS file consumes a NameNode metadata entry of ~150 bytes regardless of file size. A directory of 10 million 1 KB files exhausts NameNode heap long before it stresses DataNode capacity. The classic mitigation is to pack small files into SequenceFile or Avro containers, or to move to object storage, which has no analogous per-file metadata limit.
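The arithmetic is easy to sketch. `namenode_heap_bytes` is a hypothetical helper, and the ~150-byte figure is the rule-of-thumb per-object cost cited above; exact sizes vary by Hadoop version:

```python
# Back-of-envelope NameNode heap estimate. Assumes ~150 bytes of heap
# per metadata object (one inode per file plus one entry per block),
# per the common rule of thumb; exact sizes vary by Hadoop version.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file costs one inode entry plus one entry per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million 1 KB files: ~3 GB of NameNode heap to track ~10 GB of
# actual data -- the metadata dwarfs the payload.
print(namenode_heap_bytes(10_000_000) / 1e9, "GB")  # 3.0 GB
```

The same 10 GB packed into a few hundred container files consumes kilobytes of NameNode heap, which is why the packing mitigation works.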

Watch Out

MapReduce on a single machine is almost always slower than Pandas

The fixed cost of a MapReduce job (JVM startup, YARN scheduling, intermediate disk writes) is seconds to tens of seconds. For datasets that fit on one machine, Pandas, Polars, or DuckDB will finish before Hadoop has scheduled the first task. The "big data" framing of the late 2000s assumed datasets larger than a single machine could hold; modern hardware moved that threshold up by two orders of magnitude.
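The fixed-cost argument can be made concrete with a toy model. All overhead and throughput numbers below are illustrative assumptions, not benchmarks:

```python
# Illustrative cost model, not a benchmark: every number is an assumption.
# A job's wall time = fixed scheduling/startup overhead + scan time.
def job_seconds(gigabytes, fixed_overhead_s, gb_per_s):
    return fixed_overhead_s + gigabytes / gb_per_s

# Single machine (e.g. DuckDB or Polars): tiny overhead, modest throughput.
local = lambda gb: job_seconds(gb, fixed_overhead_s=0.5, gb_per_s=1.0)
# 10-way cluster: ~20 s of JVM startup and YARN scheduling, 10x throughput.
cluster = lambda gb: job_seconds(gb, fixed_overhead_s=20.0, gb_per_s=10.0)

print(local(10), cluster(10))  # 10.5 21.0 -- single machine wins at 10 GB

# Crossover where the cluster's overhead amortizes: local(x) == cluster(x).
crossover_gb = (20.0 - 0.5) / (1.0 - 1.0 / 10.0)
print(f"cluster pays off past ~{crossover_gb:.1f} GB")
```

Under these assumptions the cluster only wins past roughly 20 GB, and every order-of-magnitude improvement in single-node throughput pushes that crossover further out, which is the quantitative form of the paragraph above.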



Last reviewed: April 18, 2026