
Dask Parallel Python

Dask provides parallel and out-of-core NumPy, pandas, and scikit-learn through a lazy task graph and a pluggable scheduler. Best fit: DataFrame analytics that outgrow pandas but stay in Python.


What It Is

Dask is a parallel computing library for Python built around two ideas: a lazy task graph and a pluggable scheduler. User-facing collections — dask.array (parallel NumPy), dask.dataframe (parallel pandas), dask.bag (parallel iterators), and dask.delayed (arbitrary function graphs) — record operations as a directed acyclic graph of Python tasks instead of executing immediately. Calling .compute() triggers the scheduler, which traverses the graph and executes tasks in parallel.
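The lazy-graph idea can be shown with a minimal pure-Python sketch. This is a conceptual stand-in, not the real dask API: operations are recorded as a dict mapping keys to tasks, and a `compute` function walks the graph, evaluating each task after its dependencies.

```python
# Minimal sketch of a lazy task graph (assumed mechanics, not dask's
# internal format, though dask's graphs look very similar).

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A task graph: key -> literal value, or (callable, *keys-of-inputs).
graph = {
    "a": 1,
    "b": (inc, "a"),       # b = inc(a)
    "c": (inc, "a"),       # c = inc(a)
    "d": (add, "b", "c"),  # d = add(b, c)
}

def compute(graph, key, cache=None):
    """Resolve a key: evaluate dependencies first, then the task itself."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple):
        func, *dep_keys = task
        result = func(*(compute(graph, k, cache) for k in dep_keys))
    else:
        result = task  # a literal input
    cache[key] = result
    return result

print(compute(graph, "d"))  # inc(1) + inc(1) = 4
```

Nothing runs until `compute` is called, and independent tasks like `b` and `c` are exactly what a real scheduler would execute in parallel.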

The collections mirror their eager counterparts deliberately. A dask.dataframe is a horizontal partitioning of pandas DataFrames; most pandas operations have a Dask equivalent that maps to per-partition pandas calls plus a combine step. A dask.array is a tiled NumPy ndarray; element-wise and reducing ops broadcast across tiles. The familiar API is the selling point — you keep your pandas code and add import dask.dataframe as dd; df = dd.read_parquet(...).
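The per-partition-plus-combine pattern can be sketched without dask or pandas at all. Here a "partitioned column" is just a list of lists, a global mean is split into cheap per-partition partials, and a combine step merges the small summaries (assumed mechanics, mirroring what dask.dataframe does with pandas objects).

```python
# Sketch: computing a global mean over horizontal partitions.

partitions = [
    [3.0, 5.0, 7.0],       # partition 0
    [1.0, 9.0],            # partition 1
    [2.0, 4.0, 6.0, 8.0],  # partition 2
]

def partial_mean(part):
    # Per-partition step: embarrassingly parallel, no cross-talk.
    return (sum(part), len(part))

def combine(partials):
    # Combine step: merge the tiny per-partition summaries.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

mean = combine([partial_mean(p) for p in partitions])
print(mean)  # (15 + 10 + 20) / 9 = 5.0
```

Operations that decompose this way (mean, sum, count, filter) are cheap in Dask; operations that do not (median, global sort) are the expensive ones flagged below.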

Three schedulers ship in-box. The threaded scheduler runs tasks in a thread pool on one process; it suits NumPy and Arrow workloads that release the GIL and is the default for dask.array. The multiprocessing scheduler suits Python-bound work that holds the GIL. The distributed scheduler (dask.distributed) runs across multiple machines with a Client, Scheduler, and Worker topology, plus work stealing and a real-time dashboard. The distributed scheduler is the production target even on one machine.
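The "pluggable scheduler" idea maps cleanly onto stdlib executors. This sketch (a stand-in, not dask's API; with real Dask you would write `x.compute(scheduler="threads")` or connect a `dask.distributed.Client`) runs the same task list on either a thread pool or a process pool, mirroring the threads-for-GIL-releasing-work vs. processes-for-GIL-bound-work trade-off.

```python
# Sketch: the same tasks, two interchangeable "schedulers" (assumed
# stand-in using concurrent.futures, not dask's scheduler machinery).

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def square(x):
    return x * x

def run(tasks, scheduler="threads"):
    # Threads: best when the work releases the GIL (NumPy, Arrow).
    # Processes: best for pure-Python work that holds the GIL.
    pool_cls = ThreadPoolExecutor if scheduler == "threads" else ProcessPoolExecutor
    with pool_cls(max_workers=4) as pool:
        return list(pool.map(square, tasks))

print(run(range(5)))  # [0, 1, 4, 9, 16]
```

The design point is that the task list is independent of the execution backend, which is exactly what lets Dask swap a local thread pool for a multi-machine cluster without changing user code.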

When You'd Use It

Dask is the right call when the data has outgrown pandas (more than ~10 GB on a workstation), the work is DataFrame-shaped (groupby, join, filter, window), and the team prefers staying in Python over moving to Spark on a JVM cluster. The same holds for arrays: out-of-core image processing, geospatial rasters, and climate model output (Xarray is built on Dask) are the canonical NumPy fits. Embarrassingly parallel scikit-learn workflows use dask-ml or joblib.parallel_backend('dask') to scale across a cluster.

Where Dask loses: sort-heavy SQL on a large warehouse (Spark and DuckDB win), stateful RL and actor workloads (Ray wins because Dask has no actor primitive), and single-node DataFrame benchmarks, where Polars is faster. Modin offers a drop-in pandas replacement that uses Ray or Dask underneath.

The "Dask is just slow Spark" critique has a basis in fact: Spark's Catalyst optimizer and Tungsten code generation produce faster query plans on identical inputs. The counter is that Dask has no JVM boundary, no serialization across a Python-to-JVM bridge, and a simpler operations story. For Python-first teams with tens to hundreds of GB, that integration usually outweighs raw query speed.

Notable Gotchas

Watch Out

Repartitioning silently dominates Dask DataFrame cost

A dd.read_csv on many small files produces one partition per file, which can mean tens of thousands of tiny partitions. Each task has fixed overhead in the scheduler (microseconds to milliseconds); ten thousand 1 MB partitions run vastly slower than one hundred 100 MB partitions doing the same work. Call df.repartition(npartitions=...) or df.repartition(partition_size="100MB") after reading. The dashboard's task stream view makes this immediately visible.
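A back-of-envelope model makes the partition-count trade-off concrete. The overhead and throughput numbers below are illustrative assumptions, not measured Dask figures: each task pays a fixed scheduling cost, while the useful work is the same however the data is split.

```python
# Sketch: estimated wall time as a function of partition count
# (overhead_s and mb_per_s are assumed illustrative constants).

def estimated_seconds(n_partitions, total_mb, overhead_s=0.001, mb_per_s=200.0):
    work = total_mb / mb_per_s               # useful compute, split-independent
    scheduling = n_partitions * overhead_s   # fixed per-task cost
    return work + scheduling

tiny = estimated_seconds(10_000, total_mb=10_000)  # 1 MB partitions
big = estimated_seconds(100, total_mb=10_000)      # 100 MB partitions
print(tiny, big)  # 60.0 vs 50.1
```

The model understates the real-world gap (tiny partitions also bloat the graph and starve vectorized kernels), but it captures why the fix is always "fewer, larger partitions".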

Watch Out

dask.dataframe is not a complete pandas replacement

Operations that require global ordering or cross-partition state (e.g. df.set_index on an unsorted column, df.median, complex window functions, multi-column sort) trigger expensive shuffles or are not implemented. The Dask docs flag these per-method. Plan the workflow to keep partition-local operations dominant; reach for DuckDB or Spark for the truly global queries.
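What makes these operations expensive is the shuffle they force. This pure-Python sketch (assumed mechanics, not dask's shuffle implementation) re-buckets rows so that all rows sharing a key land in one partition; note that every input partition scatters rows to every output partition, an all-to-all data movement.

```python
# Sketch of a hash shuffle: the all-to-all step behind set_index,
# cross-partition groupby, and global sorts.

def shuffle(partitions, n_out, key):
    """Re-bucket rows so all rows with the same key share a partition."""
    out = [[] for _ in range(n_out)]
    for part in partitions:      # every input partition...
        for row in part:         # ...sends rows to every output partition
            out[hash(row[key]) % n_out].append(row)
    return out

parts = [
    [{"user": "a", "v": 1}, {"user": "b", "v": 2}],
    [{"user": "a", "v": 3}, {"user": "c", "v": 4}],
]
shuffled = shuffle(parts, n_out=2, key="user")
# After the (expensive) shuffle, all "a" rows are co-located, so a
# per-key median can again be computed partition-locally.
```

Partition-local operations skip this step entirely, which is why keeping them dominant is the main Dask DataFrame performance lever.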


Last reviewed: April 18, 2026
