
Pandas and NumPy Fundamentals

NumPy ndarray as the foundation for vectorized scientific Python; pandas DataFrame on top. The vectorization speedup, the SettingWithCopyWarning footgun, and what changed with pandas 2.x copy-on-write and Arrow backing.


What It Is

NumPy provides the ndarray: a contiguous, typed, n-dimensional buffer with a strided layout, plus a library of element-wise and reducing operations implemented in C. A vectorized expression like a * b + c dispatches to compiled C loops that avoid per-element Python interpreter overhead. The typical vectorized speedup over a Python for loop is 10× to 100×, depending on the operation and array size.
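A minimal sketch of the gap, comparing the vectorized expression against the equivalent element-by-element Python loop on the same data (the array size here is arbitrary, chosen just to make the comparison meaningful):

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)
c = 2.0

# Vectorized: one compiled C loop over the whole buffer.
result_vec = a * b + c

# Equivalent Python loop: one interpreter round-trip per element.
result_loop = np.empty_like(a)
for i in range(len(a)):
    result_loop[i] = a[i] * b[i] + c

# Identical results; only the execution path differs.
assert np.allclose(result_vec, result_loop)
```

Timing the two with timeit on a typical workstation shows the loop version sitting in the 10× to 100× slower band the text describes.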

Pandas builds the DataFrame and Series on top of NumPy (historically) or Arrow (in pandas 2.x with dtype_backend="pyarrow"). A DataFrame is a column-major table where each column is a typed array and the index gives row labels. Most DataFrame operations push down to vectorized NumPy or Arrow kernels; the slow path is .apply with a Python lambda per row, which reverts to interpreter speed and should be avoided.
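The fast-path/slow-path split can be seen in a toy DataFrame (column names here are illustrative): the column-wise expression pushes down to a vectorized kernel, while .apply with axis=1 runs a Python lambda once per row.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Fast path: column arithmetic dispatches to a vectorized kernel.
df["total_vec"] = df["price"] * df["qty"]

# Slow path: .apply(axis=1) invokes the interpreter per row.
df["total_apply"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Same values; the per-row version is the one to avoid at scale.
assert (df["total_vec"] == df["total_apply"]).all()
```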

The modern Python data stack has fragmented around the same ideas. Polars is a DataFrame library written in Rust with a lazy query optimizer and a multithreaded execution engine; it is typically 5× to 30× faster than pandas on large groupbys and joins. DuckDB is an in-process columnar SQL engine that reads pandas, Polars, Arrow, and Parquet directly with zero-copy and runs analytic SQL faster than pandas for most aggregations. Apache Arrow is the underlying columnar memory format that lets all three (pandas 2.x, Polars, DuckDB) share data without serialization.

When You'd Use It

Pandas remains the right default when the dataset fits comfortably in memory (rule of thumb: under 5-10 GB on a workstation), when the work is exploratory, and when the surrounding code is matplotlib, scikit-learn, or statsmodels — all of which expect NumPy arrays or DataFrames at their boundaries. Reach for Polars when pandas is slow on a single-node workload and the operations are groupby, join, window function, or filter-aggregate. Reach for DuckDB when the natural expression of the query is SQL and the data lives in Parquet or arrives as a DataFrame. Reach for Dask, Ray, or Spark only when the dataset does not fit on one machine.

Notable Gotchas

Watch Out

SettingWithCopyWarning is not a warning, it's an undefined-behavior alarm

A chained assignment like df[df.col > 0]['other'] = 1 first creates a filtered intermediate (which may be a view or a copy depending on the dtypes and layout), then assigns into it. The assignment may or may not propagate back to df. Pandas emits SettingWithCopyWarning, but the result is order-dependent and version-dependent. The fix is .loc[df.col > 0, 'other'] = 1, which is unambiguous. Pandas 2.0+ with copy-on-write (pd.options.mode.copy_on_write = True) makes the semantics deterministic: every getter returns a copy-on-write view, and the chained pattern never writes back to df. Copy-on-write is the default in pandas 3.0.
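The safe pattern, sketched on a small example (the column names are illustrative): a single .loc call does the filter and the write in one __setitem__ on df itself, so there is no ambiguous intermediate.

```python
import pandas as pd

df = pd.DataFrame({"col": [-1, 2, 3], "other": [0, 0, 0]})

# Ambiguous chained assignment (avoid) -- may or may not write into df,
# and under copy-on-write it is guaranteed not to:
#   df[df["col"] > 0]["other"] = 1

# Unambiguous fix: one .loc call, a single write on df itself.
df.loc[df["col"] > 0, "other"] = 1

# Only the rows where col > 0 were updated.
assert df["other"].tolist() == [0, 1, 1]
```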

Watch Out

object dtype is a Python list of pointers, not a typed array

A pandas column of strings or mixed types defaults to dtype=object, which is a NumPy array of Python object references. Operations on object dtype run at Python speed, not C speed; memory usage balloons because each string is a full PyObject. Cast to pd.StringDtype() (Arrow-backed in 2.x) or pd.CategoricalDtype for repeated values. The speedup for groupbys and joins on categorical strings is often 5× to 20×.
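The memory effect is easy to verify. A hedged sketch (object dtype is forced explicitly here, since newer pandas versions may infer a dedicated string dtype by default): the categorical cast stores each distinct string once and replaces the rows with small integer codes.

```python
import pandas as pd

# Baseline: boxed Python strings, one PyObject reference per row.
s_obj = pd.Series(["red", "green", "red", "blue"] * 10_000, dtype=object)

# Categorical: each distinct value stored once; rows become integer codes.
s_cat = s_obj.astype("category")

# The categorical column is dramatically smaller.
assert s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True)
```

The same cast is what buys the groupby/join speedup: comparing integer codes is a C-speed operation, while comparing object strings is not.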


Last reviewed: April 18, 2026
