
Pandas and NumPy Fundamentals

NumPy ndarray as the foundation for vectorized scientific Python; pandas DataFrame on top. The vectorization speedup, the SettingWithCopyWarning footgun, and what changed with pandas 2.x copy-on-write and Arrow backing.


What It Is

NumPy provides the ndarray: a contiguous, typed, n-dimensional buffer with a strided layout, plus a library of element-wise and reducing operations implemented in C. A vectorized expression like a * b + c dispatches to compiled C loops that avoid per-element Python interpreter overhead. The typical vectorized speedup over a Python for loop is 10× to 100×, depending on the operation and array size.
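A minimal sketch of the gap, comparing the vectorized expression against the equivalent element-by-element Python loop on the same data (the array size here is arbitrary, chosen just to make the comparison meaningful):

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)
c = 2.0

# Vectorized: one compiled C loop over the whole buffer.
result_vec = a * b + c

# Equivalent Python loop: one interpreter round-trip per element.
result_loop = np.empty_like(a)
for i in range(len(a)):
    result_loop[i] = a[i] * b[i] + c

# Identical results; only the execution path differs.
assert np.allclose(result_vec, result_loop)
```

Timing the two with timeit on a typical workstation shows the loop version sitting in the 10× to 100× slower band the text describes.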

Pandas builds the DataFrame and Series on top of NumPy (historically) or Arrow (in pandas 2.x with dtype_backend="pyarrow"). A DataFrame is a column-major table where each column is a typed array and the index gives row labels. Most DataFrame operations push down to vectorized NumPy or Arrow kernels; the slow path is .apply with a Python lambda per row, which reverts to interpreter speed and should be avoided.
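The fast-path/slow-path split can be seen in a toy DataFrame (column names here are illustrative): the column-wise expression pushes down to a vectorized kernel, while .apply with axis=1 runs a Python lambda once per row.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Fast path: column arithmetic dispatches to a vectorized kernel.
df["total_vec"] = df["price"] * df["qty"]

# Slow path: .apply(axis=1) invokes the interpreter per row.
df["total_apply"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Same values; the per-row version is the one to avoid at scale.
assert (df["total_vec"] == df["total_apply"]).all()
```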

The modern Python data stack has fragmented around the same ideas. Polars is a DataFrame library written in Rust with a lazy query optimizer and a multithreaded execution engine; it is typically 5× to 30× faster than pandas on large groupbys and joins. DuckDB is an in-process columnar SQL engine that reads pandas, Polars, Arrow, and Parquet directly with zero-copy and runs analytic SQL faster than pandas for most aggregations. Apache Arrow is the underlying columnar memory format that lets all three (pandas 2.x, Polars, DuckDB) share data without serialization.

When You'd Use It

Pandas remains the right default when the dataset fits comfortably in memory (rule of thumb: under 5-10 GB on a workstation), when the work is exploratory, and when the surrounding code is matplotlib, scikit-learn, or statsmodels — all of which expect NumPy arrays or DataFrames at their boundaries. Reach for Polars when pandas is slow on a single-node workload and the operations are groupby, join, window function, or filter-aggregate. Reach for DuckDB when the natural expression of the query is SQL and the data lives in Parquet or arrives as a DataFrame. Reach for Dask, Ray, or Spark only when the dataset does not fit on one machine.

Notable Gotchas

Watch Out

SettingWithCopyWarning is not a warning, it's an undefined-behavior alarm

A chained assignment like df[df.col > 0]['other'] = 1 first creates a filtered intermediate (which may be a view or a copy depending on the dtypes and layout), then assigns into it. The assignment may or may not propagate back to df. Pandas emits SettingWithCopyWarning, but the result is order-dependent and version-dependent. The fix is .loc[df.col > 0, 'other'] = 1, which is unambiguous. Pandas 2.0+ with copy-on-write (pd.options.mode.copy_on_write = True) makes the semantics deterministic: every getter returns a copy-on-write view, and the chained pattern never writes back to df. Copy-on-write is the default in pandas 3.0.
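The safe pattern, sketched on a small example (the column names are illustrative): a single .loc call does the filter and the write in one __setitem__ on df itself, so there is no ambiguous intermediate.

```python
import pandas as pd

df = pd.DataFrame({"col": [-1, 2, 3], "other": [0, 0, 0]})

# Ambiguous chained assignment (avoid) -- may or may not write into df,
# and under copy-on-write it is guaranteed not to:
#   df[df["col"] > 0]["other"] = 1

# Unambiguous fix: one .loc call, a single write on df itself.
df.loc[df["col"] > 0, "other"] = 1

# Only the rows where col > 0 were updated.
assert df["other"].tolist() == [0, 1, 1]
```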

Watch Out

object dtype is a Python list of pointers, not a typed array

A pandas column of strings or mixed types defaults to dtype=object, which is a NumPy array of Python object references. Operations on object dtype run at Python speed, not C speed; memory usage balloons because each string is a full PyObject. Cast to pd.StringDtype() (Arrow-backed in 2.x) or pd.CategoricalDtype for repeated values. The speedup for groupbys and joins on categorical strings is often 5× to 20×.
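The memory effect is easy to verify. A hedged sketch (object dtype is forced explicitly here, since newer pandas versions may infer a dedicated string dtype by default): the categorical cast stores each distinct string once and replaces the rows with small integer codes.

```python
import pandas as pd

# Baseline: boxed Python strings, one PyObject reference per row.
s_obj = pd.Series(["red", "green", "red", "blue"] * 10_000, dtype=object)

# Categorical: each distinct value stored once; rows become integer codes.
s_cat = s_obj.astype("category")

# The categorical column is dramatically smaller.
assert s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True)
```

The same cast is what buys the groupby/join speedup: comparing integer codes is a C-speed operation, while comparing object strings is not.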


Last reviewed: April 18, 2026
