Methodology

Exploratory Data Analysis

The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.


Why This Matters

Models learn patterns from data. If you do not know what patterns exist in your data, you cannot evaluate whether the model learned the right ones. EDA reveals data quality issues (missing values, mislabeled examples, duplicates) that will corrupt any model trained on the data. Fixing these issues through data preprocessing and feature engineering is a prerequisite to any serious modeling effort.

The most common ML failure mode: skipping EDA, training a model, getting good numbers, deploying, and discovering in production that the model learned a data artifact instead of the target concept.

Mental Model

EDA answers five questions about your dataset:

  1. What does each feature look like individually? (marginal distributions)
  2. How do features relate to each other? (correlations, interactions)
  3. How does each feature relate to the target? (conditional distributions)
  4. What is broken? (missing values, outliers, duplicates, label errors)
  5. Is the dataset representative? (class balance, coverage of the input space)

Summary Statistics

For each numerical feature, compute: mean, median, standard deviation, min, max, and quantiles (25th, 75th). Compare mean and median: large differences indicate skewness.

For each categorical feature, compute: number of unique values, most common value and its frequency, least common value and its frequency.
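These summaries can be computed in a few lines with pandas. The following is a sketch on a small hypothetical dataset (the `income`/`region` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical toy dataset for illustration
df = pd.DataFrame({
    "income": [32_000, 41_000, 38_000, 250_000, 45_000, 39_000],
    "region": ["north", "south", "north", "north", "west", "north"],
})

# Numerical feature: mean, median, std, min, max, quartiles
num_summary = df["income"].describe(percentiles=[0.25, 0.75])

# Large mean-median gap signals skew (here, one very high income)
skew_gap = df["income"].mean() - df["income"].median()

# Categorical feature: cardinality plus most common value and frequency
counts = df["region"].value_counts()
n_unique = df["region"].nunique()
most_common, most_freq = counts.index[0], counts.iloc[0]
```

Here `skew_gap` is large and positive because the single 250,000 income drags the mean far above the median, which is exactly the skew signal the text describes.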

Definition

Five-Number Summary

The five-number summary of a numerical variable $X$ consists of the minimum, first quartile $Q_1$, median $Q_2$, third quartile $Q_3$, and maximum. The interquartile range $\text{IQR} = Q_3 - Q_1$ measures spread while being robust to outliers, unlike the standard deviation.
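A minimal sketch of the definition with NumPy, using an invented sample that contains one extreme value to show the robustness claim:

```python
import numpy as np

# Hypothetical sample with one extreme value
x = np.array([4.0, 7.0, 2.0, 9.0, 5.0, 6.0, 3.0, 8.0, 100.0])

# Five-number summary: min, Q1, median, Q3, max
q1, q2, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, q2, q3, x.max())

# IQR is barely affected by the outlier; the standard deviation is dominated by it
iqr = q3 - q1
std = x.std(ddof=1)
```

With the outlier present, the IQR stays small while the standard deviation blows up, which is the robustness contrast the definition draws.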

Visualizations

Histograms: show the marginal distribution of a single numerical feature. Use enough bins to see the shape but not so many that noise dominates. A good default is 30-50 bins.

Box plots: show the five-number summary visually. Useful for comparing distributions across categories (e.g., feature distribution per class label).

Scatter plots: show the joint distribution of two numerical features. Look for linear relationships, clusters, and outliers.

Pair plots: scatter plots for all pairs of features. Expensive for many features ($d > 10$), but invaluable for small feature sets.

Correlation heatmaps: show the Pearson correlation matrix for numerical features. Highlights redundant features (high correlation) and potential predictors (high correlation with target).
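The matrix behind a correlation heatmap is one call in pandas. A sketch on synthetic data (the three features and their relationships are invented): `x2` is nearly redundant with `x1`, while `x3` is unrelated, so the matrix shows exactly the two patterns the text mentions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)  # nearly redundant with x1
x3 = rng.normal(size=n)                          # unrelated noise
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pearson correlation matrix; this is what a heatmap visualizes,
# e.g. via seaborn.heatmap(corr) or plt.imshow(corr)
corr = df.corr()

redundant_pair = corr.loc["x1", "x2"]   # high: candidate for dropping one feature
unrelated_pair = corr.loc["x1", "x3"]   # near zero
```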

Correlations

Proposition

Spurious Correlation in High Dimensions

Statement

Given $d$ features drawn independently from $\mathcal{N}(0,1)$ and a target $y$ independent of all features, the maximum absolute sample correlation between any feature and $y$ over $n$ samples satisfies:

$$\mathbb{E}\left[\max_{j=1,\ldots,d} |r_j|\right] \approx \sqrt{\frac{2 \log d}{n}}$$

For $d = 1000$ features and $n = 100$ samples, the expected maximum spurious correlation is approximately $\sqrt{2 \log 1000 / 100} \approx 0.37$.

Intuition

With enough features, some will be correlated with the target by chance. The more features you check, the higher the maximum spurious correlation. This is the multiple comparisons problem applied to feature selection.

Proof Sketch

Each sample correlation $r_j$ is approximately $\mathcal{N}(0, 1/n)$ under the null hypothesis of independence. The maximum of $d$ independent $|Z_i|$ values, with $Z_i \sim \mathcal{N}(0,1)$, has expectation approximately $\sqrt{2 \log d}$. Dividing by $\sqrt{n}$ gives the stated result.

Why It Matters

A feature with $r = 0.35$ looks predictive, but if you tested 1000 features, this is expected by chance. EDA must account for the number of comparisons. Feature importance from a trained model (which considers features jointly) is more reliable than univariate correlations.

Failure Mode

This bound assumes features are independent and Gaussian. Real features are often correlated and non-Gaussian, which can produce either higher or lower maximum spurious correlations than predicted.
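The proposition is easy to check empirically. A sketch under the proposition's own assumptions (independent Gaussian features, independent target, and an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 1000, 100

X = rng.normal(size=(n, d))   # d independent Gaussian features
y = rng.normal(size=n)        # target independent of every feature

# Sample correlation of each column with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_spurious = np.abs(r).max()            # empirical maximum |correlation|
predicted = np.sqrt(2 * np.log(d) / n)    # theoretical value, about 0.37
```

Despite `y` being pure noise, the best-looking feature has a correlation near 0.37, matching the bound.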

Missing Values

Missing data is rarely random. Common patterns:

  • Missing completely at random (MCAR): missingness is independent of all variables. Rare in practice.
  • Missing at random (MAR): missingness depends on observed variables but not on the missing value itself.
  • Missing not at random (MNAR): missingness depends on the missing value. Example: high-income individuals are less likely to report income.

How to investigate: compute the fraction of missing values per feature. Visualize missingness patterns (which features are missing together). Check if missingness correlates with the target.
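The three checks above can be sketched in pandas. The frame below is invented, with income deliberately missing only for buyers to make the missingness-target correlation visible:

```python
import pandas as pd

# Hypothetical frame where "income" is missing in a non-random pattern
df = pd.DataFrame({
    "income": [52_000, None, 48_000, None, 61_000, None, 45_000, 50_000],
    "age":    [34, 61, 29, 58, 41, 66, 25, 38],
    "bought": [0, 1, 0, 1, 0, 1, 0, 0],
})

# 1. Fraction of missing values per feature
miss_frac = df.isna().mean()

# 2. Missingness patterns: which features are missing together
co_missing = df.isna().astype(int).corr()

# 3. Does missingness correlate with the target?
miss_vs_target = df["income"].isna().astype(int).corr(df["bought"])
```

In this toy example `miss_vs_target` is 1.0: income is missing exactly for the rows that purchased, a red flag that dropping those rows would badly bias the sample.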

Outlier Detection

An outlier is a data point far from the bulk of the distribution. The key question is: is this a measurement error (remove it), a rare but real event (keep it), or a sign of a different data-generating process (investigate)?

Common detection methods:

  • IQR method: points below $Q_1 - 1.5 \cdot \text{IQR}$ or above $Q_3 + 1.5 \cdot \text{IQR}$
  • Z-score: points more than 3 standard deviations from the mean (fragile with non-Gaussian data)
  • Isolation forests: model-based detection for high-dimensional data
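The IQR method from the list above is a few lines of NumPy; the helper name and the sample data are ours:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

# Hypothetical sample: tightly clustered values plus one extreme point
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
mask = iqr_outliers(x)   # boolean mask; True marks candidate outliers
```

The mask only flags candidates: whether a flagged point is an error, a rare event, or a different process still requires the judgment call described above.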

Class Balance

For classification: compute the frequency of each class. A 95%/5% split requires different modeling choices (class weights, oversampling, appropriate metrics) than a 50%/50% split.

Visualize class frequencies before modeling. If the minority class has fewer than 50-100 examples, consider whether you have enough data for supervised learning on that class.
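A sketch of the class-balance check with pandas, on invented labels with the 95%/5% split mentioned above:

```python
import pandas as pd

# Hypothetical binary labels with a 95%/5% imbalance
y = pd.Series(["no"] * 190 + ["yes"] * 10)

freqs = y.value_counts(normalize=True)   # class proportions (feed to a bar plot)
minority_count = y.value_counts().min()  # raw minority-class count

imbalanced = freqs.min() < 0.10          # triggers class weights / resampling
too_few = minority_count < 50            # too little data for that class?
```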

Common Confusions

Watch Out

Correlation measures linear relationships only

Pearson correlation $r = 0$ does not mean the variables are independent. It means there is no linear relationship. Two variables can have $r = 0$ and still be perfectly dependent (e.g., $Y = X^2$ with $X$ symmetric around zero). Use scatter plots, not just correlation coefficients.
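The $Y = X^2$ example takes three lines to verify numerically (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # symmetric around zero
y = x ** 2                       # perfectly determined by x

# Pearson correlation is near zero despite total dependence
r = np.corrcoef(x, y)[0, 1]
```

A scatter plot of `x` against `y` would immediately show the parabola that the single number `r` hides.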

Watch Out

Dropping all rows with missing values is usually wrong

If 30% of rows have missing values, dropping them discards 30% of your data and introduces selection bias if missingness is not MCAR. Imputation (mean, median, model-based) or using models that handle missing values natively (gradient boosting) is usually better.
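A minimal imputation sketch with pandas, on an invented single-column frame. Adding a missingness indicator alongside the imputed value is a common pattern, since it preserves the information that the value was absent:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries
df = pd.DataFrame({"income": [40_000.0, np.nan, 55_000.0, np.nan, 48_000.0]})

# Keep a flag for the model, then fill with the median (robust to skew)
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```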

Exercises

Exercise (Core)

Problem

You have a dataset with 500 features and 200 samples. You compute Pearson correlations between each feature and the binary target. The maximum absolute correlation is 0.40. Should you trust this feature as predictive?

Exercise (Core)

Problem

Your dataset has a "timestamp" column and a "user_id" column. You plan to predict whether a user will make a purchase. Describe two EDA checks you should perform before modeling.

References

Canonical:

  • Tukey, Exploratory Data Analysis (1977), Chapters 1-3
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14.3

Current:

  • VanderPlas, Python Data Science Handbook (2023), Chapter 4
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7

Last reviewed: April 2026