Methodology

Exploratory Data Analysis

The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.


Why This Matters

Models learn patterns from data. If you do not know what patterns exist in your data, you cannot evaluate whether the model learned the right ones. EDA reveals data quality issues (missing values, mislabeled examples, duplicates) that will corrupt any model trained on the data. Fixing these issues through data preprocessing and feature engineering is a prerequisite to any serious modeling effort.

The most common ML failure mode: skipping EDA, training a model, getting good numbers, deploying, and discovering in production that the model learned a data artifact instead of the target concept.

Mental Model

EDA answers five questions about your dataset:

  1. What does each feature look like individually? (marginal distributions)
  2. How do features relate to each other? (correlations, interactions)
  3. How does each feature relate to the target? (conditional distributions)
  4. What is broken? (missing values, outliers, duplicates, label errors)
  5. Is the dataset representative? (class balance, coverage of the input space)

Summary Statistics

For each numerical feature, compute: mean, median, standard deviation, min, max, and quantiles (25th, 75th). Compare mean and median: large differences indicate skewness.

For each categorical feature, compute: number of unique values, most common value and its frequency, least common value and its frequency.
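These summaries can be computed in a few lines with pandas. The following is a sketch on a small hypothetical dataset (the `income`/`region` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical toy dataset for illustration
df = pd.DataFrame({
    "income": [32_000, 41_000, 38_000, 250_000, 45_000, 39_000],
    "region": ["north", "south", "north", "north", "west", "north"],
})

# Numerical feature: mean, median, std, min, max, quartiles
num_summary = df["income"].describe(percentiles=[0.25, 0.75])

# Large mean-median gap signals skew (here, one very high income)
skew_gap = df["income"].mean() - df["income"].median()

# Categorical feature: cardinality plus most common value and frequency
counts = df["region"].value_counts()
n_unique = df["region"].nunique()
most_common, most_freq = counts.index[0], counts.iloc[0]
```

Here `skew_gap` is large and positive because the single 250,000 income drags the mean far above the median, which is exactly the skew signal the text describes.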

Definition

Five-Number Summary

The five-number summary of a numerical variable $X$ consists of the minimum, first quartile $Q_1$, median $Q_2$, third quartile $Q_3$, and maximum. The interquartile range $\text{IQR} = Q_3 - Q_1$ measures spread while being robust to outliers, unlike the standard deviation.
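A minimal sketch of the definition with NumPy, using an invented sample that contains one extreme value to show the robustness claim:

```python
import numpy as np

# Hypothetical sample with one extreme value
x = np.array([4.0, 7.0, 2.0, 9.0, 5.0, 6.0, 3.0, 8.0, 100.0])

# Five-number summary: min, Q1, median, Q3, max
q1, q2, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, q2, q3, x.max())

# IQR is barely affected by the outlier; the standard deviation is dominated by it
iqr = q3 - q1
std = x.std(ddof=1)
```

With the outlier present, the IQR stays small while the standard deviation blows up, which is the robustness contrast the definition draws.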

Visualizations

Histograms: show the marginal distribution of a single numerical feature. Use enough bins to see the shape but not so many that noise dominates. A good default is 30-50 bins.

Box plots: show the five-number summary visually. Useful for comparing distributions across categories (e.g., feature distribution per class label).

Scatter plots: show the joint distribution of two numerical features. Look for linear relationships, clusters, and outliers.

Pair plots: scatter plots for all pairs of features. Expensive for many features ($d > 10$), but invaluable for small feature sets.

Correlation heatmaps: show the Pearson correlation matrix for numerical features. Highlights redundant features (high correlation) and potential predictors (high correlation with target).
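The matrix behind a correlation heatmap is one call in pandas. A sketch on synthetic data (the three features and their relationships are invented): `x2` is nearly redundant with `x1`, while `x3` is unrelated, so the matrix shows exactly the two patterns the text mentions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)  # nearly redundant with x1
x3 = rng.normal(size=n)                          # unrelated noise
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pearson correlation matrix; this is what a heatmap visualizes,
# e.g. via seaborn.heatmap(corr) or plt.imshow(corr)
corr = df.corr()

redundant_pair = corr.loc["x1", "x2"]   # high: candidate for dropping one feature
unrelated_pair = corr.loc["x1", "x3"]   # near zero
```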

Correlations

Proposition

Spurious Correlation in High Dimensions

Statement

Given $d$ features drawn independently from $\mathcal{N}(0,1)$ and a target $y$ independent of all features, the maximum absolute sample correlation between any feature and $y$ over $n$ samples satisfies:

$$\mathbb{E}\left[\max_{j=1,\ldots,d} |r_j|\right] \approx \sqrt{\frac{2 \log d}{n}}$$

For $d = 1000$ features and $n = 100$ samples, the expected maximum spurious correlation is approximately $\sqrt{2 \log 1000 / 100} \approx 0.37$.

Intuition

With enough features, some will be correlated with the target by chance. The more features you check, the higher the maximum spurious correlation. This is the multiple comparisons problem applied to feature selection.

Proof Sketch

Each sample correlation $r_j$ is approximately $\mathcal{N}(0, 1/n)$ under the null hypothesis of independence. The maximum of $d$ independent $|Z_i|$ values, with $Z_i \sim \mathcal{N}(0,1)$, has expectation approximately $\sqrt{2 \log d}$. Dividing by $\sqrt{n}$ gives the stated result.

Why It Matters

A feature with $r = 0.35$ looks predictive, but if you tested 1000 features, this is expected by chance. EDA must account for the number of comparisons. Feature importance from a trained model (which considers features jointly) is more reliable than univariate correlations.

Failure Mode

This bound assumes features are independent and Gaussian. Real features are often correlated and non-Gaussian, which can produce either higher or lower maximum spurious correlations than predicted.
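The proposition is easy to check empirically. A sketch under the proposition's own assumptions (independent Gaussian features, independent target, and an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 1000, 100

X = rng.normal(size=(n, d))   # d independent Gaussian features
y = rng.normal(size=n)        # target independent of every feature

# Sample correlation of each column with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_spurious = np.abs(r).max()            # empirical maximum |correlation|
predicted = np.sqrt(2 * np.log(d) / n)    # theoretical value, about 0.37
```

Despite `y` being pure noise, the best-looking feature has a correlation near 0.37, matching the bound.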

Missing Values

Missing data is rarely random. Common patterns:

  • Missing completely at random (MCAR): missingness is independent of all variables. Rare in practice.
  • Missing at random (MAR): missingness depends on observed variables but not on the missing value itself.
  • Missing not at random (MNAR): missingness depends on the missing value. Example: high-income individuals are less likely to report income.

How to investigate: compute the fraction of missing values per feature. Visualize missingness patterns (which features are missing together). Check if missingness correlates with the target.
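The three checks above can be sketched in pandas. The frame below is invented, with income deliberately missing only for buyers to make the missingness-target correlation visible:

```python
import pandas as pd

# Hypothetical frame where "income" is missing in a non-random pattern
df = pd.DataFrame({
    "income": [52_000, None, 48_000, None, 61_000, None, 45_000, 50_000],
    "age":    [34, 61, 29, 58, 41, 66, 25, 38],
    "bought": [0, 1, 0, 1, 0, 1, 0, 0],
})

# 1. Fraction of missing values per feature
miss_frac = df.isna().mean()

# 2. Missingness patterns: which features are missing together
co_missing = df.isna().astype(int).corr()

# 3. Does missingness correlate with the target?
miss_vs_target = df["income"].isna().astype(int).corr(df["bought"])
```

In this toy example `miss_vs_target` is 1.0: income is missing exactly for the rows that purchased, a red flag that dropping those rows would badly bias the sample.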

Outlier Detection

An outlier is a data point far from the bulk of the distribution. The key question is: is this a measurement error (remove it), a rare but real event (keep it), or a sign of a different data-generating process (investigate)?

Common detection methods:

  • IQR method: points below $Q_1 - 1.5 \cdot \text{IQR}$ or above $Q_3 + 1.5 \cdot \text{IQR}$
  • Z-score: points more than 3 standard deviations from the mean (fragile with non-Gaussian data)
  • Isolation forests: model-based detection for high-dimensional data
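The IQR method from the list above is a few lines of NumPy; the helper name and the sample data are ours:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

# Hypothetical sample: tightly clustered values plus one extreme point
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
mask = iqr_outliers(x)   # boolean mask; True marks candidate outliers
```

The mask only flags candidates: whether a flagged point is an error, a rare event, or a different process still requires the judgment call described above.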

Class Balance

For classification: compute the frequency of each class. A 95%/5% split requires different modeling choices (class weights, oversampling, appropriate metrics) than a 50%/50% split.

Visualize class frequencies before modeling. If the minority class has fewer than 50-100 examples, consider whether you have enough data for supervised learning on that class.
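A sketch of the class-balance check with pandas, on invented labels with the 95%/5% split mentioned above:

```python
import pandas as pd

# Hypothetical binary labels with a 95%/5% imbalance
y = pd.Series(["no"] * 190 + ["yes"] * 10)

freqs = y.value_counts(normalize=True)   # class proportions (feed to a bar plot)
minority_count = y.value_counts().min()  # raw minority-class count

imbalanced = freqs.min() < 0.10          # triggers class weights / resampling
too_few = minority_count < 50            # too little data for that class?
```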

Common Confusions

Watch Out

Correlation measures linear relationships only

Pearson correlation $r = 0$ does not mean the variables are independent. It means there is no linear relationship. Two variables can have $r = 0$ and still be perfectly dependent (e.g., $Y = X^2$ with $X$ symmetric around zero). Use scatter plots, not just correlation coefficients.
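The $Y = X^2$ example takes three lines to verify numerically (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # symmetric around zero
y = x ** 2                       # perfectly determined by x

# Pearson correlation is near zero despite total dependence
r = np.corrcoef(x, y)[0, 1]
```

A scatter plot of `x` against `y` would immediately show the parabola that the single number `r` hides.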

Watch Out

Dropping all rows with missing values is usually wrong

If 30% of rows have missing values, dropping them discards 30% of your data and introduces selection bias if missingness is not MCAR. Imputation (mean, median, model-based) or using models that handle missing values natively (gradient boosting) is usually better.
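A minimal imputation sketch with pandas, on an invented single-column frame. Adding a missingness indicator alongside the imputed value is a common pattern, since it preserves the information that the value was absent:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries
df = pd.DataFrame({"income": [40_000.0, np.nan, 55_000.0, np.nan, 48_000.0]})

# Keep a flag for the model, then fill with the median (robust to skew)
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```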

Exercises

Exercise (Core)

Problem

You have a dataset with 500 features and 200 samples. You compute Pearson correlations between each feature and the binary target. The maximum absolute correlation is 0.40. Should you trust this feature as predictive?

Exercise (Core)

Problem

Your dataset has a "timestamp" column and a "user_id" column. You plan to predict whether a user will make a purchase. Describe two EDA checks you should perform before modeling.

References

Canonical:

  • Tukey, Exploratory Data Analysis (1977), Chapters 1-3
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14.3

Current:

  • VanderPlas, Python Data Science Handbook (2023), Chapter 4
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7

Last reviewed: April 2026