Methodology
Exploratory Data Analysis
The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.
Why This Matters
Models learn patterns from data. If you do not know what patterns exist in your data, you cannot evaluate whether the model learned the right ones. EDA reveals data quality issues (missing values, mislabeled examples, duplicates) that will corrupt any model trained on the data. Fixing these issues through data preprocessing and feature engineering is a prerequisite to any serious modeling effort.
The most common ML failure mode: skipping EDA, training a model, getting good numbers, deploying, and discovering in production that the model learned a data artifact instead of the target concept.
Mental Model
EDA answers five questions about your dataset:
- What does each feature look like individually? (marginal distributions)
- How do features relate to each other? (correlations, interactions)
- How does each feature relate to the target? (conditional distributions)
- What is broken? (missing values, outliers, duplicates, label errors)
- Is the dataset representative? (class balance, coverage of the input space)
Summary Statistics
For each numerical feature, compute: mean, median, standard deviation, min, max, and quantiles (25th, 75th). Compare mean and median: large differences indicate skewness.
For each categorical feature, compute: number of unique values, most common value and its frequency, least common value and its frequency.
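Both passes can be sketched in a few lines of pandas. The frame below is a hypothetical toy example (column names are illustrative, not from the text):

```python
import pandas as pd

# Hypothetical toy dataset to illustrate the summary-statistics pass.
df = pd.DataFrame({
    "income": [32_000, 41_000, 38_000, 250_000, 45_000, 39_000],
    "city": ["NYC", "NYC", "LA", "LA", "NYC", "SF"],
})

# Numerical features: count, mean, std, min, quartiles, max in one call.
num_summary = df["income"].describe()
print(num_summary)

# Mean vs. median: a large gap flags skewness (here, one extreme income).
skew_gap = df["income"].mean() - df["income"].median()
print(f"mean - median = {skew_gap:.0f}")

# Categorical features: cardinality plus most/least common values.
counts = df["city"].value_counts()
print("unique values:", df["city"].nunique())
print("most common:", counts.index[0], counts.iloc[0])
print("least common:", counts.index[-1], counts.iloc[-1])
```

Note how the single 250,000 income pulls the mean far above the median: that one number already tells you the distribution is right-skewed before any plot is drawn.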
Five-Number Summary
The five-number summary of a numerical variable consists of the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The interquartile range (IQR = Q3 - Q1) measures spread while being robust to outliers, unlike the standard deviation.
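A minimal sketch of the five-number summary with NumPy, using a small array with one deliberate outlier to show why the IQR is the robust spread measure:

```python
import numpy as np

# Toy array with a single large outlier (100.0) appended.
x = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 100.0])

# Min, Q1, median, Q3, max: the five-number summary.
q1, med, q3 = np.percentile(x, [25, 50, 75])
five_num = (x.min(), q1, med, q3, x.max())
iqr = q3 - q1

print("five-number summary:", five_num)
# The outlier inflates the standard deviation but barely moves the IQR.
print("IQR:", iqr, "vs sample std:", x.std(ddof=1))
```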
Visualizations
Histograms: show the marginal distribution of a single numerical feature. Use enough bins to see the shape but not so many that noise dominates. A good default is 30-50 bins.
Box plots: show the five-number summary visually. Useful for comparing distributions across categories (e.g., feature distribution per class label).
Scatter plots: show the joint distribution of two numerical features. Look for linear relationships, clusters, and outliers.
Pair plots: scatter plots for all pairs of features. Expensive for many features (O(d^2) panels for d features), but invaluable for small feature sets.
Correlation heatmaps: show the Pearson correlation matrix for numerical features. Highlights redundant features (high correlation) and potential predictors (high correlation with target).
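The plotting itself is library-specific (matplotlib, seaborn), but the aggregations behind histograms and correlation heatmaps can be sketched with NumPy alone. This is an illustrative sketch on synthetic data, not a plotting recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Histogram with 40 bins, inside the 30-50 default range suggested above.
counts, edges = np.histogram(data, bins=40)
assert counts.sum() == data.size  # every point falls in exactly one bin

# Pearson correlation matrix for a small synthetic feature set; this is
# the matrix a heatmap would visualize (symmetric, ones on the diagonal).
features = rng.normal(size=(200, 3))
corr = np.corrcoef(features, rowvar=False)
print("correlation matrix shape:", corr.shape)
```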
Correlations
Spurious Correlation in High Dimensions
Statement
Given d features x_1, ..., x_d drawn independently from N(0, 1) and a target y independent of all features, the maximum absolute sample correlation between any feature and y over n samples satisfies:
E[ max_j |corr(x_j, y)| ] ≈ sqrt(2 ln d / n)
For d = 1000 features and n = 100 samples, the expected maximum spurious correlation is approximately sqrt(2 ln 1000 / 100) ≈ 0.37.
Intuition
With enough features, some will be correlated with the target by chance. The more features you check, the higher the maximum spurious correlation. This is the multiple comparisons problem applied to feature selection.
Proof Sketch
Under the null hypothesis of independence, each sample correlation r_j = corr(x_j, y) is approximately N(0, 1/n), so sqrt(n) · r_j is approximately standard normal. The maximum of d independent standard normal values has expectation approximately sqrt(2 ln d). Dividing by sqrt(n) gives the stated result.
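The bound is easy to check by simulation: generate independent Gaussian features and an independent target, and compare the observed maximum |correlation| against sqrt(2 ln d / n). A minimal Monte Carlo sketch (the seed and sizes are arbitrary):

```python
import numpy as np

# Spurious-correlation check: d independent N(0,1) features, target
# independent of all of them, n samples.
rng = np.random.default_rng(42)
d, n = 1000, 100

X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # target carries no signal about any feature

# Sample Pearson correlation of each feature with y, vectorized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corrs = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_spurious = np.abs(corrs).max()
predicted = np.sqrt(2 * np.log(d) / n)
print(f"observed max |r| = {max_spurious:.3f}, predicted ≈ {predicted:.3f}")
```

With no real signal anywhere, the best-looking feature still reaches a correlation near 0.37 purely by chance.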
Why It Matters
A feature with |r| = 0.37 looks predictive, but if you tested 1000 features on 100 samples, a maximum correlation this large is expected by pure chance. EDA must account for the number of comparisons. Feature importance from a trained model (which considers features jointly) is more reliable than univariate correlations.
Failure Mode
This bound assumes features are independent and Gaussian. Real features are often correlated and non-Gaussian, which can produce either higher or lower maximum spurious correlations than predicted.
Missing Values
Missing data is not random. Common patterns:
- Missing completely at random (MCAR): missingness is independent of all variables. Rare in practice.
- Missing at random (MAR): missingness depends on observed variables but not on the missing value itself.
- Missing not at random (MNAR): missingness depends on the missing value. Example: high-income individuals are less likely to report income.
How to investigate: compute the fraction of missing values per feature. Visualize missingness patterns (which features are missing together). Check if missingness correlates with the target.
Outlier Detection
An outlier is a data point far from the bulk of the distribution. The key question is: is this a measurement error (remove it), a rare but real event (keep it), or a sign of a different data-generating process (investigate)?
Common detection methods:
- IQR method: points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
- Z-score: points more than 3 standard deviations from the mean (fragile with non-Gaussian data)
- Isolation forests: model-based detection for high-dimensional data
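The IQR method (Tukey's fences) is a few lines of NumPy. A sketch on toy data, with the conventional multiplier k = 1.5:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Toy sample where 95.0 is far from the bulk of the distribution.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
mask = iqr_outliers(x)
print("flagged:", x[mask])
```

Remember that flagging a point is only step one; the section's key question (error, rare event, or different process?) still has to be answered per point.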
Class Balance
For classification: compute the frequency of each class. A 95%/5% split requires different modeling choices (class weights, oversampling, appropriate metrics) than a 50%/50% split.
Visualize class frequencies before modeling. If the minority class has fewer than 50-100 examples, consider whether you have enough data for supervised learning on that class.
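A minimal class-balance check with pandas, using a hypothetical 95%/5% label column like the one described above:

```python
import pandas as pd

# Hypothetical imbalanced binary labels: 950 negatives, 50 positives.
labels = pd.Series(["no"] * 950 + ["yes"] * 50)

# Relative class frequencies.
freq = labels.value_counts(normalize=True)
print(freq)

# Warn when the minority class is below the 50-100 example range.
minority_count = labels.value_counts().min()
if minority_count < 100:
    print(f"only {minority_count} minority examples; consider class "
          "weights, resampling, or collecting more data")
```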
Common Confusions
Correlation measures linear relationships only
Zero Pearson correlation (r = 0) does not mean the variables are independent. It means there is no linear relationship. Two variables can have r = 0 and still be perfectly dependent (e.g., Y = X^2 with X symmetric around zero). Use scatter plots, not just correlation coefficients.
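The Y = X^2 case can be verified directly: Cov(X, X^2) = E[X^3] = 0 for symmetric X, so the sample correlation hovers near zero even though Y is a deterministic function of X:

```python
import numpy as np

# X symmetric around zero; Y = X**2 is perfectly dependent on X.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.4f}")  # near zero despite perfect dependence
```

A scatter plot of (x, y) would show the parabola immediately; the correlation coefficient alone hides it.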
Dropping all rows with missing values is usually wrong
If 30% of rows have missing values, dropping them discards 30% of your data and introduces selection bias if missingness is not MCAR. Imputation (mean, median, model-based) or using models that handle missing values natively (gradient boosting) is usually better.
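A sketch of the simpler alternative: median imputation plus a missingness indicator, so no rows are dropped and the model can still learn from the missingness pattern itself. The column names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000.0, np.nan, 61_000.0, 48_000.0, np.nan]})

# Record which rows were missing BEFORE imputing, as a binary feature.
df["income_was_missing"] = df["income"].isna().astype(int)

# Median imputation: robust to skew, keeps every row.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```

The indicator column matters when missingness is MAR/MNAR: imputation alone would silently erase a signal that may itself predict the target.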
Exercises
Problem
You have a dataset with 500 features and 200 samples. You compute Pearson correlations between each feature and the binary target. The maximum absolute correlation is 0.40. Should you trust this feature as predictive?
Problem
Your dataset has a "timestamp" column and a "user_id" column. You plan to predict whether a user will make a purchase. Describe two EDA checks you should perform before modeling.
References
Canonical:
- Tukey, Exploratory Data Analysis (1977), Chapters 1-3
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14.3
Current:
- VanderPlas, Python Data Science Handbook (2023), Chapter 4
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7
Next Topics
- Train-test split and data leakage: proper data splitting after EDA
- Feature importance and interpretability: understanding what drives model predictions
Last reviewed: April 2026