Data Preprocessing and Feature Engineering
Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.
Why This Matters
Raw data almost never satisfies the assumptions that ML algorithms make. Gradient-based methods assume features are on similar scales. Distance-based methods assume features contribute equally to distance. Tree methods are more robust, but still benefit from clean inputs. Skipping preprocessing is one of the most common causes of poor model performance. Preprocessing is not optional; it is part of the modeling pipeline.
Mental Model
Preprocessing transforms raw features into a form that algorithms can work with efficiently. The three main goals: (1) put features on comparable scales so no single feature dominates, (2) encode non-numeric data as numbers, and (3) handle missing values without introducing bias.
Scaling Methods
Standardization (Z-score Normalization)
Given feature values $x_1, \dots, x_n$, standardization transforms each value to:

$$z_i = \frac{x_i - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation. The result has mean 0 and standard deviation 1.
Min-Max Scaling
Transform feature values to the range $[0, 1]$:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the observed minimum and maximum. Sensitive to outliers: a single extreme value compresses all other values into a narrow range.
When to use which. Standardization is the default for gradient-based methods (linear regression, logistic regression, neural networks, SVMs). Min-max scaling is used when features must be bounded (e.g., pixel values in [0,1] for image models). Standardization is more robust to outliers because $\sigma$ absorbs some of their effect, while min-max scaling is dominated by extremes.
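A small numpy sketch (with made-up values, including one outlier) contrasting the two scalers:

```python
import numpy as np

# Hypothetical feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: z = (x - mean) / std
z = (x - x.mean()) / x.std()

# Min-max scaling: x' = (x - min) / (max - min)
m = (x - x.min()) / (x.max() - x.min())

print(z.round(2))  # spread among the small values is preserved
print(m.round(2))  # the four small values are squeezed near 0 by the outlier
```

Note how min-max maps the outlier to 1 and crushes everything else toward 0, while the standardized values keep their relative spacing.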
Log Transform
For right-skewed features (income, population, word frequency), a log transform $x \mapsto \log(x + c)$ compresses the long tail and makes the distribution more symmetric. The constant $c$ (often 1) handles zeros. This is not cosmetic: many models perform better with approximately symmetric features because the gradient landscape becomes better conditioned.
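A quick illustration with synthetic lognormal data; `np.log1p` computes $\log(x + 1)$, handling zeros safely:

```python
import numpy as np

# Synthetic right-skewed feature (e.g., incomes).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=10_000)

logged = np.log1p(income)  # log(x + 1)

def skew(a):
    """Sample skewness: third standardized moment."""
    return float(((a - a.mean()) ** 3).mean() / a.std() ** 3)

# Skewness drops from strongly positive to near zero after the transform.
print(skew(income), skew(logged))
```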
Encoding Categorical Variables
One-Hot Encoding
For a categorical feature with $K$ categories, create $K$ binary indicator columns:

$$x = k \;\mapsto\; \mathbf{e}_k \in \{0, 1\}^K$$

where $\mathbf{e}_k$ is the $k$-th standard basis vector. Category "red" in $\{\text{red}, \text{green}, \text{blue}\}$ becomes $(1, 0, 0)$.
One-hot encoding introduces only $K - 1$ degrees of freedom (the $K$-th column is linearly dependent on the others). For high-cardinality features (large $K$, e.g., thousands of categories), one-hot encoding creates very sparse, high-dimensional representations. Alternatives: target encoding (replace category with mean of target), hashing, or learned embeddings.
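A minimal hand-rolled encoder (in practice you would reach for a library such as scikit-learn's `OneHotEncoder`; this sketch just makes the basis-vector mapping explicit):

```python
import numpy as np

def one_hot(values, categories):
    """Map each value to the standard basis vector of its category."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

cats = ["red", "green", "blue"]
X = one_hot(["red", "blue", "red"], cats)
print(X)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```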
Feature crosses. For two categorical features $A$ (with $m$ levels) and $B$ (with $n$ levels), the cross $A \times B$ creates $mn$ new binary features representing all combinations. This lets linear models capture interactions. Example: "day of week" crossed with "hour" captures that Monday-8am differs from Saturday-8am.
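A sketch of how the $mn$ combination labels of a cross are enumerated (the label format is illustrative):

```python
import itertools

def cross(levels_a, levels_b):
    """All m*n combination labels for a feature cross of two categoricals."""
    return [f"{a}_x_{b}" for a, b in itertools.product(levels_a, levels_b)]

days = ["Mon", "Sat"]
hours = ["8am", "8pm"]
print(cross(days, hours))
# ['Mon_x_8am', 'Mon_x_8pm', 'Sat_x_8am', 'Sat_x_8pm']
```

Each combination label then becomes its own one-hot indicator column.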
Missing Value Imputation
Three standard approaches:
- Mean/median imputation. Replace missing values with the feature mean (or median for skewed data). Simple and fast. Biased: it underestimates variance and distorts correlations between features.
- Model-based imputation. Train a model (e.g., k-NN, random forest) to predict missing values from observed features. Preserves correlations better than mean imputation, but adds complexity and can overfit.
- Indicator augmentation. Add a binary "is missing" indicator feature alongside the imputed value. This lets the model learn that missingness itself carries information (data is often not missing at random).
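A sketch combining the first and third approaches, with the imputation statistic fit on the training column only:

```python
import numpy as np

def impute_with_indicator(train_col, test_col):
    """Median imputation plus a binary is-missing indicator.
    The median comes from the training column only (no leakage)."""
    med = np.nanmedian(train_col)

    def transform(col):
        missing = np.isnan(col).astype(float)   # 1.0 where value was absent
        filled = np.where(np.isnan(col), med, col)
        return filled, missing

    return transform(train_col), transform(test_col)

train = np.array([1.0, np.nan, 3.0, 5.0])
test = np.array([np.nan, 2.0])
(tr_filled, tr_miss), (te_filled, te_miss) = impute_with_indicator(train, test)
print(te_filled, te_miss)  # test NaN filled with the *training* median 3.0
```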
Feature Selection
Three categories:
Filter methods. Rank features by a univariate statistic and keep the top $k$. Common statistics: Pearson correlation $|\rho(x_j, y)|$ with the target, mutual information $I(x_j; y)$, or ANOVA F-statistic. Fast, but ignores feature interactions.
Wrapper methods. Evaluate subsets of features by training and testing a model. Forward selection adds features one at a time. Backward elimination removes features one at a time. Computationally expensive: $2^d$ subsets for $d$ features in the worst case.
Embedded methods. Feature selection happens during model training. L1 regularization (Lasso) drives coefficients to zero, performing automatic feature selection. The regularization parameter $\lambda$ controls sparsity.
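A minimal filter-method sketch: rank by absolute Pearson correlation with the target on synthetic data where only one feature carries signal:

```python
import numpy as np

def filter_select(X, y, k):
    """Filter method: rank features by |Pearson correlation| with the
    target and return the indices of the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.argsort(-np.abs(corr))[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)  # only feature 2 matters
print(filter_select(X, y, k=2))  # feature 2 should rank first
```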
Main Theorems
Standardization Improves Gradient Descent Conditioning
Statement
For linear regression with MSE loss, the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ of the Hessian determines the convergence rate. If features have variances $\sigma_1^2, \dots, \sigma_d^2$ and are uncorrelated, then:

$$\kappa(H) = \frac{\max_j \sigma_j^2}{\min_j \sigma_j^2}$$

After standardization (all $\sigma_j^2 = 1$), $\kappa(H) = 1$ for uncorrelated features. Gradient descent converges in one step for $\kappa = 1$, versus $O(\kappa)$ steps for condition number $\kappa$.
Intuition
Unstandardized features create an elongated loss landscape. The gradient points toward the minimum along the short axis but barely moves along the long axis. Standardization makes the landscape more spherical, so the gradient points directly toward the minimum.
Proof Sketch
For linear regression with MSE, the Hessian is $H = \frac{1}{n} X^\top X$. If columns of $X$ are uncorrelated with variances $\sigma_j^2$, then $H$ is diagonal with entries $\sigma_j^2$. The condition number is $\kappa = \max_j \sigma_j^2 / \min_j \sigma_j^2$. After standardization, all diagonal entries are 1, so $\kappa = 1$. The convergence rate of gradient descent on a quadratic is $\left( \frac{\kappa - 1}{\kappa + 1} \right)^2$ per iteration, which is zero at $\kappa = 1$.
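The effect can be checked numerically. A sketch with two uncorrelated synthetic features whose scales differ by 10x, counting gradient descent iterations before and after standardization:

```python
import numpy as np

# Two uncorrelated features, standard deviations 1 and 10 -> kappa ~ 100.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(scale=1.0, size=n),
                     rng.normal(scale=10.0, size=n)])
y = X @ np.array([2.0, -3.0])

def gd_steps(X, y, tol=1e-6, max_iter=50_000):
    """Iterations of gradient descent on MSE until the gradient is tiny."""
    H = X.T @ X / len(X)
    lr = 1.0 / np.linalg.eigvalsh(H).max()   # largest stable step size
    w = np.zeros(X.shape[1])
    for t in range(max_iter):
        grad = X.T @ (X @ w - y) / len(X)
        if np.linalg.norm(grad) < tol:
            return t
        w -= lr * grad
    return max_iter

Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize, then refit to y
print(gd_steps(X, y), gd_steps(Xs, y))     # standardized converges far faster
```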
Why It Matters
This explains the common advice to "always standardize your features." It is not a heuristic; it is a direct consequence of optimization theory. Features on different scales create ill-conditioned problems that gradient descent solves slowly or fails to solve at all.
Failure Mode
Standardization helps when features are uncorrelated. If features are highly correlated, the Hessian has small eigenvalues regardless of scaling, and standardization alone does not fix the conditioning. You also need decorrelation (e.g., PCA whitening) or regularization.
Common Confusions
Preprocessing must be fit on training data only
A common data leakage bug: computing the mean and standard deviation on the entire dataset (including test data) before splitting. The test set statistics leak into the training pipeline. Always fit preprocessing parameters (mean, std, min, max) on the training set only, then apply the same transformation to test data.
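A minimal fit/transform pattern that keeps test-set statistics out of the pipeline (the class name is illustrative):

```python
import numpy as np

class StandardizerNoLeak:
    """Fit scaling statistics on the training split only, then reuse them."""

    def fit(self, X_train):
        self.mu = X_train.mean(axis=0)
        self.sigma = X_train.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mu) / self.sigma

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, size=(100, 3))
train, test = data[:80], data[80:]      # split FIRST

scaler = StandardizerNoLeak().fit(train)  # statistics from train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)           # same mu/sigma applied to test
```

The test set's own mean and standard deviation are never computed, so no information flows from test to train.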
Tree-based models do not need feature scaling
Decision trees split on thresholds within each feature independently. The scale of a feature does not affect where the optimal split is. Random forests and gradient boosting inherit this property. However, trees still benefit from imputation and encoding of categoricals.
More features is not always better
Adding irrelevant features increases dimensionality without improving signal. In high dimensions, distance-based methods suffer from the curse of dimensionality (all points become equidistant). Feature selection or regularization is necessary to remove noise features.
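A quick numerical illustration of distance concentration with random uniform features: as dimension grows, the gap between the nearest and farthest point shrinks relative to the average distance.

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
for d in (2, 100, 10_000):
    X = rng.uniform(size=(200, d))
    # Distances from the first point to all others.
    dists = np.linalg.norm(X[0] - X[1:], axis=1)
    contrasts[d] = (dists.max() - dists.min()) / dists.mean()
    print(d, round(contrasts[d], 3))  # relative contrast shrinks with d
```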
End-to-End Preprocessing Pipeline
The order of preprocessing steps matters. A common pipeline for tabular data:
Step 1: Split first. Partition data into train/validation/test before any preprocessing. This is non-negotiable.
Step 2: Inspect and clean. On the training set only: identify outliers, check for impossible values (negative ages, dates in the future), and verify data types. Remove or cap extreme outliers. Document every cleaning decision.
Step 3: Handle missing values. On the training set: compute imputation statistics (mean, median, or fit a k-NN imputer). Apply the same imputation to validation and test sets. If missingness exceeds 50% for a feature, consider dropping it. Add binary "is missing" indicators for features where missingness may carry signal.
Step 4: Encode categoricals. For low-cardinality features (roughly $K \le 15$): one-hot encoding. For high-cardinality features ($K$ in the dozens or more): target encoding (using only training set statistics to avoid leakage) or hashed encoding. For ordinal features (e.g., "low/medium/high"): integer encoding preserving the natural order.
Step 5: Transform numerics. Apply log transforms to right-skewed features. Then standardize all numeric features using training set mean and standard deviation. Apply the same transformation (same $\mu$ and $\sigma$) to validation and test sets.
Step 6: Feature engineering. Create interaction features (polynomial features, feature crosses) if the model cannot learn interactions (e.g., linear models). For time-series features, compute rolling statistics (mean, standard deviation over a window), ensuring the window only uses past data.
Step 7: Feature selection. Remove features with near-zero variance. Remove one of each pair of highly correlated features (e.g., $|\rho| > 0.95$). Optionally, use L1 regularization or mutual information to select the most informative features.
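Step 7's correlated-pair removal can be sketched as a greedy pass (the threshold is illustrative):

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Greedily keep a feature only if it is not highly correlated
    with any feature already kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a,
                     a + 1e-3 * rng.normal(size=500),  # near-duplicate of a
                     rng.normal(size=500)])
print(drop_correlated(X))  # the near-duplicate second column is dropped
```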
Preprocessing pipeline for house price prediction
Raw features: square footage (numeric, right-skewed), number of bedrooms (numeric, integer), neighborhood (categorical, 45 levels), year built (numeric), has pool (binary), listing description (text).
Pipeline applied to training set:
- Log-transform square footage: reduces skew from 2.3 to 0.1
- One-hot encode neighborhood (45 binary features)
- Standardize numeric features (sqft, bedrooms, year) to zero mean, unit variance
- Impute 3% missing "year built" values with training median (1985)
- Add binary "year_built_missing" indicator
- Create interaction: sqft $\times$ bedrooms (captures that the value of extra bedrooms depends on house size)
- Extract TF-IDF features from listing description (top 100 terms)
Total: 45 (neighborhood) + 5 (numeric) + 1 (missing indicator) + 1 (interaction) + 100 (text) = 152 features from 6 raw features. The pipeline is fit on training data and applied identically to test data.
Key Takeaways
- Standardization (zero mean, unit variance) is the default for gradient-based methods
- Min-max scaling for bounded features; log transform for skewed features
- One-hot encoding for categoricals; feature crosses for interactions
- Impute missing values, and consider adding a "missing" indicator
- Feature selection: filters are fast, wrappers are thorough, L1 regularization is embedded
- Always fit preprocessing on training data only to avoid leakage
- Preprocessing is not optional: it directly affects optimization convergence and model quality
Exercises
Problem
A dataset has two features: age (range 18-90) and income (range 20000-500000). You train a linear regression with gradient descent and find it converges slowly. Estimate the condition number of the Hessian and explain why standardization helps.
Problem
You have a feature with 30% missing values. Compare the bias introduced by mean imputation versus median imputation when the feature distribution is right-skewed with a long tail. Which imputation method is more robust, and why?
References
Canonical:
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 3.4 (feature selection), Chapter 14.5 (missing data)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 1.2
Current:
- Kuhn & Johnson, Feature Engineering and Selection (2019), Chapters 5-8
- Zheng & Casari, Feature Engineering for Machine Learning (2018)
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in Rⁿ (Layer 0A)