Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Data Preprocessing and Feature Engineering

Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.

Core · Tier 1 · Stable · ~45 min

Why This Matters

Raw data almost never satisfies the assumptions that ML algorithms make. Gradient-based methods assume features are on similar scales. Distance-based methods assume features contribute equally to distance. Tree methods are more robust, but still benefit from clean inputs. Skipping preprocessing is one of the most common causes of poor model performance. Preprocessing is not optional; it is part of the modeling pipeline.

Mental Model

Preprocessing transforms raw features into a form that algorithms can work with efficiently. The three main goals: (1) put features on comparable scales so no single feature dominates, (2) encode non-numeric data as numbers, and (3) handle missing values without introducing bias.

Scaling Methods

Definition

Standardization (Z-score Normalization)

Given feature values $x_1, \ldots, x_n$, standardization transforms each value to:

$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean and $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$ is the sample standard deviation. The result has mean 0 and standard deviation 1.

Definition

Min-Max Scaling

Transform feature values to the range $[0, 1]$:

$$z_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the observed minimum and maximum. Sensitive to outliers: a single extreme value compresses all other values into a narrow range.

When to use which. Standardization is the default for gradient-based methods (linear regression, logistic regression, neural networks, SVMs). Min-max scaling is used when features must be bounded (e.g., pixel values in [0, 1] for image models). Standardization is more robust to outliers because $s$ absorbs some of their effect, while min-max scaling is dominated by extremes.
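As a sketch in numpy (the feature values here are illustrative, not from any dataset in this article), the two transforms are a few lines each:

```python
import numpy as np

# Illustrative feature values (e.g., ages); statistics come from the data itself
x = np.array([18.0, 25.0, 40.0, 62.0, 90.0])

# Standardization: subtract the sample mean, divide by the sample std (ddof=1)
z_std = (x - x.mean()) / x.std(ddof=1)

# Min-max scaling: map the observed range onto [0, 1]
z_mm = (x - x.min()) / (x.max() - x.min())
```

After scaling, `z_std` has mean 0 and unit sample variance, while `z_mm` spans exactly [0, 1]; a single extreme value in `x` would stretch the min-max denominator and squeeze all other points together.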

Log Transform

For right-skewed features (income, population, word frequency), a log transform $z_i = \log(x_i + c)$ compresses the long tail and makes the distribution more symmetric. The constant $c$ (often 1) handles zeros. This is not cosmetic: many models perform better with approximately symmetric features because the gradient landscape becomes better conditioned.
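A quick numerical illustration with hypothetical income values, using `np.log1p` (i.e., $c = 1$) so the zero is handled:

```python
import numpy as np

# Hypothetical right-skewed values (e.g., incomes); log1p(x) = log(x + 1) handles zeros
x = np.array([0.0, 20_000.0, 35_000.0, 48_000.0, 500_000.0])
z = np.log1p(x)

# On the raw scale the outlier is ~14x the median; on the log scale only ~1.3x
raw_ratio = x.max() / np.median(x)
log_ratio = z.max() / np.median(z)
```

The extreme value dominates the raw range but barely stands out after the transform, which is exactly the tail compression described above.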

Encoding Categorical Variables

Definition

One-Hot Encoding

For a categorical feature with $K$ categories, create $K$ binary indicator columns:

$$\text{onehot}(x) = e_k \in \{0, 1\}^K$$

where $e_k$ is the $k$-th standard basis vector. Category "red" in $\{\text{red}, \text{green}, \text{blue}\}$ becomes $[1, 0, 0]$.

One-hot encoding with an intercept introduces only $K - 1$ degrees of freedom (the $K$ indicator columns sum to 1, so one is linearly dependent). For high-cardinality features ($K > 100$), one-hot encoding creates very sparse, high-dimensional representations. Alternatives: target encoding (replace each category with the mean of the target over that category), hashing, or learned embeddings.

Feature crosses. For two categorical features $A$ (with $K_A$ levels) and $B$ (with $K_B$ levels), the cross $A \times B$ creates $K_A \times K_B$ new binary features representing all combinations. This lets linear models capture interactions. Example: "day of week" crossed with "hour" captures that Monday-8am differs from Saturday-8am.
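A minimal hand-rolled sketch of both ideas (toy category lists, not a library API):

```python
import numpy as np

# One-hot encoding: the category-to-index mapping is fixed at fit time (training data)
categories = ["red", "green", "blue"]            # K = 3 levels
index = {c: k for k, c in enumerate(categories)}

def onehot(value):
    e = np.zeros(len(categories))
    e[index[value]] = 1.0                        # k-th standard basis vector
    return e

# Feature cross: one binary feature per pair of levels, K_A * K_B in total
days = ["Mon", "Sat"]                            # K_A = 2 (illustrative subset)
hours = ["8am", "8pm"]                           # K_B = 2
crossed_levels = [(d, h) for d in days for h in hours]  # 4 combined levels
```

`onehot("red")` yields `[1, 0, 0]`, and the cross enumerates Mon-8am, Mon-8pm, Sat-8am, Sat-8pm as four separate binary features a linear model can weight independently.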

Missing Value Imputation

Three standard approaches:

  1. Mean/median imputation. Replace missing values with the feature mean (or median for skewed data). Simple and fast. Biased: it underestimates variance and distorts correlations between features.

  2. Model-based imputation. Train a model (e.g., k-NN, random forest) to predict missing values from observed features. Preserves correlations better than mean imputation, but adds complexity and can overfit.

  3. Indicator augmentation. Add a binary "is missing" indicator feature alongside the imputed value. This lets the model learn that missingness itself carries information (data is often not missing at random).
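Approaches 1 and 3 combine naturally; a sketch with a toy training array:

```python
import numpy as np

# Mean imputation plus an "is missing" indicator, fit on training data only
x_train = np.array([1.0, np.nan, 3.0, 5.0, np.nan])

missing = np.isnan(x_train)                   # indicator feature: missingness as signal
fill = np.nanmean(x_train)                    # statistic from observed values only: 3.0
x_imputed = np.where(missing, fill, x_train)  # [1, 3, 3, 5, 3]
```

The `missing` column is kept alongside `x_imputed`, so a downstream model can learn a separate effect for rows where the value was absent.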

Feature Selection

Three categories:

Filter methods. Rank features by a univariate statistic and keep the top $k$. Common statistics: Pearson correlation with the target, mutual information $I(X_j; Y)$, or the ANOVA F-statistic. Fast, but ignores feature interactions.

Wrapper methods. Evaluate subsets of features by training and testing a model. Forward selection adds features one at a time; backward elimination removes them one at a time. Computationally expensive: exhaustive search considers $O(2^p)$ subsets for $p$ features, and even greedy forward or backward selection requires $O(p^2)$ model fits.

Embedded methods. Feature selection happens during model training. L1 regularization (Lasso) drives coefficients to zero, performing automatic feature selection. The regularization parameter $\lambda$ controls sparsity.
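A filter-method sketch on synthetic data (the coefficients are assumptions; feature 2 is pure noise by construction):

```python
import numpy as np

# Filter method: rank features by |Pearson correlation| with the target
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)  # feature 2 unused

scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(3)])
ranking = np.argsort(scores)[::-1]  # strongest univariate signal first
```

The heavily weighted feature 0 dominates the ranking. Note the stated limitation: a filter like this would also miss a feature that matters only through an interaction with another feature.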

Main Theorems

Proposition

Standardization Improves Gradient Descent Conditioning

Statement

For linear regression $\hat{y} = w^T x$ with MSE loss, the condition number of the Hessian $H = X^T X / n$ determines the convergence rate. If features are centered and uncorrelated with variances $\sigma_1^2, \ldots, \sigma_p^2$, then:

$$\kappa(H) = \frac{\sigma_{\max}^2}{\sigma_{\min}^2}$$

After standardization (all $\sigma_j = 1$), $\kappa(H) = 1$ for uncorrelated features. Gradient descent converges in one step when $\kappa = 1$, versus $O(\kappa \log(1/\epsilon))$ steps for condition number $\kappa$.

Intuition

Unstandardized features create an elongated loss landscape. The gradient points toward the minimum along the short axis but barely moves along the long axis. Standardization makes the landscape more spherical, so the gradient points directly toward the minimum.

Proof Sketch

For linear regression with MSE, the Hessian is $H = X^T X / n$. If the columns of $X$ are centered and uncorrelated with variances $\sigma_j^2$, then $H$ is diagonal with entries $\sigma_j^2$. The condition number is $\max_j \sigma_j^2 / \min_j \sigma_j^2$. After standardization, all diagonal entries are 1, so $\kappa = 1$. The convergence rate of gradient descent on a quadratic is $((\kappa - 1)/(\kappa + 1))^t$, which is zero at $\kappa = 1$.

Why It Matters

This explains the common advice to "always standardize your features." It is not a heuristic; it is a direct consequence of optimization theory. Features on different scales create ill-conditioned problems that gradient descent solves slowly or fails to solve at all.

Failure Mode

Standardization helps when features are uncorrelated. If features are highly correlated, the Hessian has small eigenvalues regardless of scaling, and standardization alone does not fix the conditioning. You also need decorrelation (e.g., PCA whitening) or regularization.
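The proposition is easy to check numerically. This sketch uses two synthetic, (nearly) uncorrelated features on very different scales, loosely modeled as ages and incomes:

```python
import numpy as np

# Condition number of the Hessian H = X^T X / n, before and after standardization
rng = np.random.default_rng(1)
n = 10_000
age = rng.uniform(18, 90, n)             # variance on the order of 1e2
income = rng.uniform(20e3, 500e3, n)     # variance on the order of 1e10
X = np.column_stack([age - age.mean(), income - income.mean()])  # centered

kappa_raw = np.linalg.cond(X.T @ X / n)  # ~ ratio of the variances: huge

Xs = X / X.std(axis=0, ddof=1)           # standardize each column
kappa_std = np.linalg.cond(Xs.T @ Xs / n)  # ~ 1 for nearly uncorrelated features
```

`kappa_raw` comes out around $10^7$ while `kappa_std` is close to 1, matching the $\sigma_{\max}^2 / \sigma_{\min}^2$ formula; with strongly correlated columns, `kappa_std` would stay large, which is the failure mode above.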

Common Confusions

Watch Out

Preprocessing must be fit on training data only

A common data leakage bug: computing the mean and standard deviation on the entire dataset (including test data) before splitting. The test set statistics leak into the training pipeline. Always fit preprocessing parameters (mean, std, min, max) on the training set only, then apply the same transformation to test data.
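A leakage-free version of the scaling step, sketched with synthetic data:

```python
import numpy as np

# Fit scaling statistics on the training split only, then reuse them everywhere
rng = np.random.default_rng(2)
data = rng.normal(loc=50.0, scale=10.0, size=100)
train, test = data[:80], data[80:]            # split BEFORE computing any statistics

mu, sigma = train.mean(), train.std(ddof=1)   # fitted on train only
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma                  # same mu and sigma applied to test
```

`test_z` will generally not have exactly zero mean, and that is correct: the test set must be transformed with the training statistics, never with its own.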

Watch Out

Tree-based models do not need feature scaling

Decision trees split on thresholds within each feature independently. The scale of a feature does not affect where the optimal split is. Random forests and gradient boosting inherit this property. However, trees still benefit from imputation and encoding of categoricals.

Watch Out

More features is not always better

Adding irrelevant features increases dimensionality without improving signal. In high dimensions, distance-based methods suffer from the curse of dimensionality (all points become equidistant). Feature selection or regularization is necessary to remove noise features.

End-to-End Preprocessing Pipeline

The order of preprocessing steps matters. A common pipeline for tabular data:

Step 1: Split first. Partition data into train/validation/test before any preprocessing. This is non-negotiable.

Step 2: Inspect and clean. On the training set only: identify outliers, check for impossible values (negative ages, dates in the future), and verify data types. Remove or cap extreme outliers. Document every cleaning decision.

Step 3: Handle missing values. On the training set: compute imputation statistics (mean, median, or fit a k-NN imputer). Apply the same imputation to validation and test sets. If missingness exceeds 50% for a feature, consider dropping it. Add binary "is missing" indicators for features where missingness may carry signal.

Step 4: Encode categoricals. For low-cardinality features ($K < 20$): one-hot encoding. For high-cardinality features ($K > 100$): target encoding (using only training set statistics to avoid leakage) or hashed encoding. For ordinal features (e.g., "low/medium/high"): integer encoding preserving the natural order.

Step 5: Transform numerics. Apply log transforms to right-skewed features. Then standardize all numeric features using training set mean and standard deviation. Apply the same transformation (same $\mu$ and $\sigma$) to validation and test sets.

Step 6: Feature engineering. Create interaction features (polynomial features, feature crosses) if the model cannot learn interactions (e.g., linear models). For time-series features, compute rolling statistics (mean, standard deviation over a window), ensuring the window only uses past data.

Step 7: Feature selection. Remove features with near-zero variance. Remove one of each pair of highly correlated features ($|r| > 0.95$). Optionally, use L1 regularization or mutual information to select the most informative features.
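The ordering above matters most for steps 1, 3, and 5, which this compressed sketch chains on one synthetic skewed feature (split sizes and missingness rate are arbitrary):

```python
import numpy as np

# Step 1: split before any preprocessing
rng = np.random.default_rng(3)
x = np.exp(rng.normal(size=120))         # right-skewed feature (lognormal)
x[rng.random(120) < 0.1] = np.nan        # inject ~10% missing values
train, test = x[:90], x[90:]

# Step 3: imputation statistic from the training split only
med = np.nanmedian(train)
train_f = np.where(np.isnan(train), med, train)
test_f = np.where(np.isnan(test), med, test)

# Step 5: log transform, then standardize with training statistics
mu = np.log1p(train_f).mean()
sd = np.log1p(train_f).std(ddof=1)
train_z = (np.log1p(train_f) - mu) / sd
test_z = (np.log1p(test_f) - mu) / sd    # reuse mu and sd on the test split
```

Every statistic (`med`, `mu`, `sd`) is computed from `train` and merely applied to `test`, which is the leakage discipline the pipeline steps insist on.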

Example

Preprocessing pipeline for house price prediction

Raw features: square footage (numeric, right-skewed), number of bedrooms (numeric, integer), neighborhood (categorical, 45 levels), year built (numeric), has pool (binary), listing description (text).

Pipeline applied to training set:

  1. Log-transform square footage: $\log(\text{sqft})$ reduces skew from 2.3 to 0.1
  2. One-hot encode neighborhood (45 binary features)
  3. Standardize numeric features (sqft, bedrooms, year) to zero mean, unit variance
  4. Impute 3% missing "year built" values with training median (1985)
  5. Add binary "year_built_missing" indicator
  6. Create interaction: $\log(\text{sqft}) \times \text{bedrooms}$ (captures that the value of extra bedrooms depends on house size)
  7. Extract TF-IDF features from listing description (top 100 terms)

Total: 45 (neighborhood) + 5 (numeric) + 1 (missing indicator) + 1 (interaction) + 100 (text) = 152 features from 6 raw features. The pipeline is fit on training data and applied identically to test data.

Key Takeaways

  • Standardization (zero mean, unit variance) is the default for gradient-based methods
  • Min-max scaling for bounded features; log transform for skewed features
  • One-hot encoding for categoricals; feature crosses for interactions
  • Impute missing values, and consider adding a "missing" indicator
  • Feature selection: filters are fast, wrappers are thorough, L1 regularization is embedded
  • Always fit preprocessing on training data only to avoid leakage
  • Preprocessing is not optional: it directly affects optimization convergence and model quality

Exercises

ExerciseCore

Problem

A dataset has two features: age (range 18-90) and income (range 20000-500000). You train a linear regression with gradient descent and find it converges slowly. Estimate the condition number of the Hessian and explain why standardization helps.

ExerciseAdvanced

Problem

You have a feature with 30% missing values. Compare the bias introduced by mean imputation versus median imputation when the feature distribution is right-skewed with a long tail. Which imputation method is more robust, and why?

References

Canonical:

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 3.4 (feature selection), Chapter 14.5 (missing data)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 1.2

Current:

  • Kuhn & Johnson, Feature Engineering and Selection (2019), Chapters 5-8
  • Zheng & Casari, Feature Engineering for Machine Learning (2018)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Last reviewed: April 2026
