
Methodology

Class Imbalance and Resampling

When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.


Why This Matters

Most real-world classification problems have imbalanced classes. Fraud detection: 0.1% of transactions are fraudulent. Medical diagnosis: 1% of patients have the disease. Spam filtering: 5% spam. A classifier that always predicts the majority class achieves high accuracy while being completely useless. Understanding class imbalance is necessary for building classifiers that work on the problems that matter most.

Mental Model

When class frequencies are skewed, the loss landscape is dominated by the majority class. A model learns to predict "negative" because that is correct 99% of the time. The 1% positive class contributes almost nothing to the average loss. To fix this, you either rebalance the data, reweight the loss, or change how you evaluate the model.

Formal Setup

Let $\pi_+ = P(Y = 1)$ and $\pi_- = P(Y = 0)$ be the class prior probabilities, with $\pi_+ \ll \pi_-$. A dataset of $n$ samples contains approximately $n\pi_+$ positive and $n\pi_-$ negative examples.

Definition

Class Imbalance Ratio

The class imbalance ratio $\rho = \pi_- / \pi_+$ measures the severity of imbalance. A ratio of 100:1 means there are 100 negative examples for every positive example. Problems with $\rho > 10$ typically require special handling.
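As a sanity check, $\rho$ can be read straight off the label vector (a minimal sketch; the example labels are illustrative):

```python
import numpy as np

# Illustrative labels: 990 negatives, 10 positives
y = np.array([0] * 990 + [1] * 10)

n_pos = int(np.sum(y == 1))
n_neg = int(np.sum(y == 0))
rho = n_neg / n_pos  # class imbalance ratio pi_- / pi_+

print(rho)  # 99.0
```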

Main Theorems

Proposition

Accuracy Paradox for Imbalanced Classes

Statement

The trivial classifier $h_{\text{trivial}}(x) = 0$ for all $x$ achieves accuracy $\pi_-$. When $\pi_- = 0.99$, the trivial classifier has 99% accuracy. For any classifier $h$ with accuracy less than $\pi_-$, the trivial classifier is preferred under accuracy, regardless of how well $h$ detects the minority class.

Intuition

Accuracy counts all correct predictions equally. When 99% of examples are negative, getting all negatives right gives 99% accuracy even with zero recall on positives. Accuracy is not the right metric when the classes you care about are rare.

Proof Sketch

By definition, accuracy $= P(h(X) = Y)$. For $h_{\text{trivial}}$: accuracy $= P(Y = 0) = \pi_-$. Any classifier with fewer than $n\pi_-$ correct predictions on $n$ samples has lower accuracy.
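The paradox is easy to reproduce numerically (a sketch with a simulated label vector; the counts are illustrative):

```python
import numpy as np

# 99% negative, 1% positive
y_true = np.array([0] * 990 + [1] * 10)
y_trivial = np.zeros_like(y_true)  # always predict the majority class

accuracy = float(np.mean(y_trivial == y_true))  # equals pi_- = 0.99
# recall on the positive class: true positives / all positives
recall = float(np.sum((y_trivial == 1) & (y_true == 1)) / np.sum(y_true == 1))

print(accuracy, recall)  # 0.99 0.0
```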

Why It Matters

This is why precision, recall, and F1 exist. In imbalanced settings, you must evaluate models using metrics that account for the cost of missing the minority class.

Failure Mode

The paradox disappears when classes are balanced ($\pi_+ \approx \pi_-$) or when the classification task is easy enough that even minority class examples are correctly classified by simple models.

Resampling Strategies

Oversampling the Minority Class

Duplicate or synthesize minority class examples to balance the training set.

Random oversampling: duplicate minority examples at random. Simple but can cause overfitting because the model sees the same examples repeatedly.

SMOTE (Synthetic Minority Over-sampling Technique): for each minority example, find its $k$ nearest minority neighbors. Create synthetic examples by interpolating between the example and a randomly chosen neighbor:

$$x_{\text{new}} = x_i + \lambda(x_j - x_i), \quad \lambda \sim \text{Uniform}(0, 1)$$

where $x_j$ is one of the $k$ nearest minority neighbors of $x_i$.

SMOTE avoids exact duplication but assumes the minority class occupies convex regions of feature space. This fails when minority examples are scattered among majority examples.
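The interpolation step above can be sketched in a few lines (a simplified version; the full algorithm of Chawla et al. also controls the oversampling amount per example):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority examples by interpolating between
    each minority example and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)              # pick a minority example x_i
        j = nn[i, rng.integers(k)]       # one of its k neighbors x_j
        lam = rng.uniform()              # lambda ~ Uniform(0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority examples, SMOTE can only fill in regions the minority class already touches.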

Undersampling the Majority Class

Remove majority class examples to balance the training set. Fast and reduces training time, but discards potentially useful information.

Random undersampling: remove majority examples at random. With extreme imbalance ($\rho = 1000$), you discard 99.9% of your majority data.

Tomek links: remove majority examples that are nearest neighbors of minority examples. This cleans the decision boundary without random information loss.
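Random undersampling is a few lines of index bookkeeping (a sketch assuming a binary 0/1 label vector; the imbalanced-learn library provides `RandomUnderSampler` and `TomekLinks` for production use):

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Drop majority examples at random until classes are balanced."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Keep all positives, and an equal-sized random subset of negatives
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]
```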

Cost-Sensitive Learning

Instead of resampling, reweight the loss function to penalize minority class errors more heavily.

Definition

Cost-Sensitive Loss

Assign class weights $w_+ = \rho$ and $w_- = 1$ (or equivalently $w_+ = n / (2 n_+)$ and $w_- = n / (2 n_-)$). The weighted empirical risk is:

$$\hat{R}_w(h) = \frac{1}{n}\sum_{i=1}^{n} w_{y_i} \cdot \ell(h(x_i), y_i)$$

This makes each misclassified positive cost $\rho$ times more than each misclassified negative.

Cost-sensitive learning is mathematically equivalent to oversampling when using the same weight ratios, but it is computationally cheaper and does not create duplicate examples.
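The weighted risk can be computed directly; with logistic loss it looks like this (a sketch of the balanced-weight case; in scikit-learn the same effect comes from `class_weight='balanced'`):

```python
import numpy as np

def weighted_log_loss(y, p, n_pos, n_neg):
    """Cost-sensitive logistic loss with 'balanced' class weights
    w_+ = n / (2 n_+) and w_- = n / (2 n_-)."""
    n = n_pos + n_neg
    w = np.where(y == 1, n / (2 * n_pos), n / (2 * n_neg))
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.mean(w * losses))
```

With these weights, the positive and negative classes contribute equally to the total loss regardless of how rare positives are.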

Threshold Tuning

Most classifiers output a score $s(x) \in [0, 1]$ and predict positive when $s(x) > \tau$. The default threshold $\tau = 0.5$ is optimal only when classes are balanced and misclassification costs are equal.

For imbalanced data, lower $\tau$ to increase recall at the cost of precision. The optimal threshold depends on the application: in fraud detection, missing a fraud (false negative) may cost 100x more than a false alarm (false positive).

Choose $\tau$ by maximizing F1, or by selecting a point on the precision-recall curve that matches your cost structure.
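Scanning candidate thresholds for the F1-maximizing $\tau$ is straightforward (a sketch; scikit-learn's `precision_recall_curve` performs the same sweep more efficiently):

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Return (tau, F1) maximizing F1, predicting positive when score > tau."""
    # Include a threshold below the minimum score so "predict all positive"
    # is among the candidates.
    taus = np.concatenate(([scores.min() - 1.0], np.unique(scores)))
    best_tau, best_f1 = 0.5, -1.0
    for tau in taus:
        pred = scores > tau
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau, best_f1
```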

Evaluation Under Imbalance

Precision-recall curves are more informative than ROC curves for imbalanced data. ROC curves can look optimistic because the false positive rate denominator ($n_-$) is large, making even many false positives look like a small rate.

Average precision (AP) summarizes the precision-recall curve and is preferred over AUC-ROC for imbalanced problems.
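The arithmetic behind this is worth seeing once (a sketch; the counts are illustrative):

```python
# 1,000,000 negatives, 1,000 positives; suppose the model flags
# 800 true positives and 10,000 false positives.
tp, fp = 800, 10_000
n_pos, n_neg = 1_000, 1_000_000

fpr = fp / n_neg             # 0.01 -> looks tiny on a ROC curve
recall = tp / n_pos          # 0.8
precision = tp / (tp + fp)   # ~0.074 -> the PR curve exposes the problem

print(fpr, recall, round(precision, 3))
```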

Common Confusions

Watch Out

SMOTE does not create new information

SMOTE generates synthetic examples by interpolating between existing minority examples. It does not discover new regions of the feature space. If minority examples are noisy or overlap with the majority class, SMOTE amplifies this problem by generating synthetic noise.

Watch Out

Resampling the test set is wrong

Resampling is a training strategy. The test set must reflect the true class distribution to give valid performance estimates. Balancing the test set gives misleadingly optimistic metrics for minority class performance.

Watch Out

AUC-ROC can hide poor minority class performance

A model can have AUC-ROC of 0.95 while having precision of 0.05 at useful recall levels. When $\pi_+ = 0.001$, even a small false positive rate translates to many false positives in absolute numbers. Always check precision-recall curves for imbalanced problems.

Canonical Examples

Example

Fraud detection with 0.1% positive rate

Dataset: 1,000,000 transactions, 1,000 fraudulent. A model predicting "not fraud" for everything has 99.9% accuracy. A model with 80% recall and 5% precision catches 800 frauds but flags 16,000 transactions in total, of which 15,200 are legitimate. Whether this is acceptable depends on the cost ratio: if each missed fraud costs 10,000 USD and each false alarm costs 10 USD to investigate, the net savings are approximately 7.8 million USD compared to no detection.
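The cost accounting in this example, spelled out (a sketch using the per-event costs assumed above):

```python
n_fraud = 1_000
cost_missed, cost_alarm = 10_000, 10  # USD per missed fraud / per false alarm

recall, precision = 0.80, 0.05
tp = round(n_fraud * recall)      # 800 frauds caught
flagged = round(tp / precision)   # 16,000 transactions flagged in total
fp = flagged - tp                 # 15,200 false alarms

baseline = n_fraud * cost_missed                              # no detection
with_model = (n_fraud - tp) * cost_missed + fp * cost_alarm   # missed + alarms
savings = baseline - with_model

print(savings)  # 7848000 -> roughly 7.8 million USD
```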

Exercises

ExerciseCore

Problem

A dataset has 10,000 examples: 9,800 negative, 200 positive. You train a classifier with 97% accuracy. Construct a confusion matrix consistent with this accuracy where the classifier has 0% recall on the positive class.

ExerciseAdvanced

Problem

Prove that for binary classification with class weights $w_+$ and $w_-$, cost-sensitive ERM with weights $w_+ = n/(2n_+)$ and $w_- = n/(2n_-)$ is equivalent to ERM on a balanced dataset created by oversampling each class to equal size.


References

Canonical:

  • Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique" (2002)
  • He & Garcia, "Learning from Imbalanced Data" (2009), IEEE TKDE

Current:

  • Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot" (2015), PLoS ONE

  • Johnson & Khoshgoftaar, "Survey on deep learning with class imbalance" (2019)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

Last reviewed: April 2026
