Methodology
Class Imbalance and Resampling
When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.
Prerequisites
Why This Matters
Most real-world classification problems have imbalanced classes. Fraud detection: 0.1% of transactions are fraudulent. Medical diagnosis: 1% of patients have the disease. Spam filtering: 5% spam. A classifier that always predicts the majority class achieves high accuracy while being completely useless. Understanding class imbalance is necessary for building classifiers that work on the problems that matter most.
Mental Model
When class frequencies are skewed, the loss landscape is dominated by the majority class. A model learns to predict "negative" because that is correct 99% of the time. The 1% positive class contributes almost nothing to the average loss. To fix this, you either rebalance the data, reweight the loss, or change how you evaluate the model.
Formal Setup
Let π_+ and π_- be the class prior probabilities, with π_+ + π_- = 1 and π_+ ≪ π_- in the imbalanced case. A dataset of n samples contains approximately n·π_+ positive and n·π_- negative examples.
Class Imbalance Ratio
The class imbalance ratio ρ = π_-/π_+ (equivalently n_-/n_+) measures the severity of imbalance. A ratio of 100:1 means there are 100 negative examples for every positive example. Problems with ρ greater than about 10 typically require special handling.
Main Theorems
Accuracy Paradox for Imbalanced Classes
Statement
The trivial classifier h(x) = negative for all x achieves accuracy π_-. When π_- = 0.99, the trivial classifier has 99% accuracy. For any classifier f with accuracy less than π_-, the trivial classifier is preferred under accuracy, regardless of how well f detects the minority class.
Intuition
Accuracy counts all correct predictions equally. When 99% of examples are negative, getting all negatives right gives 99% accuracy even with zero recall on positives. Accuracy is not the right metric when the classes you care about are rare.
Proof Sketch
By definition, accuracy = (TP + TN)/n. For the trivial classifier: TP = 0 and TN = n·π_-, so accuracy = π_-. Any classifier with fewer than n·π_- correct predictions on n samples has lower accuracy.
Why It Matters
This is why precision, recall, and F1 exist. In imbalanced settings, you must evaluate models using metrics that account for the cost of missing the minority class.
Failure Mode
The paradox disappears when classes are balanced (π_+ ≈ π_- ≈ 0.5) or when the classification task is easy enough that even minority class examples are correctly classified by simple models.
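The paradox above can be reproduced in a few lines. This is a minimal sketch with illustrative numbers (a 99:1 synthetic label set, not data from the text):

```python
# Demonstrate the accuracy paradox: a majority-class predictor scores
# high accuracy while having zero recall on the minority class.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    return tp / pos

# 990 negatives, 10 positives: pi_minus = 0.99
y_true = [0] * 990 + [1] * 10

# Trivial classifier: always predict the majority class
y_trivial = [0] * 1000

print(accuracy(y_true, y_trivial))  # 0.99
print(recall(y_true, y_trivial))    # 0.0 -- useless on the minority class
```

Swapping accuracy for recall (or F1) immediately exposes the trivial classifier.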
Resampling Strategies
Oversampling the Minority Class
Duplicate or synthesize minority class examples to balance the training set.
Random oversampling: duplicate minority examples at random. Simple but can cause overfitting because the model sees the same examples repeatedly.
SMOTE (Synthetic Minority Over-sampling Technique): for each minority example x_i, find its k nearest minority neighbors. Create synthetic examples by interpolating between the example and a randomly chosen neighbor:
x_new = x_i + λ·(x_nn − x_i), with λ ~ Uniform(0, 1),
where x_nn is one of the k nearest minority neighbors of x_i.
SMOTE avoids exact duplication but assumes the minority class occupies convex regions of feature space. This fails when minority examples are scattered among majority examples.
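The interpolation rule above can be sketched directly. This is a minimal illustration, not the reference SMOTE implementation; the function name, the `k` parameter, and the brute-force neighbor search are ours:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by interpolation
    between a minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances from x to every other minority example
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn_idx = np.argsort(d)[:k]         # k nearest minority neighbors
        x_nn = X_min[rng.choice(nn_idx)]   # pick one neighbor at random
        lam = rng.random()                 # lambda ~ Uniform(0, 1)
        synthetic.append(x + lam * (x_nn - x))
    return np.array(synthetic)
```

Every synthetic point lies on a segment between two real minority points, which is exactly why SMOTE cannot escape the convexity caveat noted above.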
Undersampling the Majority Class
Remove majority class examples to balance the training set. Fast and reduces training time, but discards potentially useful information.
Random undersampling: remove majority examples at random. With extreme imbalance (e.g. ρ = 1000), reaching a balanced set means discarding 99.9% of your majority data.
Tomek links: a Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors; remove the majority member of each link. This cleans the decision boundary without random information loss.
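Random undersampling is simple enough to sketch in full. The function name and label convention below are illustrative, not from a particular library:

```python
import random

def random_undersample(X, y, majority_label=0, seed=0):
    """Keep all minority examples plus an equal-sized random subset
    of the majority class, then shuffle the result."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t != majority_label]
    majority = [(x, t) for x, t in zip(X, y) if t == majority_label]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    pairs = minority + kept
    rng.shuffle(pairs)
    X_bal = [x for x, _ in pairs]
    y_bal = [t for _, t in pairs]
    return X_bal, y_bal
```

At a 1000:1 ratio this keeps only one majority example per minority example, which is the 99.9% information loss mentioned above.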
Cost-Sensitive Learning
Instead of resampling, reweight the loss function to penalize minority class errors more heavily.
Cost-Sensitive Loss
Assign class weights w_+ and w_- (or equivalently w_+ = 1/π_+ and w_- = 1/π_-). The weighted empirical risk is:
R̂_w(h) = (1/n) Σ_{i=1}^{n} w_{y_i} · ℓ(h(x_i), y_i)
This makes each misclassified positive cost w_+/w_- times more than each misclassified negative.
Cost-sensitive learning is mathematically equivalent to oversampling when using the same weight ratios, but it is computationally cheaper and does not create duplicate examples.
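The reweighting can be made concrete with the common "balanced" heuristic w_c = n/(2·n_c), which is proportional to 1/π_c. The function names here are ours, and the 99:1 label set is illustrative:

```python
# Balanced class weights and the weighted 0-1 empirical risk they induce.

def balanced_class_weights(y):
    n = len(y)
    n_pos = sum(1 for t in y if t == 1)
    n_neg = n - n_pos
    # w_c = n / (2 * n_c): proportional to the inverse class prior
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_error(y_true, y_pred, weights):
    """Weighted 0-1 risk: each mistake on class y_i costs w_{y_i}."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

y = [0] * 99 + [1]
w = balanced_class_weights(y)
print(w[1] / w[0])  # 99.0: a positive mistake costs 99x a negative one
```

Under these weights the trivial all-negative classifier scores a weighted error of 0.5, the same as it would on a perfectly balanced dataset, which is the equivalence the text describes.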
Threshold Tuning
Most classifiers output a score s(x) ∈ [0, 1] and predict positive when s(x) ≥ τ. The default threshold τ = 0.5 is optimal only when classes are balanced and misclassification costs are equal.
For imbalanced data, lower τ to increase recall at the cost of precision. The optimal threshold depends on the application: in fraud detection, missing a fraud (false negative) may cost 100x more than a false alarm (false positive).
Choose τ by maximizing F1, or by selecting a point on the precision-recall curve that matches your cost structure.
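Threshold selection by F1 can be done by sweeping the observed scores as candidate cut points. A minimal sketch with made-up scores and labels:

```python
# Pick the decision threshold tau that maximizes F1 on held-out scores.

def f1_at_threshold(y_true, scores, tau):
    y_pred = [1 if s >= tau else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, scores):
    # Each distinct observed score is a candidate cut point
    return max(sorted(set(scores)),
               key=lambda tau: f1_at_threshold(y_true, scores, tau))

y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.45, 0.5, 0.6, 0.9]
tau = best_threshold(y_true, scores)
print(tau)  # 0.45, well below the default 0.5
```

In practice you would tune τ on a validation split, never on the test set, and substitute your own cost-weighted objective for F1 when the costs are known.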
Evaluation Under Imbalance
Precision-recall curves are more informative than ROC curves for imbalanced data. ROC curves can look optimistic because the denominator of the false positive rate, the total number of negatives n_-, is large, making even many false positives look like a small rate.
Average precision (AP) summarizes the precision-recall curve and is preferred over AUC-ROC for imbalanced problems.
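Average precision has a simple rank-based form: sum the precision at each rank where a positive occurs, then divide by the number of positives. A minimal sketch (illustrative, not a library API):

```python
# Average precision from a ranked score list:
# AP = (1 / n_pos) * sum over positive ranks of precision@rank.

def average_precision(y_true, scores):
    ranked = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp, ap, n_pos = 0, 0.0, sum(y_true)
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank      # precision at this recall step
    return ap / n_pos
```

Unlike AUC-ROC, this quantity drops sharply when positives are ranked below many negatives, which is exactly the failure the text warns about.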
Common Confusions
SMOTE does not create new information
SMOTE generates synthetic examples by interpolating between existing minority examples. It does not discover new regions of the feature space. If minority examples are noisy or overlap with the majority class, SMOTE amplifies this problem by generating synthetic noise.
Resampling the test set is wrong
Resampling is a training strategy. The test set must reflect the true class distribution to give valid performance estimates. Balancing the test set gives misleadingly optimistic metrics for minority class performance.
AUC-ROC can hide poor minority class performance
A model can have AUC-ROC of 0.95 while having precision of 0.05 at useful recall levels. When n_- ≫ n_+, even a small false positive rate translates to many false positives in absolute numbers. Always check precision-recall curves for imbalanced problems.
Canonical Examples
Fraud detection with 0.1% positive rate
Dataset: 1,000,000 transactions, 1,000 fraudulent. A model predicting "not fraud" for everything has 99.9% accuracy. A model with 80% recall and 5% precision catches 800 frauds but flags 16,000 transactions in total, 15,200 of them legitimate. Whether this is acceptable depends on the cost ratio: if each missed fraud costs 10,000 USD and each false alarm costs 10 USD to investigate, the net savings are approximately 7.8 million USD compared to no detection.
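The cost arithmetic behind that example, using the same figures stated in the text:

```python
# Worked cost arithmetic for the fraud-detection example above.
n_fraud = 1_000
recall, precision = 0.80, 0.05
cost_miss, cost_alarm = 10_000, 10          # USD, as assumed in the text

caught = round(recall * n_fraud)            # 800 frauds caught
flagged = round(caught / precision)         # 16,000 transactions flagged
false_alarms = flagged - caught             # 15,200 legitimate flags

cost_no_model = n_fraud * cost_miss         # every fraud is missed
cost_with_model = (n_fraud - caught) * cost_miss + false_alarms * cost_alarm
savings = cost_no_model - cost_with_model
print(savings)  # 7848000 -> approximately 7.8 million USD
```

Changing the cost ratio flips the conclusion: at 100 USD per false alarm the same model saves far less, which is why the threshold and the evaluation metric must reflect the actual costs.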
Exercises
Problem
A dataset has 10,000 examples: 9,800 negative, 200 positive. You train a classifier with 97% accuracy. Construct a confusion matrix consistent with this accuracy where the classifier has 0% recall on the positive class.
Problem
Prove that for binary classification with class priors π_+ and π_-, cost-sensitive ERM with weights w_+ = 1/π_+ and w_- = 1/π_- is equivalent to ERM on a balanced dataset created by oversampling each class to equal size.
Related Comparisons
References
Canonical:
- Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique" (2002)
- He & Garcia, "Learning from Imbalanced Data" (2009), IEEE TKDE
Current:
- Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot" (2015), PLoS ONE
- Johnson & Khoshgoftaar, "Survey on deep learning with class imbalance" (2019)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Cross-validation theory: stratified cross-validation preserves class ratios in each fold
- Hypothesis testing for ML: statistical tests that account for class imbalance
Last reviewed: April 2026