Methodology
Class Imbalance and Resampling
When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.
Prerequisites
Why This Matters
Most real-world classification problems have imbalanced classes. Fraud detection: 0.1% of transactions are fraudulent. Medical diagnosis: 1% of patients have the disease. Spam filtering: 5% spam. A classifier that always predicts the majority class achieves high accuracy while being completely useless. Understanding class imbalance is necessary for building classifiers that work on the problems that matter most.
Mental Model
When class frequencies are skewed, the loss landscape is dominated by the majority class. A model learns to predict "negative" because that is correct 99% of the time. The 1% positive class contributes almost nothing to the average loss. To fix this, you either rebalance the data, reweight the loss, or change how you evaluate the model.
Formal Setup
Let π_+ and π_- be the class prior probabilities, with π_+ + π_- = 1 and π_+ ≪ π_- in the imbalanced case. A dataset of n samples contains approximately n·π_+ positive and n·π_- negative examples.
Class Imbalance Ratio
The class imbalance ratio ρ = π_-/π_+ (equivalently n_-/n_+) measures the severity of imbalance. A ratio of 100:1 means there are 100 negative examples for every positive example. Problems with ρ greater than about 10 typically require special handling.
Main Theorems
Accuracy Paradox for Imbalanced Classes
Statement
The trivial classifier h(x) = negative for all x achieves accuracy π_-. When π_- = 0.99, the trivial classifier has 99% accuracy. For any classifier f with accuracy less than π_-, the trivial classifier is preferred under accuracy, regardless of how well f detects the minority class.
Intuition
Accuracy counts all correct predictions equally. When 99% of examples are negative, getting all negatives right gives 99% accuracy even with zero recall on positives. Accuracy is not the right metric when the classes you care about are rare.
Proof Sketch
By definition, accuracy = (TP + TN)/n. For the trivial classifier: TP = 0 and TN = n·π_-, so accuracy = π_-. Any classifier with fewer than n·π_- correct predictions on n samples has lower accuracy.
Why It Matters
This is why precision, recall, and F1 exist. In imbalanced settings, you must evaluate models using metrics that account for the cost of missing the minority class.
Failure Mode
The paradox disappears when classes are balanced (π_+ ≈ π_- ≈ 0.5) or when the classification task is easy enough that even minority class examples are correctly classified by simple models.
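The paradox above can be reproduced in a few lines. This is a minimal sketch with illustrative numbers (a 99:1 synthetic label set, not data from the text):

```python
# Demonstrate the accuracy paradox: a majority-class predictor scores
# high accuracy while having zero recall on the minority class.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    return tp / pos

# 990 negatives, 10 positives: pi_minus = 0.99
y_true = [0] * 990 + [1] * 10

# Trivial classifier: always predict the majority class
y_trivial = [0] * 1000

print(accuracy(y_true, y_trivial))  # 0.99
print(recall(y_true, y_trivial))    # 0.0 -- useless on the minority class
```

Swapping accuracy for recall (or F1) immediately exposes the trivial classifier.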
Resampling Strategies
Oversampling the Minority Class
Duplicate or synthesize minority class examples to balance the training set.
Random oversampling: duplicate minority examples at random. Simple but can cause overfitting because the model sees the same examples repeatedly.
SMOTE (Synthetic Minority Over-sampling Technique): for each minority example x_i, find its k nearest minority neighbors. Create synthetic examples by interpolating between the example and a randomly chosen neighbor:
x_new = x_i + λ·(x_nn − x_i), with λ ~ Uniform(0, 1),
where x_nn is one of the k nearest minority neighbors of x_i.
SMOTE avoids exact duplication but assumes the minority class occupies convex regions of feature space. This fails when minority examples are scattered among majority examples.
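The interpolation rule above can be sketched directly. This is a minimal illustration, not the reference SMOTE implementation; the function name, the `k` parameter, and the brute-force neighbor search are ours:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by interpolation
    between a minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances from x to every other minority example
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn_idx = np.argsort(d)[:k]         # k nearest minority neighbors
        x_nn = X_min[rng.choice(nn_idx)]   # pick one neighbor at random
        lam = rng.random()                 # lambda ~ Uniform(0, 1)
        synthetic.append(x + lam * (x_nn - x))
    return np.array(synthetic)
```

Every synthetic point lies on a segment between two real minority points, which is exactly why SMOTE cannot escape the convexity caveat noted above.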
Undersampling the Majority Class
Remove majority class examples to balance the training set. Fast and reduces training time, but discards potentially useful information.
Random undersampling: remove majority examples at random. With extreme imbalance (e.g. ρ = 1000), reaching a balanced set means discarding 99.9% of your majority data.
Tomek links: a Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors; remove the majority member of each link. This cleans the decision boundary without random information loss.
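Random undersampling is simple enough to sketch in full. The function name and label convention below are illustrative, not from a particular library:

```python
import random

def random_undersample(X, y, majority_label=0, seed=0):
    """Keep all minority examples plus an equal-sized random subset
    of the majority class, then shuffle the result."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t != majority_label]
    majority = [(x, t) for x, t in zip(X, y) if t == majority_label]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    pairs = minority + kept
    rng.shuffle(pairs)
    X_bal = [x for x, _ in pairs]
    y_bal = [t for _, t in pairs]
    return X_bal, y_bal
```

At a 1000:1 ratio this keeps only one majority example per minority example, which is the 99.9% information loss mentioned above.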
Cost-Sensitive Learning
Instead of resampling, reweight the loss function to penalize minority class errors more heavily.
Cost-Sensitive Loss
Assign class weights w_+ and w_- (or equivalently w_+ = 1/π_+ and w_- = 1/π_-). The weighted empirical risk is:
R̂_w(h) = (1/n) Σ_{i=1}^{n} w_{y_i} · ℓ(h(x_i), y_i)
This makes each misclassified positive cost w_+/w_- times more than each misclassified negative.
Cost-sensitive learning is mathematically equivalent to oversampling when using the same weight ratios, but it is computationally cheaper and does not create duplicate examples.
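The reweighting can be made concrete with the common "balanced" heuristic w_c = n/(2·n_c), which is proportional to 1/π_c. The function names here are ours, and the 99:1 label set is illustrative:

```python
# Balanced class weights and the weighted 0-1 empirical risk they induce.

def balanced_class_weights(y):
    n = len(y)
    n_pos = sum(1 for t in y if t == 1)
    n_neg = n - n_pos
    # w_c = n / (2 * n_c): proportional to the inverse class prior
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_error(y_true, y_pred, weights):
    """Weighted 0-1 risk: each mistake on class y_i costs w_{y_i}."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

y = [0] * 99 + [1]
w = balanced_class_weights(y)
print(w[1] / w[0])  # 99.0: a positive mistake costs 99x a negative one
```

Under these weights the trivial all-negative classifier scores a weighted error of 0.5, the same as it would on a perfectly balanced dataset, which is the equivalence the text describes.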
Threshold Tuning
Most classifiers output a score s(x) ∈ [0, 1] and predict positive when s(x) ≥ τ. The default threshold τ = 0.5 is optimal only when classes are balanced and misclassification costs are equal.
For imbalanced data, lower τ to increase recall at the cost of precision. The optimal threshold depends on the application: in fraud detection, missing a fraud (false negative) may cost 100x more than a false alarm (false positive).
Choose τ by maximizing F1, or by selecting a point on the precision-recall curve that matches your cost structure.
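Threshold selection by F1 can be done by sweeping the observed scores as candidate cut points. A minimal sketch with made-up scores and labels:

```python
# Pick the decision threshold tau that maximizes F1 on held-out scores.

def f1_at_threshold(y_true, scores, tau):
    y_pred = [1 if s >= tau else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, scores):
    # Each distinct observed score is a candidate cut point
    return max(sorted(set(scores)),
               key=lambda tau: f1_at_threshold(y_true, scores, tau))

y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.45, 0.5, 0.6, 0.9]
tau = best_threshold(y_true, scores)
print(tau)  # 0.45, well below the default 0.5
```

In practice you would tune τ on a validation split, never on the test set, and substitute your own cost-weighted objective for F1 when the costs are known.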
Evaluation Under Imbalance
Precision-recall curves are more informative than ROC curves for imbalanced data. ROC curves can look optimistic because the denominator of the false positive rate, the total number of negatives n_-, is large, making even many false positives look like a small rate.
Average precision (AP) summarizes the precision-recall curve and is preferred over AUC-ROC for imbalanced problems.
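Average precision has a simple rank-based form: sum the precision at each rank where a positive occurs, then divide by the number of positives. A minimal sketch (illustrative, not a library API):

```python
# Average precision from a ranked score list:
# AP = (1 / n_pos) * sum over positive ranks of precision@rank.

def average_precision(y_true, scores):
    ranked = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp, ap, n_pos = 0, 0.0, sum(y_true)
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank      # precision at this recall step
    return ap / n_pos
```

Unlike AUC-ROC, this quantity drops sharply when positives are ranked below many negatives, which is exactly the failure the text warns about.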
Common Confusions
SMOTE does not create new information
SMOTE generates synthetic examples by interpolating between existing minority examples. It does not discover new regions of the feature space. If minority examples are noisy or overlap with the majority class, SMOTE amplifies this problem by generating synthetic noise.
Resampling the test set is wrong
Resampling is a training strategy. The test set must reflect the true class distribution to give valid performance estimates. Balancing the test set gives misleadingly optimistic metrics for minority class performance.
AUC-ROC can hide poor minority class performance
A model can have AUC-ROC of 0.95 while having precision of 0.05 at useful recall levels. When n_- ≫ n_+, even a small false positive rate translates to many false positives in absolute numbers. Always check precision-recall curves for imbalanced problems.
Canonical Examples
Fraud detection with 0.1% positive rate
Dataset: 1,000,000 transactions, 1,000 fraudulent. A model predicting "not fraud" for everything has 99.9% accuracy. A model with 80% recall and 5% precision catches 800 frauds but flags 16,000 transactions in total, 15,200 of them legitimate. Whether this is acceptable depends on the cost ratio: if each missed fraud costs 10,000 USD and each false alarm costs 10 USD to investigate, the net savings are approximately 7.8 million USD compared to no detection.
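The cost arithmetic behind that example, using the same figures stated in the text:

```python
# Worked cost arithmetic for the fraud-detection example above.
n_fraud = 1_000
recall, precision = 0.80, 0.05
cost_miss, cost_alarm = 10_000, 10          # USD, as assumed in the text

caught = round(recall * n_fraud)            # 800 frauds caught
flagged = round(caught / precision)         # 16,000 transactions flagged
false_alarms = flagged - caught             # 15,200 legitimate flags

cost_no_model = n_fraud * cost_miss         # every fraud is missed
cost_with_model = (n_fraud - caught) * cost_miss + false_alarms * cost_alarm
savings = cost_no_model - cost_with_model
print(savings)  # 7848000 -> approximately 7.8 million USD
```

Changing the cost ratio flips the conclusion: at 100 USD per false alarm the same model saves far less, which is why the threshold and the evaluation metric must reflect the actual costs.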
Exercises
Problem
A dataset has 10,000 examples: 9,800 negative, 200 positive. You train a classifier with 97% accuracy. Construct a confusion matrix consistent with this accuracy where the classifier has 0% recall on the positive class.
Problem
Prove that for binary classification with class priors π_+ and π_-, cost-sensitive ERM with weights w_+ = 1/π_+ and w_- = 1/π_- is equivalent to ERM on a balanced dataset created by oversampling each class to equal size.
Related Comparisons
References
Canonical:
- Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique" (2002)
- He & Garcia, "Learning from Imbalanced Data" (2009), IEEE TKDE
Current:
- Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot" (2015), PLoS ONE
- Johnson & Khoshgoftaar, "Survey on deep learning with class imbalance" (2019)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Cross-validation theory: stratified cross-validation preserves class ratios in each fold
- Hypothesis testing for ML: statistical tests that account for class imbalance
Last reviewed: April 2026