
Comparison

Focal Loss vs. Cross-Entropy Loss

Cross-entropy loss treats all examples equally, weighting each by its negative log-probability. Focal loss multiplies the cross-entropy by a modulating factor that downweights well-classified (easy) examples, focusing training on hard examples. Focal loss is a strict generalization of cross-entropy (setting the focusing parameter to zero recovers cross-entropy). It is most effective for severe class imbalance where easy negatives dominate the gradient.

What Each Does

Both are classification losses. They differ in how they weight contributions from easy versus hard examples.

Cross-entropy loss for binary classification with true label $y \in \{0, 1\}$ and predicted probability $p \in [0, 1]$:

$$\text{CE}(p, y) = -y \log(p) - (1 - y)\log(1 - p)$$

Define $p_t = p$ if $y = 1$ and $p_t = 1 - p$ if $y = 0$. Then $\text{CE}(p_t) = -\log(p_t)$. Every example contributes $-\log(p_t)$ to the loss. An example classified with $p_t = 0.9$ contributes $-\log(0.9) \approx 0.105$. An example with $p_t = 0.1$ contributes $-\log(0.1) \approx 2.303$.

Focal loss (Lin et al., 2017) adds a modulating factor:

$$\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

where $\gamma \geq 0$ is the focusing parameter. The factor $(1 - p_t)^\gamma$ is close to 1 when $p_t$ is small (hard example) and close to 0 when $p_t$ is large (easy example). This downweights the contribution of well-classified examples.

At $\gamma = 0$, focal loss reduces to cross-entropy. At $\gamma = 2$ (the standard setting), an example classified with $p_t = 0.9$ has its loss reduced by a factor of $(1 - 0.9)^2 = 0.01$. Its contribution drops from 0.105 to 0.00105. An example with $p_t = 0.1$ has its loss scaled by $(1 - 0.1)^2 = 0.81$, a modest reduction from 2.303 to 1.865.
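These numbers are easy to verify directly. A minimal sketch (the function names `cross_entropy` and `focal_loss` are illustrative, not a library API):

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for a single example: -log(p_t)."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss for a single example: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Compare the two losses at easy, borderline, and hard examples.
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t}: CE={ce:.3f}, FL={fl:.5f}, ratio={fl / ce:.3f}")
```

The ratio column is exactly the modulating factor $(1 - p_t)^\gamma$: 0.01 for the easy example, 0.81 for the hard one.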

The Class Imbalance Problem

In class-imbalanced problems, the vast majority of examples belong to one class. In object detection, a single image may contain 100,000 candidate anchor boxes but only 10 objects. The ratio of negatives to positives can be 10,000:1.

With cross-entropy, each easy negative contributes a small but nonzero loss. Summed over tens of thousands of easy negatives, these small contributions dominate the total loss and its gradient. The model spends most of its gradient budget on examples it already classifies correctly rather than on the rare, informative hard examples.
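This domination effect can be simulated with a hypothetical batch (the counts and probabilities below are illustrative, chosen in the spirit of the detection example above):

```python
import math

def ce(p_t):
    return -math.log(p_t)

def fl(p_t, gamma=2.0):
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical batch: 10,000 easy negatives (p_t = 0.99)
# and 10 hard positives (p_t = 0.3).
n_easy, n_hard = 10_000, 10
p_easy, p_hard = 0.99, 0.3

for name, loss in (("CE", ce), ("FL", fl)):
    easy_total = n_easy * loss(p_easy)
    hard_total = n_hard * loss(p_hard)
    share = easy_total / (easy_total + hard_total)
    print(f"{name}: easy examples contribute {share:.1%} of the total loss")
```

Under cross-entropy the easy negatives supply the large majority of the total loss; under focal loss with $\gamma = 2$ their share collapses to a fraction of a percent.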

The class-balanced variant of cross-entropy reweights by inverse class frequency:

$$\text{CB-CE}(p_t) = -\alpha_t \log(p_t)$$

where $\alpha_t$ is the inverse frequency weight for the class. This addresses the imbalance in example count but not the imbalance between easy and hard examples. A class-balanced loss still assigns the same weight to an easy positive and a hard positive.

Focal loss addresses both: it reduces the contribution of easy examples regardless of class, and can be combined with class-balanced weights: $\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$.
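The combined form can be sketched in a few lines (`binary_focal_loss` is a hypothetical helper, not a library function; following Lin et al., $\alpha_t = \alpha$ for the positive class and $1 - \alpha$ for the negative class):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced binary focal loss for one example.

    p: predicted probability of the positive class.
    y: true label in {0, 1}.
    alpha weights the positive class; (1 - alpha) weights the negative class.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Sanity check: gamma = 0 with symmetric alpha recovers 0.5 * cross-entropy.
assert abs(binary_focal_loss(0.9, 1, gamma=0.0, alpha=0.5)
           - 0.5 * -math.log(0.9)) < 1e-12
```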

Side-by-Side Comparison

| Property | Cross-Entropy | Focal Loss |
| --- | --- | --- |
| Formula | $-\log(p_t)$ | $-(1 - p_t)^\gamma \log(p_t)$ |
| Weighting | Uniform across examples | Downweights easy examples, so hard examples dominate |
| Focusing parameter | None (equivalent to $\gamma = 0$) | $\gamma$ (typically 2) |
| Loss at $p_t = 0.9$ | 0.105 | 0.00105 ($\gamma = 2$) |
| Loss at $p_t = 0.5$ | 0.693 | 0.173 ($\gamma = 2$) |
| Loss at $p_t = 0.1$ | 2.303 | 1.865 ($\gamma = 2$) |
| Class imbalance handling | None (or via $\alpha$ weighting) | Built-in via example-difficulty weighting |
| Extra hyperparameters | None | $\gamma$ (and optionally $\alpha$) |
| Gradient w.r.t. $p_t$ | $\propto 1/p_t$ | $\propto (1 - p_t)^\gamma / p_t$ (dominant term) |
| Best setting | Balanced classes | Severe imbalance (1:100+) |
| Original application | General classification | Dense object detection (RetinaNet) |
| Computational overhead | Baseline | Negligible (one extra multiply) |

When Each Wins

Cross-entropy wins: balanced or mildly imbalanced data

For problems where classes are roughly balanced or the imbalance is moderate (up to 1:10), standard cross-entropy with no reweighting is sufficient. The gradient contribution from each class is roughly proportional to its representation, and the model sees enough hard examples from both classes. Adding focal loss in this regime provides little benefit and introduces an extra hyperparameter.

Cross-entropy wins: when hard examples are noisy

Focal loss upweights hard examples. If hard examples are primarily noisy or mislabeled rather than genuinely informative, focal loss amplifies the noise. In datasets with substantial label noise, cross-entropy's uniform weighting is more robust because it does not disproportionately trust the hardest examples.

Focal loss wins: severe imbalance in dense prediction

In object detection, semantic segmentation, and other dense prediction tasks, the imbalance between positive and negative regions is extreme. RetinaNet demonstrated that focal loss with $\gamma = 2$ and $\alpha = 0.25$ matched the performance of two-stage detectors (which handle imbalance through proposal filtering) using a simpler one-stage architecture. Without focal loss, one-stage detectors were significantly worse.

Focal loss wins: when easy examples dominate the gradient

Any setting where the majority of examples are confidently classified benefits from focal loss. This includes information retrieval (most candidates are clearly irrelevant), medical imaging (most tissue is normal), and fraud detection (most transactions are legitimate). The key diagnostic: if the loss is decreasing but driven almost entirely by easy examples, focal loss redirects optimization toward the informative boundary cases.

The Gradient Perspective

The gradient of cross-entropy with respect to the logit $z$ (where $p = \sigma(z)$), for a positive example ($y = 1$), is:

$$\frac{\partial \text{CE}}{\partial z} = p_t - 1$$

The gradient of focal loss is:

$$\frac{\partial \text{FL}}{\partial z} = (1 - p_t)^\gamma \left( \gamma \, p_t \log(p_t) + p_t - 1 \right)$$

For well-classified examples ($p_t \to 1$), the cross-entropy gradient approaches 0 linearly in $(1 - p_t)$, while the focal gradient approaches 0 as $(1 - p_t)^{\gamma + 1}$, much faster. This means the optimizer receives almost no signal from easy examples, concentrating updates on the hard cases near the decision boundary.
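For a positive example the focal gradient is $(1 - p_t)^\gamma (\gamma \, p_t \log(p_t) + p_t - 1)$ with $p_t = \sigma(z)$, which can be checked numerically. A quick finite-difference sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fl(z, gamma=2.0):
    # Focal loss for a positive example (y = 1), so p_t = sigmoid(z).
    p_t = sigmoid(z)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def fl_grad(z, gamma=2.0):
    # Analytic gradient: (1 - p_t)^gamma * (gamma * p_t * log(p_t) + p_t - 1).
    p_t = sigmoid(z)
    return (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)

# Compare against a central finite difference at a few logits.
h = 1e-6
for z in (-2.0, 0.0, 2.0):
    numeric = (fl(z + h) - fl(z - h)) / (2 * h)
    print(f"z={z}: analytic={fl_grad(z):.6f}, numeric={numeric:.6f}")
```

At large positive $z$ (an easy positive) the gradient is vanishingly small, matching the $(1 - p_t)^{\gamma + 1}$ decay described above.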

Common Confusions

Watch Out

Focal loss is not only for object detection

The original paper (Lin et al., 2017) introduced focal loss for RetinaNet, but the technique applies to any classification problem with severe imbalance. It has been successfully applied to medical imaging, NLP classification, fraud detection, and other domains. The mechanism (downweight easy examples) is task-agnostic.

Watch Out

Focal loss does not replace class balancing

Focal loss reweights by example difficulty, not by class frequency. In practice, combining focal loss ($\gamma = 2$) with class-balanced weights ($\alpha$) gives the best results. Using focal loss alone without $\alpha$ still leaves the gradient dominated by the majority class; it just focuses within each class on hard examples.

Watch Out

The focusing parameter is not always 2

$\gamma = 2$ was optimal for the RetinaNet experiments on COCO. Other datasets and tasks may benefit from different values. Lower $\gamma$ (e.g., 1) provides less focusing and is better when hard examples are noisy. Higher $\gamma$ (e.g., 5) provides more aggressive focusing and may help when the imbalance is extreme and labels are clean. Treat $\gamma$ as a hyperparameter to tune.
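A quick way to build intuition for $\gamma$ is to look at how the modulating factor $(1 - p_t)^\gamma$ scales one easy example as $\gamma$ varies; a small sketch:

```python
# Modulating factor (1 - p_t)^gamma for an easy example (p_t = 0.9)
# at several values of gamma. Larger gamma suppresses easy examples
# more aggressively; gamma = 0 recovers plain cross-entropy.
for gamma in (0.0, 0.5, 1.0, 2.0, 5.0):
    factor = (1.0 - 0.9) ** gamma
    print(f"gamma={gamma}: modulating factor = {factor:.6f}")
```

The factor falls from 1 at $\gamma = 0$ to $10^{-5}$ at $\gamma = 5$, which is why high $\gamma$ is risky when the "hard" examples it leaves emphasized are actually label noise.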

Watch Out

Focal loss does not solve all imbalance problems

Focal loss addresses the gradient domination problem from easy examples. It does not address the sampling problem: the model may still rarely see hard positives during training. For extreme imbalance (1:100,000+), focal loss should be combined with hard example mining, data augmentation, or oversampling to ensure sufficient exposure to minority class examples.

References

  1. Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection." ICCV 2017. IEEE. (Original focal loss paper, RetinaNet architecture.)
  2. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5.4 (Cross-entropy loss derivation and properties.)
  3. He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. (Context: backbone architecture used in RetinaNet experiments.)
  4. Cui, Y. et al. (2019). "Class-Balanced Loss Based on Effective Number of Samples." CVPR 2019. (Analysis of class-balanced weighting and comparison with focal loss.)
  5. Li, B. et al. (2020). "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection." NeurIPS 2020. (Extension of focal loss to quality-aware detection.)
  6. Johnson, J. M. and Khoshgoftaar, T. M. (2019). "Survey on deep learning with class imbalance." Journal of Big Data, 6(1), 1-54.