
Comparison

Focal Loss vs. Cross-Entropy Loss

Cross-entropy loss treats all examples equally, weighting each by its negative log-probability. Focal loss multiplies the cross-entropy by a modulating factor that downweights well-classified (easy) examples, focusing training on hard examples. Focal loss is a strict generalization of cross-entropy (setting the focusing parameter to zero recovers cross-entropy). It is most effective for severe class imbalance where easy negatives dominate the gradient.

What Each Does

Both are classification losses. They differ in how they weight contributions from easy versus hard examples.

Cross-entropy loss for binary classification with true label $y \in \{0, 1\}$ and predicted probability $p \in [0, 1]$:

$$\text{CE}(p, y) = -y \log(p) - (1 - y)\log(1 - p)$$

Define $p_t = p$ if $y = 1$ and $p_t = 1 - p$ if $y = 0$. Then $\text{CE}(p_t) = -\log(p_t)$. Every example contributes $-\log(p_t)$ to the loss. An example classified with $p_t = 0.9$ contributes $-\log(0.9) \approx 0.105$. An example with $p_t = 0.1$ contributes $-\log(0.1) \approx 2.303$.

Focal loss (Lin et al., 2017) adds a modulating factor:

$$\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

where $\gamma \geq 0$ is the focusing parameter. The factor $(1 - p_t)^\gamma$ is close to 1 when $p_t$ is small (hard example) and close to 0 when $p_t$ is large (easy example). This downweights the contribution of well-classified examples.

At $\gamma = 0$, focal loss reduces to cross-entropy. At $\gamma = 2$ (the standard setting), an example classified with $p_t = 0.9$ has its loss reduced by a factor of $(1 - 0.9)^2 = 0.01$. Its contribution drops from 0.105 to 0.00105. An example with $p_t = 0.1$ has its loss scaled by $(1 - 0.1)^2 = 0.81$, a modest reduction from 2.303 to 1.865.
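These numbers are easy to verify directly. A minimal sketch (the function names `cross_entropy` and `focal_loss` are illustrative, not a library API):

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for a single example: -log(p_t)."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss for a single example: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Compare the two losses at easy, borderline, and hard examples.
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t}: CE={ce:.3f}, FL={fl:.5f}, ratio={fl / ce:.3f}")
```

The ratio column is exactly the modulating factor $(1 - p_t)^\gamma$: 0.01 for the easy example, 0.81 for the hard one.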

The Class Imbalance Problem

In class-imbalanced problems, the vast majority of examples belong to one class. In object detection, a single image may contain 100,000 candidate anchor boxes but only 10 objects. The ratio of negatives to positives can be 10,000:1.

With cross-entropy, each easy negative contributes a small but nonzero loss. Summed over tens of thousands of easy negatives, these small contributions dominate the total loss and its gradient. The model spends most of its gradient budget on examples it already classifies correctly rather than on the rare, informative hard examples.
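This domination effect can be simulated with a hypothetical batch (the counts and probabilities below are illustrative, chosen in the spirit of the detection example above):

```python
import math

def ce(p_t):
    return -math.log(p_t)

def fl(p_t, gamma=2.0):
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical batch: 10,000 easy negatives (p_t = 0.99)
# and 10 hard positives (p_t = 0.3).
n_easy, n_hard = 10_000, 10
p_easy, p_hard = 0.99, 0.3

for name, loss in (("CE", ce), ("FL", fl)):
    easy_total = n_easy * loss(p_easy)
    hard_total = n_hard * loss(p_hard)
    share = easy_total / (easy_total + hard_total)
    print(f"{name}: easy examples contribute {share:.1%} of the total loss")
```

Under cross-entropy the easy negatives supply the large majority of the total loss; under focal loss with $\gamma = 2$ their share collapses to a fraction of a percent.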

The class-balanced variant of cross-entropy reweights by inverse class frequency:

$$\text{CB-CE}(p_t) = -\alpha_t \log(p_t)$$

where $\alpha_t$ is the inverse frequency weight for the class. This addresses the imbalance in example count but not the imbalance between easy and hard examples. A class-balanced loss still assigns the same weight to an easy positive and a hard positive.

Focal loss addresses both: it reduces the contribution of easy examples regardless of class, and can be combined with class-balanced weights: $\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$.
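The combined form can be sketched in a few lines (`binary_focal_loss` is a hypothetical helper, not a library function; following Lin et al., $\alpha_t = \alpha$ for the positive class and $1 - \alpha$ for the negative class):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced binary focal loss for one example.

    p: predicted probability of the positive class.
    y: true label in {0, 1}.
    alpha weights the positive class; (1 - alpha) weights the negative class.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Sanity check: gamma = 0 with symmetric alpha recovers 0.5 * cross-entropy.
assert abs(binary_focal_loss(0.9, 1, gamma=0.0, alpha=0.5)
           - 0.5 * -math.log(0.9)) < 1e-12
```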

Side-by-Side Comparison

| Property | Cross-Entropy | Focal Loss |
| --- | --- | --- |
| Formula | $-\log(p_t)$ | $-(1 - p_t)^\gamma \log(p_t)$ |
| Weighting | Uniform across examples | Downweights easy examples, so hard examples dominate |
| Focusing parameter | None (equivalent to $\gamma = 0$) | $\gamma$ (typically 2) |
| Loss at $p_t = 0.9$ | 0.105 | 0.00105 ($\gamma = 2$) |
| Loss at $p_t = 0.5$ | 0.693 | 0.173 ($\gamma = 2$) |
| Loss at $p_t = 0.1$ | 2.303 | 1.865 ($\gamma = 2$) |
| Class imbalance handling | None (or via $\alpha$ weighting) | Built-in via example-difficulty weighting |
| Extra hyperparameters | None | $\gamma$ (and optionally $\alpha$) |
| Gradient w.r.t. $p_t$ | $\propto 1/p_t$ | $\propto (1 - p_t)^\gamma / p_t$ (dominant term) |
| Best setting | Balanced classes | Severe imbalance (1:100+) |
| Original application | General classification | Dense object detection (RetinaNet) |
| Computational overhead | Baseline | Negligible (one extra multiply) |

When Each Wins

Cross-entropy wins: balanced or mildly imbalanced data

For problems where classes are roughly balanced or the imbalance is moderate (up to 1:10), standard cross-entropy with no reweighting is sufficient. The gradient contribution from each class is roughly proportional to its representation, and the model sees enough hard examples from both classes. Adding focal loss in this regime provides little benefit and introduces an extra hyperparameter.

Cross-entropy wins: when hard examples are noisy

Focal loss upweights hard examples. If hard examples are primarily noisy or mislabeled rather than genuinely informative, focal loss amplifies the noise. In datasets with substantial label noise, cross-entropy's uniform weighting is more robust because it does not disproportionately trust the hardest examples.

Focal loss wins: severe imbalance in dense prediction

In object detection, semantic segmentation, and other dense prediction tasks, the imbalance between positive and negative regions is extreme. RetinaNet demonstrated that focal loss with $\gamma = 2$ and $\alpha = 0.25$ matched the performance of two-stage detectors (which handle imbalance through proposal filtering) using a simpler one-stage architecture. Without focal loss, one-stage detectors were significantly worse.

Focal loss wins: when easy examples dominate the gradient

Any setting where the majority of examples are confidently classified benefits from focal loss. This includes information retrieval (most candidates are clearly irrelevant), medical imaging (most tissue is normal), and fraud detection (most transactions are legitimate). The key diagnostic: if the loss is decreasing but driven almost entirely by easy examples, focal loss redirects optimization toward the informative boundary cases.

The Gradient Perspective

The gradient of cross-entropy with respect to the logit $z$ (where $p = \sigma(z)$), for a positive example ($y = 1$), is:

$$\frac{\partial \text{CE}}{\partial z} = p_t - 1$$

The gradient of focal loss is:

$$\frac{\partial \text{FL}}{\partial z} = (1 - p_t)^\gamma \left( \gamma \, p_t \log(p_t) + p_t - 1 \right)$$

For well-classified examples ($p_t \to 1$), the cross-entropy gradient approaches 0 linearly in $(1 - p_t)$, while the focal gradient approaches 0 as $(1 - p_t)^{\gamma + 1}$, much faster. This means the optimizer receives almost no signal from easy examples, concentrating updates on the hard cases near the decision boundary.
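For a positive example the focal gradient is $(1 - p_t)^\gamma (\gamma \, p_t \log(p_t) + p_t - 1)$ with $p_t = \sigma(z)$, which can be checked numerically. A quick finite-difference sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fl(z, gamma=2.0):
    # Focal loss for a positive example (y = 1), so p_t = sigmoid(z).
    p_t = sigmoid(z)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def fl_grad(z, gamma=2.0):
    # Analytic gradient: (1 - p_t)^gamma * (gamma * p_t * log(p_t) + p_t - 1).
    p_t = sigmoid(z)
    return (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)

# Compare against a central finite difference at a few logits.
h = 1e-6
for z in (-2.0, 0.0, 2.0):
    numeric = (fl(z + h) - fl(z - h)) / (2 * h)
    print(f"z={z}: analytic={fl_grad(z):.6f}, numeric={numeric:.6f}")
```

At large positive $z$ (an easy positive) the gradient is vanishingly small, matching the $(1 - p_t)^{\gamma + 1}$ decay described above.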

Common Confusions

Watch Out

Focal loss is not only for object detection

The original paper (Lin et al., 2017) introduced focal loss for RetinaNet, but the technique applies to any classification problem with severe imbalance. It has been successfully applied to medical imaging, NLP classification, fraud detection, and other domains. The mechanism (downweight easy examples) is task-agnostic.

Watch Out

Focal loss does not replace class balancing

Focal loss reweights by example difficulty, not by class frequency. In practice, combining focal loss ($\gamma = 2$) with class-balanced weights ($\alpha$) gives the best results. Using focal loss alone without $\alpha$ still leaves the gradient dominated by the majority class; it just focuses within each class on hard examples.

Watch Out

The focusing parameter is not always 2

$\gamma = 2$ was optimal for the RetinaNet experiments on COCO. Other datasets and tasks may benefit from different values. Lower $\gamma$ (e.g., 1) provides less focusing and is better when hard examples are noisy. Higher $\gamma$ (e.g., 5) provides more aggressive focusing and may help when the imbalance is extreme and labels are clean. Treat $\gamma$ as a hyperparameter to tune.
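A quick way to build intuition for $\gamma$ is to look at how the modulating factor $(1 - p_t)^\gamma$ scales one easy example as $\gamma$ varies; a small sketch:

```python
# Modulating factor (1 - p_t)^gamma for an easy example (p_t = 0.9)
# at several values of gamma. Larger gamma suppresses easy examples
# more aggressively; gamma = 0 recovers plain cross-entropy.
for gamma in (0.0, 0.5, 1.0, 2.0, 5.0):
    factor = (1.0 - 0.9) ** gamma
    print(f"gamma={gamma}: modulating factor = {factor:.6f}")
```

The factor falls from 1 at $\gamma = 0$ to $10^{-5}$ at $\gamma = 5$, which is why high $\gamma$ is risky when the "hard" examples it leaves emphasized are actually label noise.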

Watch Out

Focal loss does not solve all imbalance problems

Focal loss addresses the gradient domination problem from easy examples. It does not address the sampling problem: the model may still rarely see hard positives during training. For extreme imbalance (1:100,000+), focal loss should be combined with hard example mining, data augmentation, or oversampling to ensure sufficient exposure to minority class examples.

References

  1. Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection." ICCV 2017. IEEE. (Original focal loss paper, RetinaNet architecture.)
  2. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5.4 (Cross-entropy loss derivation and properties.)
  3. He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. (Context: backbone architecture used in RetinaNet experiments.)
  4. Cui, Y. et al. (2019). "Class-Balanced Loss Based on Effective Number of Samples." CVPR 2019. (Analysis of class-balanced weighting and comparison with focal loss.)
  5. Li, B. et al. (2020). "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection." NeurIPS 2020. (Extension of focal loss to quality-aware detection.)
  6. Johnson, J. M. and Khoshgoftaar, T. M. (2019). "Survey on deep learning with class imbalance." Journal of Big Data, 6(1), 1-54.