ML Methods
Loss Functions Catalog
A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.
Why This Matters
The loss function defines what "good" means for your model. Two models with identical architectures trained with different loss functions will learn different things. In many practical settings, switching the loss function improves performance more than switching the architecture. The choice of loss encodes your assumptions about the problem: noise distribution, outlier sensitivity, class balance, and what errors cost. See cross-entropy loss deep dive for a detailed treatment of the most common classification loss.
Mental Model
A loss function measures the cost of predicting $\hat{y}$ when the truth is $y$. Different losses penalize different types of errors. MSE penalizes large errors quadratically, making it the natural choice for linear regression under Gaussian noise. MAE penalizes all errors linearly (robust to outliers). Cross-entropy penalizes confident wrong classification predictions severely. The right loss depends on what errors matter in your application.
Classification Losses
Cross-Entropy Loss
For a classification problem with $K$ classes, the cross-entropy loss for a single example with true label $y$ (one-hot encoded) and predicted probabilities $\hat{p}$ is:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$

For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p} = P(y = 1)$:

$$\mathcal{L} = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right]$$
Cross-entropy has a critical property: as $\hat{p}_c \to 0$ for the true class $c$, the loss goes to infinity. This severe penalty for confident wrong predictions drives the model to assign high probability to the correct class.
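As a minimal sketch (plain Python; the helper name `cross_entropy` is ours, not from any library), the single-example loss is just the negative log-probability of the true class:

```python
import math

def cross_entropy(true_class: int, probs: list[float]) -> float:
    """Cross-entropy for one example: -log of the predicted
    probability assigned to the true class."""
    return -math.log(probs[true_class])

# Confident and correct -> small loss; confident and wrong -> large loss.
loss_good = cross_entropy(2, [0.1, 0.2, 0.7])  # -ln(0.7) ~ 0.357
loss_bad = cross_entropy(2, [0.7, 0.2, 0.1])   # -ln(0.1) ~ 2.303
```

Note how the wrong-but-confident prediction costs roughly 6x more, even though both distributions place 0.7 on a single class.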
Hinge Loss
For binary classification with $y \in \{-1, +1\}$ and raw prediction $f(x)$:

$$\mathcal{L}_{\text{hinge}} = \max(0,\, 1 - y f(x))$$

The loss is zero when $y f(x) \ge 1$ (correct prediction with margin at least 1). This is the loss used by support vector machines.

Hinge loss does not require probability outputs and is not differentiable at $y f(x) = 1$. In practice, subgradient methods handle the non-differentiability.
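A quick sketch of the three regimes (the helper name `hinge_loss` is illustrative):

```python
def hinge_loss(y: int, score: float) -> float:
    """Hinge loss for y in {-1, +1} and a raw (unbounded) score f(x)."""
    return max(0.0, 1.0 - y * score)

a = hinge_loss(+1, 2.5)  # 0.0 -- correct with margin >= 1
b = hinge_loss(+1, 0.3)  # 0.7 -- correct side, but inside the margin
c = hinge_loss(-1, 0.3)  # 1.3 -- wrong side of the boundary
```

Unlike cross-entropy, a correctly classified point with enough margin contributes exactly zero loss, which is what produces sparse support vectors.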
Focal Loss
For binary classification with true class probability $p_t$ (the model's predicted probability for the true class):

$$\mathcal{L}_{\text{focal}} = -(1 - p_t)^{\gamma} \log p_t$$

where $\gamma \ge 0$ is a focusing parameter. When $\gamma = 0$, this reduces to cross-entropy.

Focal loss down-weights easy examples (where $p_t$ is high). For $\gamma = 2$, an example with $p_t = 0.9$ gets weight $(1 - 0.9)^2 = 0.01$, while an example with $p_t = 0.5$ gets weight $0.25$. This concentrates learning on hard examples, which is critical for class-imbalanced problems like object detection where 99%+ of candidates are background.
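The down-weighting above can be sketched directly (helper name `focal_loss` assumed, not a library function):

```python
import math

def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss: cross-entropy down-weighted by (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.9, gamma=2.0)   # easy example: weight 0.01, tiny loss
hard = focal_loss(0.5, gamma=2.0)   # hard example: weight 0.25, much larger
plain = focal_loss(0.9, gamma=0.0)  # gamma = 0 recovers plain cross-entropy
```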
Regression Losses
Mean Squared Error
For a regression problem with predictions $\hat{y}_i$ and targets $y_i$ over $n$ examples:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

MSE is the maximum likelihood estimator under a Gaussian noise model: $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Huber Loss
For a threshold $\delta > 0$:

$$\mathcal{L}_{\delta}(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta \\ \delta |r| - \frac{1}{2} \delta^2 & |r| > \delta \end{cases}$$

where $r = y - \hat{y}$ is the residual. Huber loss is quadratic for small errors and linear for large errors.

Huber loss combines the benefits of MSE (smooth, efficient for Gaussian errors) and MAE (robust to outliers). The parameter $\delta$ controls the transition. When $\delta$ is large, Huber approaches MSE. When $\delta$ is small, it approaches MAE.
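A minimal sketch of the two regimes, contrasted with MSE (helper names ours):

```python
def huber(r: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta

def mse(r: float) -> float:
    return r * r

# An outlier residual dominates MSE but grows only linearly under Huber.
small = huber(0.5)    # 0.125 (quadratic regime)
large = huber(10.0)   # 9.5   (linear regime)
large_mse = mse(10.0) # 100.0
```

The two pieces meet smoothly at $|r| = \delta$: both the value and the slope agree there, which is why gradient-based optimizers handle Huber loss without trouble.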
Divergence-Based Losses
KL Divergence Loss
The Kullback-Leibler divergence from distribution $Q$ to distribution $P$ is:

$$D_{\text{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence is non-negative ($D_{\text{KL}}(P \,\|\, Q) \ge 0$ by Gibbs' inequality) and equals zero if and only if $P = Q$. It is not symmetric: $D_{\text{KL}}(P \,\|\, Q) \ne D_{\text{KL}}(Q \,\|\, P)$ in general.
KL divergence is used in knowledge distillation (matching a student's output distribution to a teacher's), variational autoencoders (regularizing the latent distribution toward a prior), and reinforcement learning from human feedback (penalizing deviation from a reference policy).
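A sketch for discrete distributions (the helper name `kl_divergence` and the example distributions are ours):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) for discrete distributions; assumes
    q[x] > 0 wherever p[x] > 0 (absolute continuity)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
forward = kl_divergence(p, q)  # D(P || Q)
reverse = kl_divergence(q, p)  # D(Q || P) -- a different value
```

Running this shows the asymmetry numerically: `forward` and `reverse` differ, while `kl_divergence(p, p)` is exactly zero.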
Contrastive Loss
For a pair of examples with label $y \in \{0, 1\}$ indicating whether they are similar ($y = 1$) or dissimilar ($y = 0$):

$$\mathcal{L} = y \, d^2 + (1 - y) \max(0,\, m - d)^2$$

where $d$ is the distance between the two embeddings and $m$ is a margin. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least distance $m$ apart.
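The pull/push behavior can be sketched as follows (helper name `contrastive_loss` assumed; `d` is a precomputed embedding distance):

```python
def contrastive_loss(y: int, d: float, margin: float = 1.0) -> float:
    """y = 1: similar pair, penalize any nonzero distance.
    y = 0: dissimilar pair, penalize only distances inside the margin."""
    if y == 1:
        return d * d
    return max(0.0, margin - d) ** 2

sim = contrastive_loss(1, 0.5)     # 0.25 -- similar pair still 0.5 apart
dissim = contrastive_loss(0, 0.5)  # 0.25 -- dissimilar pair too close
far = contrastive_loss(0, 1.5)     # 0.0  -- already past the margin
```

Dissimilar pairs beyond the margin contribute nothing, so the loss does not waste gradient pushing apart pairs that are already well separated.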
Main Theorems
Cross-Entropy Minimization Equals Maximum Likelihood
Statement
For a model parameterized by $\theta$ that outputs class probabilities $p_\theta(y \mid x)$, minimizing the average cross-entropy loss on a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ is equivalent to maximizing the log-likelihood:

$$\arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{\text{CE}}\big(y_i, p_\theta(\cdot \mid x_i)\big) = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$
Intuition
Cross-entropy loss for a one-hot target is just $-\log \hat{p}_c$: the negative log-probability of the true class $c$. Summing over examples gives the negative log-likelihood. Minimizing the negative is maximizing the positive.
Proof Sketch
Expand the cross-entropy: $\mathcal{L} = -\sum_k y_k \log \hat{p}_k$. For a one-hot label where $y_c = 1$ and $y_k = 0$ for $k \ne c$, this simplifies to $-\log \hat{p}_c$. Sum over the dataset and negate.
Why It Matters
This equivalence connects two perspectives: the information-theoretic view (cross-entropy measures how many extra bits your model needs) and the statistical view (maximum likelihood is the optimal estimator under regularity conditions). It justifies using cross-entropy as the default classification loss.
Failure Mode
The equivalence holds only when the model outputs valid probability distributions (non-negative, sum to 1). If the model is miscalibrated (probabilities do not reflect true frequencies), cross-entropy still works for discrimination but the probabilistic interpretation breaks down.
Huber Loss Bounded Influence Function
Statement
The influence function of the Huber loss estimator is bounded: for any observation with residual $r$,

$$|\psi_\delta(r)| \le \delta$$

where $\psi_\delta(r) = \min(\delta, \max(-\delta, r))$ is the clipped residual. In contrast, the influence function of MSE is unbounded: $\psi(r) = r$, which grows without limit.
Intuition
MSE's gradient is proportional to the residual, so a single outlier with residual 1000 exerts 1000x more influence than a typical point. Huber's gradient is capped at $\delta$, so no single point can dominate the gradient regardless of how far it is from the prediction.
Proof Sketch
The gradient of Huber loss is $r$ for $|r| \le \delta$ and $\delta \cdot \operatorname{sign}(r)$ for $|r| > \delta$. The maximum absolute value is $\delta$, achieved for all $|r| \ge \delta$.
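The piecewise gradient above is just the residual clipped to $[-\delta, \delta]$, which a short sketch makes concrete (helper name `huber_grad` is ours):

```python
def huber_grad(r: float, delta: float = 1.0) -> float:
    """Derivative of Huber loss with respect to the residual r:
    the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))

g1 = huber_grad(0.3)     # 0.3  -- quadratic regime, behaves like MSE
g2 = huber_grad(1000.0)  # 1.0  -- capped: the outlier cannot dominate
g3 = huber_grad(-50.0)   # -1.0 -- capped on the negative side too
```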
Why It Matters
In real datasets, outliers are common (mislabeled examples, sensor errors, data entry mistakes). Huber loss provides a principled way to limit their influence without requiring explicit outlier removal. The parameter $\delta$ controls the tradeoff: smaller $\delta$ means more robustness but less statistical efficiency under Gaussian noise.
Failure Mode
Huber loss is robust to outliers in the target $y$, not in the input $x$. A leverage point (outlier in input space) can still distort the fit. For robustness to both, you need methods from robust regression (e.g., M-estimators with bounded leverage).
Why Loss Choice Matters More Than Architecture
For a fixed architecture, the loss function determines what the model optimizes. Concrete examples:
- Object detection with cross-entropy treats all misclassifications equally. With focal loss, the model focuses on hard negatives and achieves significantly higher mAP.
- Regression with MSE on heavy-tailed data produces estimates pulled toward outliers. Switching to Huber or MAE can reduce test error by 20%+ without changing the model.
- Knowledge distillation with hard labels (cross-entropy on argmax) loses information. Soft labels with KL divergence preserve the teacher's inter-class relationships.
Common Confusions
Cross-entropy and log loss are the same thing
In the binary case, cross-entropy loss and log loss (logistic loss) are identical: $-[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,]$. In the multi-class case, "log loss" typically refers to the same formula as multi-class cross-entropy. The terms are interchangeable.
KL divergence is not a distance
$D_{\text{KL}}(P \,\|\, Q) \ne D_{\text{KL}}(Q \,\|\, P)$, and KL divergence does not satisfy the triangle inequality. It is a divergence, not a metric. The direction matters: $D_{\text{KL}}(Q \,\|\, P)$ penalizes places where $Q(x) > 0$ but $P(x) \approx 0$ (mode-seeking when optimizing $Q$), while $D_{\text{KL}}(P \,\|\, Q)$ does the reverse (mean-seeking).
Hinge loss does not produce probability estimates
Unlike cross-entropy, hinge loss does not require or produce probability outputs. An SVM's raw output is a signed distance from the decision boundary, not a probability. To get probabilities from an SVM, you need Platt scaling as a post-processing step.
Key Takeaways
- Cross-entropy = negative log-likelihood for classification; the default choice
- MSE assumes Gaussian noise; use Huber or MAE when outliers are present
- Focal loss addresses class imbalance by down-weighting easy examples
- Hinge loss creates maximum-margin classifiers (SVMs)
- KL divergence measures distributional mismatch; critical for distillation and VAEs
- Contrastive loss learns representations by comparing pairs
- The choice of loss encodes assumptions about noise, class balance, and error costs
Exercises
Problem
Compute the cross-entropy loss for a 3-class problem where the true label is class 2 (zero-indexed) and the model predicts $\hat{p} = (0.1, 0.2, 0.7)$.
Problem
For Huber loss with $\delta = 1$, compute the loss for residuals $r = 0.5$, $r = 2$, and $r = 10$. Compare with MSE for the same residuals.
Problem
Show that focal loss with $\gamma = 0$ reduces to cross-entropy, and explain why increasing $\gamma$ concentrates the loss on hard examples. Compute the ratio of focal loss at $p_t = 0.9$ to focal loss at $p_t = 0.5$ for $\gamma = 0$ and $\gamma = 2$.
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3 (cross-entropy), Chapter 7.1 (SVM/hinge)
- Huber, "Robust Estimation of a Location Parameter" (1964), Annals of Mathematical Statistics
Current:
- Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV
- Khosla et al., "Supervised Contrastive Learning" (2020), NeurIPS
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)