ML Methods
Multi-Class and Multi-Label Classification
Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.
Why This Matters
Binary classification is the special case. Real problems often involve multiple classes (image recognition with 1000 categories) or multiple labels per example (a movie that is both "comedy" and "romance"). The choice of output activation, loss function, and evaluation metric must match the problem structure. Using softmax for a multi-label problem is a common and costly mistake.
Multi-Class Classification
Each input has exactly one correct label $y \in \{1, \dots, K\}$. The model outputs a probability distribution over the $K$ classes: $p_k = P(y = k \mid x)$ with $\sum_{k=1}^{K} p_k = 1$.
Softmax
The softmax function converts a vector of logits $z \in \mathbb{R}^K$ into a valid probability distribution:

$$p_k = \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Properties: all outputs are positive, they sum to 1, and increasing one logit decreases all other probabilities. This mutual exclusivity is correct for multi-class but wrong for multi-label.
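These properties are easy to see in code. A minimal NumPy sketch (logit values here are illustrative, not from the text), using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the
    result unchanged but avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)            # all positive, sums to 1

# Increasing one logit decreases every *other* probability:
p2 = softmax(np.array([3.0, 1.0, 0.1]))
```

The shift by `max(z)` is valid because softmax is invariant to adding a constant to all logits; only logit differences matter.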
Loss: Categorical Cross-Entropy
$$L = -\log p_c$$

where $c$ is the true class. This is the negative log-likelihood under a categorical distribution.
Softmax Cross-Entropy Gradient
Statement
The gradient of the categorical cross-entropy loss with respect to the logits has a remarkably simple form:

$$\frac{\partial L}{\partial z_k} = p_k - \mathbb{1}[k = c]$$

where $p_k = \mathrm{softmax}(z)_k$ and $c$ is the true class index.
Intuition
The gradient is the difference between the predicted probability and the target (which is 1 for the true class, 0 for others). If the model assigns probability 0.8 to the correct class, the gradient for that logit is $0.8 - 1 = -0.2$, pushing it up. All other logits have gradients equal to their predicted probabilities (positive, pushing them down). This is the same "prediction minus target" form as in linear and logistic regression.
Proof Sketch
For $k = c$: $\frac{\partial L}{\partial z_c} = -\frac{1}{p_c} \cdot p_c(1 - p_c) = p_c - 1$. For $k \neq c$: $\frac{\partial L}{\partial z_k} = -\frac{1}{p_c} \cdot (-p_c p_k) = p_k$. Both cases are captured by $\frac{\partial L}{\partial z_k} = p_k - \mathbb{1}[k = c]$.
Why It Matters
This clean gradient is why softmax + cross-entropy is the universal choice for multi-class classification. The gradient never saturates (unlike MSE with sigmoid) and is trivially cheap to compute.
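The "prediction minus target" form can be verified numerically. The sketch below (illustrative logits and class index) checks the analytic gradient against a central finite-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, c):
    """Categorical cross-entropy: negative log-probability of class c."""
    return -np.log(softmax(z)[c])

z = np.array([0.5, -1.2, 2.0])   # illustrative logits
c = 0                            # illustrative true class index

# Analytic gradient: p - one_hot(c)
p = softmax(z)
grad = p.copy()
grad[c] -= 1.0

# Central finite-difference check on each coordinate
eps = 1e-6
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric = (loss(zp, c) - loss(zm, c)) / (2 * eps)
    assert abs(numeric - grad[k]) < 1e-5
```

Note that the gradient components sum to zero: the probabilities sum to 1 and the one-hot target also sums to 1.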
Failure Mode
When classes are hierarchical (e.g., "golden retriever" is a "dog" is an "animal"), flat softmax treats all misclassifications as equally bad. Predicting "labrador" when the answer is "golden retriever" gets the same loss as predicting "airplane". Hierarchical loss functions address this.
Multi-Label Classification
Each input has a label set $Y \subseteq \{1, \dots, K\}$ that can contain zero, one, or multiple labels. The model outputs an independent probability $p_k = P(k \in Y \mid x)$ for each label $k$. These probabilities do not need to sum to 1.
Sigmoid Per Label
Apply an independent sigmoid to each logit:

$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$
Each label's probability is computed independently. The model can output, say, $p_1 = 0.9$ and $p_2 = 0.8$ simultaneously (both labels likely present).
Loss: Binary Cross-Entropy Per Label
$$L = \sum_{k=1}^{K} \left[ -y_k \log p_k - (1 - y_k) \log(1 - p_k) \right]$$

where $y_k \in \{0, 1\}$ indicates whether label $k$ is present. This is the sum of $K$ independent binary classification losses.
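A minimal sketch of the sigmoid-per-label setup (label names and logit values are illustrative), showing that the probabilities need not sum to 1 and that the loss is just a sum of binary cross-entropies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Sum of K independent binary cross-entropies.
    `targets` is a 0/1 vector; several entries may be 1 at once."""
    p = sigmoid(logits)
    return float(np.sum(-targets * np.log(p) - (1 - targets) * np.log(1 - p)))

logits = np.array([2.0, 1.5, -3.0])    # e.g. comedy, romance, horror
targets = np.array([1.0, 1.0, 0.0])    # comedy AND romance both present

p = sigmoid(logits)                    # probabilities need not sum to 1
loss = multilabel_bce(logits, targets)
```

Contrast this with softmax, which would force the comedy and romance probabilities to compete for the same unit of mass.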
Multi-Label Independence Assumption
Statement
The standard multi-label loss with independent sigmoids assumes conditional independence of labels given the input:

$$P(Y \mid x) = \prod_{k=1}^{K} P(y_k \mid x)$$
This factorization means the model cannot capture label correlations (e.g., "comedy" and "romantic" co-occurring more often than "comedy" and "horror") without the input features encoding this information.
Intuition
Each label has its own binary classifier. These classifiers share the feature extractor (backbone network) but make independent final predictions. Label correlations can only be captured through shared features, not through the output layer.
Proof Sketch
The likelihood under independent Bernoulli outputs factorizes: $P(y_1, \dots, y_K \mid x) = \prod_{k=1}^{K} p_k^{y_k} (1 - p_k)^{1 - y_k}$. Taking the negative log gives the sum of binary cross-entropies. This is the MLE objective under the independence assumption.
Why It Matters
If label correlations are strong, the independence assumption is limiting. Methods like classifier chains, graph neural networks on the label space, or transformer decoders that attend to previously predicted labels can capture these dependencies.
Failure Mode
With many correlated labels and small datasets, the independence assumption can lead to incoherent predictions (e.g., predicting "pregnant" and "male" simultaneously). Joint modeling of labels helps, but at higher computational cost.
Reduction Strategies
One-vs-Rest (OvR)
Train $K$ binary classifiers, one per class. Classifier $k$ predicts "class $k$" vs "not class $k$". At inference, select the class whose classifier reports the highest confidence.
- Pros: simple, parallelizable, works with any binary classifier
- Cons: class imbalance (each binary classifier sees 1 positive class vs $K - 1$ negative classes), assumes independence between classifiers
One-vs-One (OvO)
Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Classifier $(i, j)$ distinguishes class $i$ from class $j$. At inference, run all pairwise classifiers and use majority voting.
- Pros: each classifier sees balanced data, each is a simpler problem
- Cons: $O(K^2)$ classifiers, expensive for large $K$
For neural networks, OvR and OvO are rarely used. A single network with a softmax output layer handles multi-class classification directly and shares features across classes.
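As an illustration of the two reduction schemes at inference time, here is a toy sketch where hypothetical per-class linear scores stand in for trained classifiers (in a real OvR/OvO setup, each classifier is trained on its own binary task):

```python
import numpy as np
from itertools import combinations

# Toy stand-in: one linear score per class (hypothetical weights).
K = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, 2))        # one weight vector per class

def ovr_predict(x):
    """One-vs-rest: K classifiers, pick the most confident one."""
    return int(np.argmax(W @ x))

def ovo_predict(x):
    """One-vs-one: K(K-1)/2 pairwise duels, majority vote wins."""
    votes = np.zeros(K, dtype=int)
    for i, j in combinations(range(K), 2):
        winner = i if W[i] @ x > W[j] @ x else j
        votes[winner] += 1
    return int(np.argmax(votes))

x = np.array([1.0, -0.5])
```

With shared scores like these, the two schemes agree (the top-scoring class wins every one of its pairwise duels); with independently trained classifiers their confidences are not directly comparable, which is part of the cost listed above.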
Hierarchical Classification
When classes have a tree structure (e.g., taxonomy), predictions should respect the hierarchy. If the model predicts "golden retriever", it implicitly predicts "dog" and "animal".
Approaches:
- Flat classification with hierarchy-aware loss (penalize less for nearby misclassifications in the tree)
- Top-down classification: predict coarse class first, then refine
- Per-level classifiers: one softmax per level of the hierarchy
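One way to make a flat loss hierarchy-aware, per the first approach above, is to weight each mistake by tree distance between the predicted and true labels. A sketch with a small hypothetical taxonomy:

```python
# Hypothetical toy taxonomy: child -> parent (None marks a root).
parent = {
    "golden_retriever": "dog", "labrador": "dog",
    "dog": "animal", "cat": "animal", "animal": None,
    "airplane": "vehicle", "vehicle": None,
}

def ancestors(node):
    """Chain from a node up to its root, inclusive."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain

def tree_distance(a, b):
    """Edges between a and b via their lowest common ancestor;
    if they share no ancestor, route through both roots."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    for d, node in enumerate(anc_a):
        if node in anc_b:
            return d + anc_b.index(node)
    return len(anc_a) + len(anc_b)

# A nearby mistake is cheaper than a distant one:
d_near = tree_distance("golden_retriever", "labrador")   # siblings
d_far = tree_distance("golden_retriever", "airplane")    # different trees
```

A hierarchy-aware loss can then scale the per-example penalty by `tree_distance(predicted, true)` instead of treating every misclassification identically.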
Evaluation Metrics
Macro, Micro, and Weighted Averaging
For per-class metrics (precision $P_k$, recall $R_k$, F1 $F_{1,k}$):
- Macro average: $F_1^{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} F_{1,k}$. Treats all classes equally.
- Micro average: compute TP, FP, FN globally across all classes, then compute F1. Dominated by frequent classes.
- Weighted average: $F_1^{\text{weighted}} = \sum_{k=1}^{K} w_k F_{1,k}$, where $w_k$ is the fraction of samples in class $k$. Accounts for class imbalance.
Use macro averaging when all classes matter equally (even rare ones). Use micro averaging when overall accuracy matters more than per-class performance. Use weighted averaging as a compromise.
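The three averaging schemes can be computed from raw counts. This toy sketch (hypothetical helper, single-label case) makes the difference visible on imbalanced data:

```python
import numpy as np

def f1_averages(y_true, y_pred, K):
    """Per-class, macro, micro, and weighted F1 from raw predictions
    (single-label multi-class case)."""
    tp = np.zeros(K); fp = np.zeros(K); fn = np.zeros(K)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    macro = float(per_class.mean())
    # Micro pools counts globally; in the single-label case,
    # total FP == total FN, so micro-F1 equals plain accuracy.
    micro = float(2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum()))
    w = np.bincount(y_true, minlength=K) / len(y_true)
    weighted = float(per_class @ w)
    return per_class, macro, micro, weighted

# Imbalanced toy data scored against a trivial majority-class predictor
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
per_class, macro, micro, weighted = f1_averages(y_true, y_pred, K=2)
# micro (= accuracy) is 0.8, but macro exposes the ignored minority class
```

Here the micro average rewards the trivial predictor while the macro average drags the score down because class 1 is never predicted.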
Common Confusions
Softmax is wrong for multi-label classification
Softmax forces probabilities to sum to 1, creating competition between labels. If one label's probability goes up, others must go down. For multi-label problems where multiple labels can be simultaneously correct, use independent sigmoids. This is the single most common mistake in classification setup.
Accuracy is misleading for multi-class with imbalanced classes
A model that always predicts the majority class achieves accuracy equal to the majority class frequency. With 90% class A and 10% class B, a trivial model gets 90% accuracy. Use per-class metrics (precision, recall) and appropriate averaging.
Exercises
Problem
You have a document classification problem where each document belongs to exactly one of 50 categories. Which output activation and loss function should you use? What if each document can belong to multiple categories simultaneously?
Problem
You train a multi-class classifier with 100 classes using OvR (one-vs-rest). Each binary classifier achieves 95% accuracy. Can you conclude that the overall multi-class accuracy is 95%? Why or why not?
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
- Tsoumakas & Katakis, "Multi-Label Classification: An Overview" (2007), IJDWM
Current:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.2.2
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
- Cross-entropy loss deep dive: the full theory behind the loss function
- Confusion matrices and classification metrics: detailed evaluation for multi-class problems
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)