
ML Methods

Multi-Class and Multi-Label Classification

Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.



Why This Matters

Binary classification is the special case. Real problems often involve multiple classes (image recognition with 1000 categories) or multiple labels per example (a movie that is both "comedy" and "romance"). The choice of output activation, loss function, and evaluation metric must match the problem structure. Using softmax for a multi-label problem is a common and costly mistake.

Multi-Class Classification

Definition

Multi-Class Classification

Each input $x$ has exactly one correct label $y \in \{1, \ldots, K\}$. The model outputs a probability distribution over $K$ classes: $p_k(x) = P(Y = k \mid X = x)$ with $\sum_{k=1}^K p_k(x) = 1$.

Softmax

The softmax function converts a vector of logits $z \in \mathbb{R}^K$ into a valid probability distribution:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

Properties: all outputs are positive, they sum to 1, and increasing one logit decreases all other probabilities. This mutual exclusivity is correct for multi-class but wrong for multi-label.
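A minimal NumPy sketch (the logit values are illustrative) of the numerically stable softmax, which shifts by the max logit before exponentiating; the shift cancels in the ratio, so the output is unchanged:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max logit to avoid overflow in exp."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)          # all entries positive, ordering of the logits preserved
print(p.sum())    # sums to 1
```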

Loss: Categorical Cross-Entropy

$$\ell = -\log p_c$$

where $c$ is the true class. This is the negative log-likelihood under a categorical distribution.

Proposition

Softmax Cross-Entropy Gradient

Statement

The gradient of the categorical cross-entropy loss with respect to the logits $z_k$ has a remarkably simple form:

$$\frac{\partial \ell}{\partial z_k} = p_k - \mathbb{1}[k = c]$$

where $p_k = \text{softmax}(z)_k$ and $c$ is the true class index.

Intuition

The gradient is the difference between the predicted probability and the target (which is 1 for the true class, 0 for others). If the model assigns probability 0.8 to the correct class, the gradient for that logit is $0.8 - 1 = -0.2$, pushing it up. All other logits have gradients equal to their predicted probabilities (positive, pushing them down). This is the same "prediction minus target" form as in linear and logistic regression.

Proof Sketch

For $k = c$: $\partial \ell / \partial z_c = -\partial \log p_c / \partial z_c = -(1 - p_c) = p_c - 1$. For $k \neq c$: $\partial \ell / \partial z_k = -\partial \log p_c / \partial z_k = p_k$. Both cases are captured by $p_k - \mathbb{1}[k = c]$.

Why It Matters

This clean gradient is why softmax + cross-entropy is the universal choice for multi-class classification. The gradient never saturates (unlike MSE with sigmoid) and is trivially cheap to compute.
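The identity can be checked numerically with finite differences; a short sketch (logits and class index chosen arbitrarily):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, c):
    # Categorical cross-entropy for true class c
    return -np.log(softmax(z)[c])

z = np.array([1.5, -0.3, 0.8])
c = 0

# Analytic gradient: p_k - 1[k == c]
p = softmax(z)
analytic = p.copy()
analytic[c] -= 1.0

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (loss(zp, c) - loss(zm, c)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # maximum discrepancy, close to zero
```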

Failure Mode

When classes are hierarchical (e.g., "golden retriever" is a "dog" is an "animal"), flat softmax treats all misclassifications as equally bad. Predicting "labrador" when the answer is "golden retriever" gets the same loss as predicting "airplane". Hierarchical loss functions address this.

Multi-Label Classification

Definition

Multi-Label Classification

Each input $x$ has a label set $Y \subseteq \{1, \ldots, K\}$ that can contain zero, one, or multiple labels. The model outputs independent probabilities $p_k(x) = P(\text{label } k \text{ present} \mid X = x)$ for each label $k$. These probabilities do not need to sum to 1.

Sigmoid Per Label

Apply an independent sigmoid to each logit:

$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$

Each label's probability is computed independently. The model can output $p_1 = 0.9$ and $p_2 = 0.8$ simultaneously (both labels likely present).

Loss: Binary Cross-Entropy Per Label

$$\ell = -\frac{1}{K}\sum_{k=1}^K \big[y_k \log p_k + (1 - y_k) \log(1 - p_k)\big]$$

This is the mean of $K$ independent binary classification losses, one per label.
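A minimal sketch of the multi-label setup (logits and targets are made up), showing independent sigmoids and the per-label binary cross-entropy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over K independent labels."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

logits = np.array([2.0, 1.5, -3.0])   # model confident in labels 0 and 1
targets = np.array([1.0, 1.0, 0.0])   # e.g. "comedy" and "romance" present, "horror" absent

p = sigmoid(logits)
print(p)                              # probabilities need not sum to 1
print(multilabel_bce(logits, targets))
```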

Proposition

Multi-Label Independence Assumption

Statement

The standard multi-label loss with independent sigmoids assumes conditional independence of labels given the input:

$$P(Y_1, \ldots, Y_K \mid X = x) = \prod_{k=1}^K P(Y_k \mid X = x)$$

This factorization means the model cannot capture label correlations (e.g., "comedy" and "romantic" co-occurring more often than "comedy" and "horror") without the input features encoding this information.

Intuition

Each label has its own binary classifier. These classifiers share the feature extractor (backbone network) but make independent final predictions. Label correlations can only be captured through shared features, not through the output layer.

Proof Sketch

The likelihood under independent Bernoulli outputs factorizes: $\prod_k p_k^{y_k}(1-p_k)^{1-y_k}$. Taking the negative log gives the sum of binary cross-entropies. This is the MLE objective under the independence assumption.

Why It Matters

If label correlations are strong, the independence assumption is limiting. Methods like classifier chains, graph neural networks on the label space, or transformer decoders that attend to previously predicted labels can capture these dependencies.

Failure Mode

With many correlated labels and small datasets, the independence assumption can lead to incoherent predictions (e.g., predicting "pregnant" and "male" simultaneously). Joint modeling of labels helps, but at higher computational cost.

Reduction Strategies

One-vs-Rest (OvR)

Train $K$ binary classifiers, one per class. Classifier $k$ predicts "class $k$" vs "not class $k$". At inference, select the class with the highest confidence.

  • Pros: simple, parallelizable, works with any binary classifier
  • Cons: class imbalance (each binary classifier sees one positive class vs $K-1$ negative classes), assumes independence between classifiers

One-vs-One (OvO)

Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Classifier $(j, k)$ distinguishes class $j$ from class $k$. At inference, use majority voting.

  • Pros: each classifier sees balanced data, each is a simpler problem
  • Cons: $O(K^2)$ classifiers, expensive for large $K$

For neural networks, OvR and OvO are rarely used. A single network with a softmax output layer handles multi-class classification directly and shares features across classes.
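A sketch of the OvR reduction on synthetic blob data, using a hand-rolled logistic regression as the base binary classifier (in practice a library implementation would be used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class data: three well-separated Gaussian blobs in 2D
K, n = 3, 60
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
X = np.vstack([rng.normal(c, 0.5, size=(n, 2)) for c in centers])
y = np.repeat(np.arange(K), n)

def fit_binary_logreg(X, t, steps=500, lr=0.5):
    """Plain logistic regression by gradient descent; t is a 0/1 target vector."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - t) / len(t)        # "prediction minus target" gradient
    return w

# One-vs-rest: train K binary classifiers, class k vs everything else
W = np.array([fit_binary_logreg(X, (y == k).astype(float)) for k in range(K)])

# Inference: pick the class whose classifier is most confident
Xb = np.hstack([X, np.ones((len(X), 1))])
scores = Xb @ W.T                                # raw logits as confidence scores
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```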

Hierarchical Classification

When classes have a tree structure (e.g., taxonomy), predictions should respect the hierarchy. If the model predicts "golden retriever", it implicitly predicts "dog" and "animal".

Approaches:

  • Flat classification with hierarchy-aware loss (penalize less for nearby misclassifications in the tree)
  • Top-down classification: predict coarse class first, then refine
  • Per-level classifiers: one softmax per level of the hierarchy
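A toy sketch of the top-down approach (the taxonomy, logits, and head layout are all hypothetical): per-level softmaxes are combined by the chain rule, so every fine-grained prediction is consistent with its coarse parent.

```python
import numpy as np

# Hypothetical two-level taxonomy: coarse class -> fine classes
taxonomy = {"animal": ["dog", "cat"], "vehicle": ["car", "plane"]}

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Assumed model outputs: one coarse logit vector, plus one fine logit
# vector per coarse class (in practice from heads on a shared backbone)
coarse_logits = np.array([2.0, -1.0])            # animal vs vehicle
fine_logits = {"animal": np.array([1.2, 0.3]),
               "vehicle": np.array([0.0, 0.5])}

# Chain rule: P(fine) = P(coarse) * P(fine | coarse)
p_coarse = softmax(coarse_logits)
p_fine = {}
for i, (coarse, fines) in enumerate(taxonomy.items()):
    cond = softmax(fine_logits[coarse])
    for j, fine in enumerate(fines):
        p_fine[fine] = p_coarse[i] * cond[j]

print(p_fine)                     # fine-class probabilities still sum to 1
best = max(p_fine, key=p_fine.get)
print(best)                       # predicting a fine class implies its coarse parent
```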

Evaluation Metrics

Definition

Macro, Micro, and Weighted Averaging

For per-class metrics (precision $P_k$, recall $R_k$, F1 $F_k$):

  • Macro average: $\frac{1}{K} \sum_{k=1}^K F_k$. Treats all classes equally.
  • Micro average: compute TP, FP, FN globally across all classes, then compute F1. Dominated by frequent classes.
  • Weighted average: $\sum_{k=1}^K w_k F_k$ where $w_k$ is the fraction of samples in class $k$. Accounts for class imbalance.

Use macro averaging when all classes matter equally (even rare ones). Use micro averaging when overall accuracy matters more than per-class performance. Use weighted averaging as a compromise.
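The three averaging schemes can be computed directly from per-class confusion counts. A NumPy sketch on a small made-up prediction vector (note that for single-label multi-class problems, micro-F1 equals plain accuracy):

```python
import numpy as np

def per_class_counts(y_true, y_pred, K):
    """True positives, false positives, false negatives for each class."""
    tp = np.array([np.sum((y_pred == k) & (y_true == k)) for k in range(K)])
    fp = np.array([np.sum((y_pred == k) & (y_true != k)) for k in range(K)])
    fn = np.array([np.sum((y_pred != k) & (y_true == k)) for k in range(K)])
    return tp, fp, fn

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])   # imbalanced: class 0 dominates
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2])

tp, fp, fn = per_class_counts(y_true, y_pred, K=3)

macro_f1 = f1(tp, fp, fn).mean()                 # every class weighted equally
micro_f1 = f1(tp.sum(), fp.sum(), fn.sum())      # dominated by frequent classes
weights = np.bincount(y_true) / len(y_true)      # class frequencies
weighted_f1 = (weights * f1(tp, fp, fn)).sum()

print(macro_f1, micro_f1, weighted_f1)
```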

Common Confusions

Watch Out

Softmax is wrong for multi-label classification

Softmax forces probabilities to sum to 1, creating competition between labels. If one label's probability goes up, others must go down. For multi-label problems where multiple labels can be simultaneously correct, use independent sigmoids. This is the single most common mistake in classification setup.

Watch Out

Accuracy is misleading for multi-class with imbalanced classes

A model that always predicts the majority class achieves accuracy equal to the majority class frequency. With 90% class A and 10% class B, a trivial model gets 90% accuracy. Use per-class metrics (precision, recall) and appropriate averaging.
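A quick numerical illustration of the trap, assuming a degenerate model that always predicts the majority class:

```python
import numpy as np

# 90% class A (0), 10% class B (1); the model always predicts the majority class
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

accuracy = (y_pred == y_true).mean()
recall_b = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(accuracy)   # 0.9, looks good
print(recall_b)   # 0.0, class B is never detected
```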

Exercises

ExerciseCore

Problem

You have a document classification problem where each document belongs to exactly one of 50 categories. Which output activation and loss function should you use? What if each document can belong to multiple categories simultaneously?

ExerciseAdvanced

Problem

You train a multi-class classifier with 100 classes using OvR (one-vs-rest). Each binary classifier achieves 95% accuracy. Can you conclude that the overall multi-class accuracy is 95%? Why or why not?

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
  • Tsoumakas & Katakis, "Multi-Label Classification: An Overview" (2007), IJDWM

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.2.2

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Last reviewed: April 2026
