

Confusion Matrix Deep Dive

Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.


Why This Matters

[Figure: a 2×2 confusion matrix over 1,000 examples with TP = 85, FN = 15, FP = 10, TN = 890, annotated with Precision = TP/(TP+FP) = 0.895, Recall = TP/(TP+FN) = 0.850, F1 = 2PR/(P+R) = 0.872, Accuracy = (TP+TN)/N = 0.975. Accuracy is 97.5%, but only 85% of actual positives are caught (recall). Precision and recall tell the real story.]

Accuracy is a single number that hides critical failure modes. A model with 99% accuracy on a dataset where 99% of examples belong to one class may have learned nothing: always predicting the majority class achieves the same score. The confusion matrix is the complete record of what a classifier got right and wrong. Every useful classification metric (precision, recall, F1, MCC, kappa) is a function of the confusion matrix. If you cannot read a confusion matrix, you cannot evaluate a classifier.

Mental Model

A confusion matrix is a table. Rows correspond to true labels. Columns correspond to predicted labels. Entry (i, j) counts how many examples with true label i received predicted label j. Diagonal entries are correct predictions. Off-diagonal entries are errors. The pattern of errors tells you how the model fails, not just how often.

Convention warning: some libraries (notably scikit-learn) use rows = true, columns = predicted. Others transpose this. Always check.
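The convention is easy to verify mechanically. Below is a minimal pure-Python sketch (the helper name `confusion_matrix` mirrors scikit-learn's but this is an illustration, not that library); feed it a tiny example where you know exactly where each count must land:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows = true labels, columns = predictions (the scikit-learn convention)."""
    index = {label: i for i, label in enumerate(labels)}
    C = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        C[index[t]][index[p]] += 1
    return C

# Known example: one hit and one miss on the positive class.
# With rows = true, the false negative must land in row "pos", column "neg".
C = confusion_matrix(["pos", "pos"], ["pos", "neg"], labels=["pos", "neg"])
print(C)  # [[1, 1], [0, 0]]
```

If your library returns the transpose of this for the same inputs, its columns are the true labels.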

Formal Setup

Definition

Confusion matrix

For a K-class classification problem with n examples, the confusion matrix C has entries:

C_{ij} = |\{x : \text{true}(x) = i \text{ and } \text{pred}(x) = j\}|

The row sums \sum_j C_{ij} give the number of examples in each true class. The column sums \sum_i C_{ij} give the number of predictions for each class.

Per-Class Metrics

For class k in a multi-class problem, define the one-vs-rest quantities:

  • True positives: \text{TP}_k = C_{kk}
  • False positives: \text{FP}_k = \sum_{i \neq k} C_{ik} (column k minus diagonal)
  • False negatives: \text{FN}_k = \sum_{j \neq k} C_{kj} (row k minus diagonal)
  • True negatives: \text{TN}_k = n - \text{TP}_k - \text{FP}_k - \text{FN}_k

Then:

\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}, \quad \text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}
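These one-vs-rest sums translate directly into code. A sketch (the 3×3 matrix is illustrative, with values chosen by hand):

```python
def per_class_metrics(C, k):
    """One-vs-rest precision and recall for class k (rows = true)."""
    K = len(C)
    tp = C[k][k]
    fp = sum(C[i][k] for i in range(K) if i != k)   # column k minus diagonal
    fn = sum(C[k][j] for j in range(K) if j != k)   # row k minus diagonal
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative 3-class matrix (rows = true labels A, B, C).
C = [[30, 2, 3],
     [4, 25, 1],
     [0, 6, 29]]
p0, r0 = per_class_metrics(C, 0)
print(round(p0, 3), round(r0, 3))  # precision 30/34, recall 30/35
```

Note the guard against zero denominators: a class that is never predicted has undefined precision, and returning 0.0 is one common convention.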

Micro vs Macro Averaging

Definition

Macro averaging

Compute the metric for each class independently, then average:

\text{Macro-Precision} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k

Macro averaging weights all classes equally regardless of their size. A rare class with 10 examples counts as much as a common class with 10,000 examples.

Definition

Micro averaging

Pool all TP, FP, FN across classes, then compute the metric once:

\text{Micro-Precision} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}

For single-label classification, micro-precision = micro-recall = micro-F1 = accuracy. Micro averaging is dominated by the largest classes.

The choice matters when classes are imbalanced. Macro averaging surfaces failures on rare classes. Micro averaging reflects overall performance weighted by class frequency.
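The gap between the two averages is easiest to see on a deliberately imbalanced matrix. A sketch with illustrative numbers, where the rare second class is predicted badly:

```python
def macro_and_micro_precision(C):
    """Both precision averages from a K x K confusion matrix (rows = true)."""
    K = len(C)
    tps = [C[k][k] for k in range(K)]
    fps = [sum(C[i][k] for i in range(K) if i != k) for k in range(K)]
    macro = sum(tp / (tp + fp) for tp, fp in zip(tps, fps)) / K
    micro = sum(tps) / (sum(tps) + sum(fps))
    return macro, micro

# 100 examples of class 0, only 10 of class 1, which is mostly missed.
C = [[90, 10],
     [8, 2]]
macro, micro = macro_and_micro_precision(C)
print(round(macro, 3), round(micro, 3))
```

Here micro-precision equals accuracy (92/110 ≈ 0.836) and looks healthy, while the macro average (≈ 0.543) is dragged down by the rare class's precision of 2/12.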

Matthews Correlation Coefficient

Proposition

MCC as Geometric Mean of Regression Coefficients

Statement

The Matthews Correlation Coefficient for binary classification is:

\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}

This equals the Pearson correlation between the true binary labels and the predicted binary labels. It is also the geometric mean of the two regression coefficients (predicting true from predicted, and predicted from true): \text{MCC} = \sqrt{\text{Informedness} \times \text{Markedness}}.

Intuition

MCC uses all four quadrants of the confusion matrix. It returns +1 for perfect prediction, 0 for random prediction, and -1 for total disagreement. Unlike F1, which ignores true negatives, MCC penalizes models that succeed on one class by failing on another.

Proof Sketch

Treat the 2 \times 2 confusion matrix as a contingency table. The Pearson correlation for binary variables reduces to the phi coefficient, which is exactly the MCC formula. The geometric mean identity follows from Informedness = \text{TPR} - \text{FPR} and Markedness = \text{PPV} - \text{FOR}, whose product equals \text{MCC}^2 by algebraic manipulation of the confusion matrix entries.

Why It Matters

MCC is the single best metric for binary classification on imbalanced datasets. Accuracy can be high despite useless predictions (predict the majority class always). F1 ignores true negatives. MCC accounts for all four cells and is invariant to which class is labeled positive.

Failure Mode

MCC is undefined when any row or column of the confusion matrix sums to zero (division by zero). This happens when the model predicts only one class or when one class is absent from the test set. In multi-class settings, the extension of MCC (the R_K coefficient) is less commonly used and harder to interpret.
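A minimal sketch of binary MCC that makes the degenerate case explicit; returning 0.0 when a marginal sum is zero is one common convention (scikit-learn does the same), not the only possible choice:

```python
import math

def binary_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a 2x2 confusion matrix.
    Returns 0.0 when a row or column sum is zero (one common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# A majority-class predictor on a 99:1 test set: 99% accuracy, MCC 0.
degenerate = binary_mcc(tp=0, fp=0, fn=10, tn=990)
print(degenerate)  # 0.0
```

A perfect classifier gives +1 (e.g. `binary_mcc(50, 0, 0, 50)`), and swapping which class is called positive leaves the value unchanged.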

Cohen's Kappa

Definition

Cohen's kappa

Cohen's kappa measures agreement between predicted and true labels corrected for chance:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is observed agreement (accuracy) and p_e is expected agreement under independence:

p_e = \sum_{k=1}^{K} \frac{(\sum_j C_{kj})(\sum_i C_{ik})}{n^2}

\kappa = 1 means perfect agreement. \kappa = 0 means no better than random assignment that preserves marginal distributions.

Kappa is useful when the class distribution is skewed and accuracy is misleading. It adjusts for the fact that even a random classifier achieves nonzero accuracy when classes are imbalanced.
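The definition above is a few lines of code. A sketch with an illustrative skewed 2×2 matrix, chosen so that accuracy looks decent while agreement is actually below chance:

```python
def cohen_kappa(C):
    """Cohen's kappa from a K x K confusion matrix (rows = true)."""
    K = len(C)
    n = sum(sum(row) for row in C)
    p_o = sum(C[k][k] for k in range(K)) / n                    # observed agreement
    rows = [sum(C[k]) for k in range(K)]                        # true-class counts
    cols = [sum(C[i][k] for i in range(K)) for k in range(K)]   # predicted counts
    p_e = sum(r * c for r, c in zip(rows, cols)) / n ** 2       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# 90% accuracy, but the model never gets the rare class right:
# chance agreement p_e = 0.905 exceeds observed p_o = 0.9, so kappa < 0.
C = [[90, 5],
     [5, 0]]
kappa = cohen_kappa(C)
print(round(kappa, 3))
```

Negative kappa is rare in practice but diagnostic: the classifier agrees with the truth less often than a marginal-preserving random assignment would.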

Cost-Sensitive Evaluation

Not all errors are equally bad. In medical diagnosis, a false negative (missing a disease) is typically worse than a false positive (unnecessary follow-up). A cost matrix W \in \mathbb{R}^{K \times K} assigns a cost W_{ij} to predicting class j when the true class is i. Correct predictions have W_{ii} = 0.

The total cost is:

\text{Cost} = \sum_{i,j} W_{ij} \cdot C_{ij}

Accuracy implicitly uses W_{ij} = 1 for i \neq j and W_{ii} = 0. Making the cost matrix explicit forces you to state what errors matter.
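A sketch of the cost sum, using the screening numbers from the example below and an assumed cost matrix in which a missed disease costs 10 times an unnecessary follow-up (the 10:1 ratio is an illustrative assumption, not a clinical recommendation):

```python
def total_cost(C, W):
    """Sum of W[i][j] * C[i][j]: cost-weighted error count."""
    K = len(C)
    return sum(W[i][j] * C[i][j] for i in range(K) for j in range(K))

C = [[80, 20],       # row 0 = true positives: 80 caught, 20 missed
     [200, 9700]]    # row 1 = true negatives: 200 false alarms
W = [[0, 10],        # false negative costs 10
     [1, 0]]         # false positive costs 1; correct cells cost 0
cost = total_cost(C, W)
print(cost)  # 20 * 10 + 200 * 1 = 400
```

Changing W changes which classifier wins: under uniform costs this model's 220 errors look worse than a model with 150 false negatives and no false positives, but under the 10:1 matrix the latter costs 1,500.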

Canonical Examples

Example

Imbalanced binary classification

A disease screening test on 10,000 patients (100 positive, 9,900 negative):

          Pred +    Pred -
True +        80        20
True -       200     9,700

Accuracy: 9,780/10,000 = 97.8%. Precision: 80/280 = 28.6%. Recall: 80/100 = 80%. MCC: (80 · 9700 − 200 · 20)/√(280 · 100 · 9900 · 9720) ≈ 0.47.

The model detects 80% of cases but most positive predictions are wrong. Accuracy alone makes this look good. MCC of 0.47 correctly indicates moderate performance.
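These figures can be re-derived in a few lines (cell values taken from the table above):

```python
import math

tp, fn, fp, tn = 80, 20, 200, 9700
n = tp + fn + fp + tn

accuracy = (tp + tn) / n          # 9,780 / 10,000
precision = tp / (tp + fp)        # 80 / 280
recall = tp / (tp + fn)           # 80 / 100
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(mcc, 2))
# 0.978 0.286 0.8 0.47
```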

Common Confusions

Watch Out

Rows vs columns convention varies across libraries

scikit-learn: rows = true labels, columns = predictions. Some textbooks and R packages use the transpose. TensorFlow's tf.math.confusion_matrix follows scikit-learn. Always verify by checking a known example where you know the true labels and predictions.

Watch Out

F1 and accuracy can both be misleading on imbalanced data

F1 ignores true negatives entirely. A model that predicts "positive" for everything has recall = 1 and nonzero F1, even though it is useless. MCC is more reliable because it uses all four cells. On balanced datasets, F1 and MCC give similar rankings.

Watch Out

Multi-class MCC is not the average of binary MCCs

Extending MCC to K > 2 classes requires the generalized formula using the entire K × K confusion matrix, not averaging K one-vs-rest binary MCCs. The generalization treats the confusion matrix as a contingency table and computes a multivariate correlation.
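A sketch of that generalization (Gorodkin's R_K coefficient, which is also what scikit-learn's matthews_corrcoef computes for multi-class input), written in terms of the trace, the true-class counts t_k, and the predicted-class counts p_k; the 3×3 matrix is illustrative:

```python
import math

def multiclass_mcc(C):
    """Gorodkin's R_K: multi-class generalization of MCC (rows = true)."""
    K = len(C)
    n = sum(sum(row) for row in C)
    c = sum(C[k][k] for k in range(K))                        # trace: correct predictions
    t = [sum(C[k]) for k in range(K)]                         # true-class counts
    p = [sum(C[i][k] for i in range(K)) for k in range(K)]    # predicted-class counts
    num = c * n - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(n ** 2 - sum(pk ** 2 for pk in p)) \
        * math.sqrt(n ** 2 - sum(tk ** 2 for tk in t))
    return num / den if den else 0.0

C = [[40, 5, 5],
     [10, 30, 10],
     [0, 5, 45]]
rk = multiclass_mcc(C)
print(round(rk, 3))
```

For K = 2 the formula algebraically reduces to the binary MCC, so the same function handles both cases.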

Summary

  • The confusion matrix is the complete classification record; every metric derives from it
  • Macro averaging weights classes equally; micro averaging weights by class frequency
  • MCC uses all four quadrants and is the best single metric for imbalanced binary classification
  • Cohen's kappa corrects for chance agreement
  • Cost matrices make error asymmetry explicit
  • Always check whether your library puts true labels in rows or columns

Exercises

ExerciseCore

Problem

A 3-class confusion matrix is:

          Pred A    Pred B    Pred C
True A        40         5         5
True B        10        30        10
True C         0         5        45

Compute the macro-averaged precision.

ExerciseAdvanced

Problem

Prove that for binary classification, micro-averaged precision equals accuracy.

References

Canonical:

  • Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20(1), 1960
  • Matthews, "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme," BBA 405(2), 1975
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Section 7.2 (metrics for classification)
  • Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation," Journal of Machine Learning Technologies 2(1), 2011

Current:

  • Chicco & Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy," BMC Genomics 21, 2020
  • Sokolova & Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing and Management 45(4), 2009

Last reviewed: April 2026
