Methodology
Confusion Matrix Deep Dive
Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
Why This Matters
Accuracy is a single number that hides critical failure modes. A model with 99% accuracy on a dataset where 99% of examples belong to one class has learned nothing. The confusion matrix is the complete record of what a classifier got right and wrong. Every useful classification metric (precision, recall, F1, MCC, kappa) is a function of the confusion matrix. If you cannot read a confusion matrix, you cannot evaluate a classifier.
Mental Model
A confusion matrix is a table. Rows correspond to true labels. Columns correspond to predicted labels. Entry $C_{ij}$ counts how many examples with true label $i$ received predicted label $j$. Diagonal entries are correct predictions. Off-diagonal entries are errors. The pattern of errors tells you how the model fails, not just how often.
Convention warning: some libraries (notably scikit-learn) use rows = true, columns = predicted. Others transpose this. Always check.
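A quick way to pin down the convention is to build the matrix yourself on a tiny example where you know the answer by hand. The sketch below is a minimal pure-Python builder using the rows = true, columns = predicted convention that scikit-learn documents; the function name is illustrative, not from any library.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """C[i][j] = number of examples with true label i predicted as j."""
    C = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        C[t][p] += 1
    return C

# One example with true label 0 predicted as 1 must land in
# row 0 (true), column 1 (predicted) under this convention.
C = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
# C == [[1, 1], [0, 2]]
```

Running the same two label lists through your library of choice and comparing against this output tells you immediately which convention it uses.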
Formal Setup
Confusion matrix
For a $K$-class classification problem with $n$ examples, the confusion matrix $C \in \mathbb{Z}_{\geq 0}^{K \times K}$ has entries:

$$C_{ij} = \left|\{\, m : y_m = i,\ \hat{y}_m = j \,\}\right|$$

The row sums $\sum_j C_{ij}$ give the number of examples in each true class. The column sums $\sum_i C_{ij}$ give the number of predictions for each class.
Per-Class Metrics
For class $k$ in a multi-class problem, define the one-vs-rest quantities:
- True positives: $TP_k = C_{kk}$
- False positives: $FP_k = \sum_{i \neq k} C_{ik}$ (column minus diagonal)
- False negatives: $FN_k = \sum_{j \neq k} C_{kj}$ (row minus diagonal)
- True negatives: $TN_k = n - TP_k - FP_k - FN_k$

Then:

$$\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad \text{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \qquad F_{1,k} = \frac{2\,\text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}$$
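The one-vs-rest quantities can be read directly off the matrix. The sketch below assumes rows = true labels and columns = predictions; the function name is illustrative.

```python
def per_class_counts(C, k):
    """TP, FP, FN, TN for class k of confusion matrix C (rows = true)."""
    n = sum(sum(row) for row in C)
    tp = C[k][k]
    fp = sum(C[i][k] for i in range(len(C))) - tp   # column minus diagonal
    fn = sum(C[k][j] for j in range(len(C))) - tp   # row minus diagonal
    tn = n - tp - fp - fn
    return tp, fp, fn, tn

# 3-class example: for class 0, column 0 sums to 50 and row 0 sums to 50.
C = [[40, 5, 5],
     [10, 30, 10],
     [0, 5, 45]]
tp, fp, fn, tn = per_class_counts(C, 0)
precision = tp / (tp + fp)   # 40 / 50 = 0.8
recall = tp / (tp + fn)      # 40 / 50 = 0.8
```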
Micro vs Macro Averaging
Macro averaging
Compute the metric for each class independently, then average:

$$\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k$$
Macro averaging weights all classes equally regardless of their size. A rare class with 10 examples counts as much as a common class with 10,000 examples.
Micro averaging
Pool all TP, FP, FN across classes, then compute the metric once:

$$\text{Precision}_{\text{micro}} = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}$$
For single-label classification, micro-precision = micro-recall = micro-F1 = accuracy, because every misclassified example contributes exactly one false positive (to the predicted class) and one false negative (to the true class). Micro averaging is dominated by the largest classes.
The choice matters when classes are imbalanced. Macro averaging surfaces failures on rare classes. Micro averaging reflects overall performance weighted by class frequency.
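The divergence is easy to see on a small imbalanced example. The sketch below uses hypothetical counts: a rare class that the model mostly misses next to a common class it handles well.

```python
def macro_precision(C):
    """Average of per-class precisions (each class weighted equally)."""
    cols = [sum(C[i][j] for i in range(len(C))) for j in range(len(C))]
    return sum(C[k][k] / cols[k] for k in range(len(C))) / len(C)

def micro_precision(C):
    """Pooled precision; equals accuracy for single-label classification."""
    tp = sum(C[k][k] for k in range(len(C)))
    total = sum(sum(row) for row in C)
    return tp / total

# Rare class 0 (10 examples, mostly missed) vs common class 1 (990 examples).
C = [[2, 8],
     [8, 982]]
macro = macro_precision(C)   # (0.2 + 982/990) / 2 ≈ 0.596
micro = micro_precision(C)   # 984 / 1000 = 0.984
```

Micro averaging reports 0.984 and hides the collapse on the rare class; macro averaging drops to roughly 0.60 because the rare class counts as much as the common one.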
Matthews Correlation Coefficient
MCC as Geometric Mean of Regression Coefficients
Statement
The Matthews Correlation Coefficient for binary classification is:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

This equals the Pearson correlation between the true binary labels and the predicted binary labels. It is also the geometric mean of the two regression coefficients (predicting true from predicted, and predicted from true): $\text{MCC} = \sqrt{\text{Informedness} \times \text{Markedness}}$.
Intuition
MCC uses all four quadrants of the confusion matrix. It returns +1 for perfect prediction, 0 for random prediction, and -1 for total disagreement. Unlike F1, which ignores true negatives, MCC penalizes models that succeed on one class by failing on another.
Proof Sketch
Treat the confusion matrix as a $2 \times 2$ contingency table. The Pearson correlation for binary variables reduces to the phi coefficient, which is exactly the MCC formula. The geometric mean identity follows from Informedness $= \text{Recall} + \text{Specificity} - 1$ and Markedness $= \text{Precision} + \text{NPV} - 1$, whose product equals $\text{MCC}^2$ by algebraic manipulation of the confusion matrix entries.
Why It Matters
MCC is the single best metric for binary classification on imbalanced datasets. Accuracy can be high despite useless predictions (predict the majority class always). F1 ignores true negatives. MCC accounts for all four cells and is invariant to which class is labeled positive.
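The formula is a direct computation from the four cells. The sketch below applies it to the disease-screening example used later in this section; returning 0 when a margin is empty is one common convention for the undefined case discussed next.

```python
import math

def mcc(tp, fp, fn, tn):
    """Binary Matthews Correlation Coefficient (phi coefficient)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention: 0 when a margin is empty

# Screening example: accuracy is 0.978, but MCC ≈ 0.47 exposes weak precision.
value = mcc(tp=80, fp=200, fn=20, tn=9700)
```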
Failure Mode
MCC is undefined when any row or column of the confusion matrix sums to zero (division by zero). This happens when the model predicts only one class or when one class is absent from the test set. In multi-class settings, the extension of MCC (the $R_K$ coefficient) is less commonly used and harder to interpret.
Cohen's Kappa
Cohen's kappa
Cohen's kappa measures agreement between predicted and true labels corrected for chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement (accuracy) and $p_e$ is expected agreement under independence:

$$p_e = \frac{1}{n^2} \sum_{k=1}^{K} \left(\sum_j C_{kj}\right) \left(\sum_i C_{ik}\right)$$

$\kappa = 1$ means perfect agreement. $\kappa = 0$ means no better than random assignment that preserves the marginal distributions.
Kappa is useful when the class distribution is skewed and accuracy is misleading. It adjusts for the fact that even a random classifier achieves nonzero accuracy when classes are imbalanced.
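Kappa can be computed straight from the confusion matrix. The sketch below assumes rows = true labels and columns = predictions, and uses the screening example from this section.

```python
def cohens_kappa(C):
    """Cohen's kappa from a K x K confusion matrix (rows = true)."""
    n = sum(sum(row) for row in C)
    p_o = sum(C[k][k] for k in range(len(C))) / n          # observed agreement
    rows = [sum(row) for row in C]                          # true-class counts
    cols = [sum(C[i][j] for i in range(len(C))) for j in range(len(C))]
    p_e = sum(r * c for r, c in zip(rows, cols)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Accuracy 0.978 collapses to kappa ≈ 0.41 once chance agreement is removed.
kappa = cohens_kappa([[80, 20], [200, 9700]])
```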
Cost-Sensitive Evaluation
Not all errors are equally bad. In medical diagnosis, a false negative (missing a disease) is typically worse than a false positive (unnecessary follow-up). A cost matrix $\lambda$ assigns a cost $\lambda_{ij}$ to predicting class $j$ when the true class is $i$. Correct predictions have $\lambda_{ii} = 0$.

The total cost is:

$$\text{Cost} = \sum_{i=1}^{K} \sum_{j=1}^{K} C_{ij}\, \lambda_{ij}$$

Accuracy implicitly uses $\lambda_{ij} = 1$ for $i \neq j$ and $\lambda_{ii} = 0$. Making the cost matrix explicit forces you to state which errors matter.
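The computation is a cell-wise weighted sum. The costs below are illustrative, not taken from any real study; the confusion matrix is the screening example from this section.

```python
def total_cost(C, cost):
    """Sum of confusion-matrix cells weighted by the cost matrix."""
    return sum(C[i][j] * cost[i][j]
               for i in range(len(C)) for j in range(len(C)))

C = [[80, 20], [200, 9700]]   # rows = true, columns = predicted
cost = [[0, 50],              # missing a disease (FN) costs 50 (assumed)
        [1, 0]]               # a false alarm (FP) costs 1 (assumed)
# total = 20 * 50 + 200 * 1 = 1200
```

Under these assumed costs, the 20 false negatives dominate the total despite being ten times rarer than the false positives.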
Canonical Examples
Imbalanced binary classification
A disease screening test on 10,000 patients (100 positive, 9,900 negative):
|  | Pred + | Pred − |
|---|---|---|
| True + | 80 | 20 |
| True − | 200 | 9,700 |
Accuracy: $(80 + 9{,}700)/10{,}000 = 0.978$. Precision: $80/280 \approx 0.29$. Recall: $80/100 = 0.80$. MCC: $\approx 0.47$.
The model detects 80% of cases, but most positive predictions are wrong. Accuracy alone makes this look good. The MCC of 0.47 correctly indicates moderate performance.
Common Confusions
Rows vs columns convention varies across libraries
scikit-learn puts true labels in rows and predictions in columns; TensorFlow's tf.math.confusion_matrix follows the same convention. Some textbooks and R packages use the transpose. Always verify by checking a known example where you know the true labels and predictions.
F1 and accuracy can both be misleading on imbalanced data
F1 ignores true negatives entirely. A model that predicts "positive" for everything has recall = 1 and nonzero F1, even though it is useless. MCC is more reliable because it uses all four cells. On balanced datasets, F1 and MCC give similar rankings.
Multi-class MCC is not the average of binary MCCs
Extending MCC to $K$ classes requires the generalized $R_K$ formula computed from the entire confusion matrix, not an average of one-vs-rest binary MCCs. The generalization treats the confusion matrix as a contingency table and computes a multivariate correlation.
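A sketch of the generalized computation, following the standard contingency-table form of $R_K$ (correct-count and marginal sums over the full $K \times K$ matrix); the function name is illustrative.

```python
import math

def multiclass_mcc(C):
    """Generalized R_K coefficient from a K x K confusion matrix (rows = true)."""
    n = sum(sum(row) for row in C)
    correct = sum(C[k][k] for k in range(len(C)))
    t = [sum(row) for row in C]                                       # true counts
    p = [sum(C[i][j] for i in range(len(C))) for j in range(len(C))]  # predicted counts
    num = correct * n - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt((n * n - sum(pk * pk for pk in p)) *
                    (n * n - sum(tk * tk for tk in t)))
    return num / den if den else 0.0

# Sanity check: for K = 2 this reduces to the binary MCC
# (≈ 0.47 on the screening example from this section).
value = multiclass_mcc([[80, 20], [200, 9700]])
```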
Summary
- The confusion matrix is the complete classification record; every metric derives from it
- Macro averaging weights classes equally; micro averaging weights by class frequency
- MCC uses all four quadrants and is the best single metric for imbalanced binary classification
- Cohen's kappa corrects for chance agreement
- Cost matrices make error asymmetry explicit
- Always check whether your library puts true labels in rows or columns
Exercises
Problem
A 3-class confusion matrix is:
|  | Pred A | Pred B | Pred C |
|---|---|---|---|
| True A | 40 | 5 | 5 |
| True B | 10 | 30 | 10 |
| True C | 0 | 5 | 45 |
Compute the macro-averaged precision.
Problem
Prove that for binary classification, micro-averaged precision equals accuracy.
References
Canonical:
- Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20(1), 1960
- Matthews, "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme," BBA 405(2), 1975
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Section 7.2 (metrics for classification)
- Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation," Journal of Machine Learning Technologies 2(1), 2011
Current:
- Chicco & Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy," BMC Genomics 21, 2020
- Sokolova & Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing and Management 45(4), 2009
Next Topics
From confusion matrices, the natural continuations are:
- Cross-validation theory: how to generate reliable confusion matrices
- Hypothesis testing for ML: testing whether one classifier is statistically better than another
Last reviewed: April 2026