AI Safety
Out-of-Distribution Detection
Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.
Why This Matters
Deployed models encounter inputs they were not trained on. A medical imaging classifier trained on X-rays will still produce a confident prediction when given a photo of a cat. A fraud detection model trained on 2020 data will silently fail on novel fraud patterns in 2024. OOD detection is the problem of recognizing when a model's input falls outside its training distribution, so you can abstain or escalate rather than trust a meaningless prediction.
The Core Problem
Let $p_{\text{in}}$ be the training distribution and $p_{\text{out}}$ be any distribution not seen during training. Given a new input $x$, we want a scoring function $s(x)$ such that $s(x)$ is high for $x \sim p_{\text{in}}$ and low for $x \sim p_{\text{out}}$.
Out-of-Distribution Input
An input $x$ is out-of-distribution with respect to a model trained on data from $p_{\text{in}}$ if $x$ is drawn from some $p_{\text{out}}$ where $p_{\text{out}} \neq p_{\text{in}}$. The boundary between "in" and "out" is task-dependent and requires a decision threshold on the scoring function.
Why Softmax Confidence Fails
The most naive OOD detector uses the maximum softmax probability (MSP): $s_{\text{MSP}}(x) = \max_k \mathrm{softmax}(f(x))_k$. This fails badly.
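The baseline is easy to state in code. The sketch below (function names are illustrative, not from any library) computes the MSP score from raw logits:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Maximum softmax probability: the naive OOD score."""
    return softmax(logits).max(axis=-1)

# A peaked logit vector versus a perfectly flat one
print(msp_score(np.array([8.0, 1.0, 1.0])))  # ≈ 0.998
print(msp_score(np.array([1.0, 1.0, 1.0])))  # ≈ 0.333 (maximally uncertain)
```

Note that the score only sees the logits, not where the input came from; this is exactly the weakness the next theorem makes precise.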
Softmax Overconfidence on OOD Inputs
Statement
For a neural network $f$ with softmax output, there exist inputs $x$ arbitrarily far from the training distribution such that $\max_k \mathrm{softmax}(f(x))_k$ is arbitrarily close to 1.
Intuition
Softmax normalizes logits to sum to 1. If one logit is much larger than the others, softmax assigns near-1 probability to that class regardless of whether the input is meaningful. Deep networks produce large logit norms for high-norm inputs, and OOD inputs often have unusual norms.
Proof Sketch
For any $\epsilon > 0$, consider inputs $x$ whose logits $f(x)$ have one coordinate dominating the rest. Then $\max_k \mathrm{softmax}(f(x))_k \to 1$ as the gap to the dominant logit grows. ReLU networks produce unbounded outputs on unbounded inputs, so such $x$ always exist outside the training support.
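The mechanism in the proof sketch can be checked directly: scaling any logit vector with a unique argmax drives the MSP toward 1, regardless of what input produced those logits. A minimal demonstration:

```python
import numpy as np

def msp(logits):
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return p.max()

# Multiplying the logits by a growing scale alpha saturates the softmax:
# the MSP climbs toward 1 even though the "input" is meaningless.
base = np.array([2.0, 1.0, 0.5])
for alpha in [1, 5, 25]:
    print(alpha, msp(alpha * base))
```

At scale 25 the gap between the top two logits is 25 units, so the runner-up class receives probability on the order of $e^{-25}$.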
Why It Matters
This means you cannot trust softmax probabilities as confidence scores for deployment safety. A model saying "95% cat" on a chest X-ray is not useful.
Failure Mode
Any OOD detection method that relies solely on softmax confidence will miss OOD inputs that happen to produce large logits in one class. This includes adversarial OOD examples constructed to maximize softmax confidence.
Detection Methods
ODIN: Temperature Scaling + Input Perturbation
ODIN (Liang et al., 2018) improves MSP with two tricks. First, divide the logits by a temperature $T$ before the softmax to spread out the probability mass. Second, add a small perturbation to the input in the direction that increases the maximum softmax score:

$\tilde{x} = x - \epsilon\,\mathrm{sign}\!\left(-\nabla_x \log S_{\hat{y}}(x; T)\right)$

where $S_k(x; T) = \mathrm{softmax}(f(x)/T)_k$ and $\hat{y} = \arg\max_k S_k(x; T)$. The score is then $s_{\text{ODIN}}(x) = \max_k S_k(\tilde{x}; T)$.
The perturbation amplifies the gap between in-distribution and OOD inputs because in-distribution inputs respond more coherently to gradient-based perturbation.
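ODIN's perturbation requires a gradient through the network. As a minimal sketch under a strong simplifying assumption, the toy below uses a linear classifier $f(x) = Wx$, where the gradient of the log-softmax score is available in closed form; the function name and the specific $\epsilon$ and $T$ values are illustrative choices, not the paper's official code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def odin_score(x, W, T=1000.0, eps=0.0014):
    """ODIN score for a toy linear classifier f(x) = W @ x.

    For this linear model, grad_x log S_yhat(x; T) equals
    (1/T) * W.T @ (onehot(yhat) - p), so no autograd is needed.
    For a real network you would backpropagate instead.
    """
    p = softmax(W @ x / T)
    yhat = int(np.argmax(p))
    onehot = np.zeros_like(p)
    onehot[yhat] = 1.0
    grad = (W.T @ (onehot - p)) / T       # gradient of the log-softmax score
    x_pert = x - eps * np.sign(-grad)     # nudge input toward a higher score
    return softmax(W @ x_pert / T).max()
```

The in-distribution/OOD gap comes from comparing this score across inputs; $\epsilon$ and $T$ are hyperparameters that ODIN tunes on a validation set, which is its main practical cost relative to energy scoring.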
Energy-Based Detection
Energy Score
Given logits $f_k(x)$ for classes $k = 1, \dots, K$ and temperature $T$, the energy score is:

$E(x; T) = -T \log \sum_{k=1}^{K} e^{f_k(x)/T}$
Lower energy (more negative) indicates in-distribution. This is the negative log of the partition function of the Gibbs distribution induced by the logits.
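The energy score is a one-line LogSumExp over the logits. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """E(x; T) = -T * log sum_k exp(f_k(x) / T). Lower = more in-distribution."""
    z = np.asarray(logits) / T
    m = z.max(axis=-1)  # stabilize the LogSumExp
    return -T * (m + np.log(np.exp(z - m[..., None]).sum(axis=-1)))

# Large, well-separated logits (in-distribution-like) give lower energy
# than small, flat logits (OOD-like).
print(energy_score(np.array([10.0, 2.0, 1.0])))  # ≈ -10.0
print(energy_score(np.array([0.5, 0.4, 0.3])))   # ≈ -1.5
```

Because LogSumExp is dominated by the largest logit but still accumulates the others, the energy uses strictly more information than the MSP's single max.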
Energy Score Separates In- and Out-Distribution
Statement
Under a Gibbs interpretation of the softmax classifier, the expected energy $\mathbb{E}_{x \sim p_{\text{in}}}[E(x)]$ is lower (more negative) than $\mathbb{E}_{x \sim p_{\text{out}}}[E(x)]$ when in-distribution inputs produce larger total logit magnitude than OOD inputs.
Intuition
Energy aggregates all logits via LogSumExp rather than taking only the max. In-distribution inputs activate learned features strongly, producing large logits across relevant classes. OOD inputs produce smaller or more uniform logits, yielding higher energy.
Proof Sketch
The energy is a monotone decreasing function of the total logit scale. Cross-entropy training pushes in-distribution logits to be large and well-separated. OOD inputs, lacking the trained features, produce smaller logit norms on average.
Why It Matters
Energy scoring is a drop-in replacement for MSP that requires no retraining, no hyperparameters (unlike ODIN), and consistently outperforms MSP across benchmarks.
Failure Mode
Fails when OOD inputs happen to strongly activate learned features. For example, a model trained on CIFAR-10 may assign low energy to SVHN digits because digit-like features are present in both distributions.
Mahalanobis Distance in Feature Space
Fit a class-conditional Gaussian $\mathcal{N}(\mu_k, \Sigma)$ to the penultimate-layer features of the training data (tied covariance $\Sigma$ across classes). The OOD score for input $x$ with feature vector $z = h(x)$ is:

$s(x) = -\min_{k}\,(z - \mu_k)^\top \Sigma^{-1} (z - \mu_k)$
More negative values indicate OOD. This works because in-distribution features cluster near class means while OOD features fall in low-density regions of feature space.
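The fit-then-score pipeline can be sketched in a few lines of NumPy (function names and the small ridge term added before inversion are illustrative choices, not from the original paper's code):

```python
import numpy as np

def fit_gaussians(features, labels, num_classes):
    """Class means and a single tied covariance over penultimate features."""
    d = features.shape[1]
    means = np.stack([features[labels == k].mean(axis=0)
                      for k in range(num_classes)])
    centered = features - means[labels]            # subtract each row's class mean
    cov = centered.T @ centered / len(features)    # tied MLE covariance
    # Small ridge keeps the inverse well-conditioned (an assumption, not canonical)
    return means, np.linalg.inv(cov + 1e-6 * np.eye(d))

def mahalanobis_score(z, means, cov_inv):
    """Negative squared Mahalanobis distance to the nearest class mean.

    More negative => farther from every class cluster => more likely OOD.
    """
    diffs = z - means                              # shape (K, d)
    d2 = np.einsum('kd,de,ke->k', diffs, cov_inv, diffs)
    return -d2.min()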
Typicality Test
Rather than asking "is this input likely?", ask "is this input typical?" A high-dimensional Gaussian concentrates on a thin shell, not at the mode. An input with very high or very low likelihood under the model $p_\theta$ is atypical and likely OOD.
The typicality score compares the log-likelihood of $x$ to the expected log-likelihood under $p_\theta$:

$s(x) = -\left|\log p_\theta(x) - \mathbb{E}_{x' \sim p_\theta}\!\left[\log p_\theta(x')\right]\right|$
This catches a failure mode of pure likelihood: generative models can assign higher likelihood to OOD data than in-distribution data (e.g., a CIFAR-10 model assigns higher likelihood to SVHN).
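The shell-versus-mode phenomenon is easy to verify for a standard Gaussian, where the expected log-likelihood has a closed form; this sketch (names are illustrative) shows that the mode, despite having the highest density of any point, is maximally atypical:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

# Standard d-dimensional Gaussian: log p(x) = -||x||^2/2 - (d/2) log(2*pi),
# and E[log p(x)] = -(d/2) * (1 + log(2*pi)) since E[||x||^2] = d.
def log_density(x):
    return -0.5 * (x @ x) - 0.5 * d * np.log(2 * np.pi)

expected_ll = -0.5 * d * (1 + np.log(2 * np.pi))

def typicality_score(x):
    """Smaller |log p(x) - E[log p]| means more typical."""
    return abs(log_density(x) - expected_ll)

typical = rng.standard_normal(d)   # lands near the radius-sqrt(d) shell
mode = np.zeros(d)                 # the density maximum, yet atypical
print(typicality_score(typical))   # small: log-likelihood near its expectation
print(typicality_score(mode))      # ≈ 500 (= d/2): maximally atypical
```

A sample from the model itself scores as typical, while the single highest-density point fails the test badly; this is the one-dimensional intuition ("likely = near the peak") breaking in high dimensions.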
Common Confusions
High likelihood does not mean in-distribution
Deep generative models (VAEs, flows) can assign higher likelihood to OOD data than to training data. This happens because likelihood conflates density with the volume of the typical set. In high dimensions, the typical set occupies a thin shell, and OOD inputs can fall in high-density but atypical regions.
OOD detection is not anomaly detection
Anomaly detection finds unusual points within the training distribution. OOD detection finds points outside the training distribution entirely. A rare but valid medical image is an anomaly; a photo of food is OOD. The methods and assumptions differ.
No free lunch for OOD detection
Every OOD detector makes assumptions about what OOD data looks like. A detector calibrated for far-OOD (random noise) may fail on near-OOD (a closely related but different dataset). You must evaluate against the specific OOD scenarios relevant to your deployment.
Key Takeaways
- Softmax confidence is not a reliable OOD detector; overconfidence on OOD inputs is the norm, not the exception
- Energy scoring uses all logits via LogSumExp and consistently beats MSP with zero additional cost
- ODIN adds temperature scaling and input perturbation for better separation
- Mahalanobis distance exploits the geometry of learned feature space
- Typicality tests address the failure of raw likelihood in high dimensions
- No single method works for all OOD types; evaluate on your specific deployment scenario
Exercises
Problem
A softmax classifier outputs a maximum probability close to 1 on an input. Can you conclude the input is in-distribution? Explain why or why not, and describe what additional information you would need.
Problem
Given $K$ class-conditional means $\mu_1, \dots, \mu_K$ and a shared covariance $\Sigma$ in a $d$-dimensional feature space, derive the computational cost of computing the Mahalanobis OOD score for a single input. How does this scale with $d$ and $K$?
References
Canonical:
- Hendrycks & Gimpel, "A Baseline for Detecting Misclassified and OOD Examples" (2017)
- Liang et al., "Enhancing The Reliability of OOD Image Detection" (ODIN, 2018)
Current:
- Liu et al., "Energy-based OOD Detection" (NeurIPS 2020)
- Lee et al., "A Simple Unified Framework for Detecting OOD Samples" (Mahalanobis, 2018)
- Nalisnick et al., "Do Deep Generative Models Know What They Don't Know?" (2019), Section 3
Next Topics
- Mechanistic interpretability: understanding what features a model uses
- Hallucination theory: when models confidently produce wrong outputs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Calibration and Uncertainty Quantification (Layer 3)
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)