
ML Methods

Anomaly Detection

Methods for identifying data points that deviate from expected patterns: isolation forests, one-class SVMs, autoencoders, statistical distances, and why the absence of anomaly labels makes this problem structurally harder than classification.


Why This Matters

Anomaly detection appears in fraud detection, manufacturing quality control, network intrusion detection, and medical diagnostics. Unlike standard classification, you typically have abundant normal data and few or zero labeled anomalies. This asymmetry changes the problem structure: you cannot simply train a binary classifier. Evaluating anomaly detectors requires careful attention to model evaluation metrics like precision-recall curves rather than simple accuracy.

The core challenge is defining "normal" precisely enough that deviations from it are meaningful, without overfitting to the training distribution so tightly that novel but legitimate data gets flagged.

Mental Model

Normal data occupies some region of feature space. Anomalies live outside that region. Every anomaly detection method is, at bottom, a way of estimating the boundary of the normal region or the density within it. Points outside the boundary or in low-density areas are declared anomalous.

Core Definitions

Definition

Anomaly Score

A function $s: \mathcal{X} \to \mathbb{R}$ assigning a score to each point $x$, where higher scores indicate greater abnormality. A point is declared anomalous if $s(x) > \tau$ for some threshold $\tau$.

Definition

Contamination Rate

The fraction of training data assumed to be anomalous. Most methods require specifying the contamination rate $\alpha$ or a threshold $\tau$. Choosing $\alpha$ without labeled anomalies is a genuine unsolved problem in practice.

Isolation Forest

The key insight: anomalies are few and different, so they are easier to isolate via random partitioning.

An isolation tree recursively selects a random feature and a random split value within the feature range. Each split isolates some points. Anomalies, being sparse and distant from the bulk of data, require fewer splits to isolate.
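The splitting process above can be sketched in a few lines of pure Python. This is a toy single-tree version (`path_length` is a helper name introduced here); real isolation forests subsample the data and average path lengths over an ensemble of trees:

```python
import random

def path_length(point, data, depth=0, max_depth=50):
    """Path length from the root to the node that isolates `point`.

    At each node: pick a random feature and a random split value within
    that feature's observed range, then recurse into the side containing
    `point`. Recursion stops once the point is alone in its partition.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                       # no split possible on this feature
        return depth
    split = random.uniform(lo, hi)
    # keep only the side of the split that contains `point`
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, side, depth + 1, max_depth)

random.seed(0)
cluster = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
outlier = [10.0, 10.0]
data = cluster + [outlier]

def avg_path(p, trees=100):
    return sum(path_length(p, data) for _ in range(trees)) / trees

print(avg_path(outlier), avg_path(cluster[0]))  # outlier isolates far sooner
```

Averaged over 100 random trees, the distant point is isolated in a handful of splits while a point inside the cluster takes roughly $\log_2 n$ splits to peel away.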

Proposition

Expected Path Length for Anomalies

Statement

Let $h(x)$ be the path length from root to the leaf containing $x$ in a single isolation tree. For a dataset of $n$ points, the expected path length of a normal point converges to $O(\log n)$, while an isolated anomaly has expected path length $O(1)$ when it is separated from the bulk of the data by at least one feature.

Intuition

Random splits are unlikely to separate a point that sits in a dense cluster, requiring many cuts to peel it away. A point sitting alone in feature space gets isolated on the first relevant split.

Proof Sketch

The expected path length of a random point in a binary search tree of $n$ items is $2H(n-1) - 2(n-1)/n$, where $H$ is the harmonic number, giving $O(\log n)$. A point far from all others in at least one feature will be split off by the first random partition that selects that feature, giving $O(d)$ expected splits at worst, which is $O(1)$ with respect to $n$.

Why It Matters

This explains why isolation forests scale well: the algorithm runs in $O(n \log n)$ per tree and does not require distance computations between all pairs of points.

Failure Mode

When anomalies are not isolated in any single feature but only anomalous in a joint distribution (e.g., individually normal features whose combination is rare), isolation forests perform poorly. Axis-aligned splits cannot detect such anomalies efficiently.

The anomaly score is normalized by the expected path length:

$$s(x) = 2^{-E[h(x)]/c(n)}$$

where $c(n) = 2H(n-1) - 2(n-1)/n$ is the average path length in a binary search tree of $n$ nodes.
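The normalization is easy to compute directly. A small sketch (`harmonic`, `c`, and `iforest_score` are helper names introduced here): a point whose mean path length equals $c(n)$ scores exactly $0.5$, and shorter paths push the score toward $1$.

```python
def harmonic(m):
    """H(m) = 1 + 1/2 + ... + 1/m, computed exactly for clarity."""
    return sum(1.0 / i for i in range(1, m + 1))

def c(n):
    """Average BST path length 2H(n-1) - 2(n-1)/n, the score normalizer."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def iforest_score(mean_path, n):
    """Isolation forest anomaly score s(x) = 2^(-E[h(x)]/c(n)), in (0, 1]."""
    return 2.0 ** (-mean_path / c(n))

# A quickly isolated point scores near 1; a deeply buried one near 0.
print(iforest_score(1, 256), iforest_score(c(256), 256), iforest_score(20, 256))
```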

One-Class SVM

A one-class SVM finds the smallest hypersphere (or halfspace in feature space) enclosing most of the training data. The primal formulation minimizes the volume of the enclosing region while allowing a fraction $\nu$ of points to fall outside.

The optimization problem in kernel feature space:

$$\min_{w, \rho, \xi}\; \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n}\sum_{i=1}^n \xi_i$$

subject to $w \cdot \Phi(x_i) \geq \rho - \xi_i$ and $\xi_i \geq 0$.

The parameter $\nu$ upper-bounds the fraction of outliers and lower-bounds the fraction of support vectors. Choosing $\nu$ requires prior knowledge of the contamination rate, which is rarely available.
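The $\nu$-property can be checked empirically, assuming scikit-learn is available (the class and attribute names below follow scikit-learn's `OneClassSVM` API; this is an illustrative sketch, not a tuned detector):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # "normal" training data only

nu = 0.1
ocsvm = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(X)

# nu (approximately) upper-bounds the fraction of training points
# predicted as outliers (-1) and lower-bounds the support-vector fraction.
outlier_frac = float(np.mean(ocsvm.predict(X) == -1))
sv_frac = len(ocsvm.support_) / len(X)
print(outlier_frac, sv_frac)
```

With no labeled anomalies, there is nothing in this picture that tells you whether $\nu = 0.1$ was the right choice; that is precisely the contamination-rate problem.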

Autoencoder-Based Detection

Train an autoencoder on normal data only, using backpropagation to minimize reconstruction loss. The reconstruction error $\|x - \hat{x}\|^2$ serves as the anomaly score. Normal data should reconstruct well; anomalous data, never seen during training, should reconstruct poorly.

This works when the autoencoder bottleneck captures the manifold of normal data. It fails when the autoencoder is too powerful (reconstructs everything, including anomalies) or when normal data has high variance (reconstruction error is noisy even for normal points).
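A linear autoencoder with a $k$-dimensional bottleneck learns the same subspace as PCA, which makes the mechanism easy to demonstrate without a deep-learning framework. A minimal sketch on synthetic data whose "normal manifold" is a 1-D line in 3-D space (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Normal data lies near a 1-D line inside 3-D space (a simple "manifold").
t = rng.normal(size=(300, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))

# Linear autoencoder = PCA: encode by projecting onto the top-k principal
# directions, decode by projecting back into the original space.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:1].T                      # bottleneck of size k = 1

def recon_error(x):
    z = (x - mu) @ V              # encode
    x_hat = mu + z @ V.T          # decode
    return float(np.sum((x - x_hat) ** 2))

normal_err = recon_error(X[0])                        # on the manifold
anomaly_err = recon_error(np.array([3.0, -3.0, 3.0])) # off the manifold
print(normal_err, anomaly_err)
```

The off-manifold point cannot be represented in the bottleneck, so its reconstruction error is orders of magnitude larger than a normal point's.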

Statistical Methods

Definition

Mahalanobis Distance

For data with mean $\mu$ and covariance $\Sigma$:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

This measures how many "standard deviations" a point is from the center, accounting for correlations between features. Under a Gaussian model, $D_M^2(x)$ follows a $\chi^2_d$ distribution.

The Mahalanobis distance is optimal when data is truly Gaussian. It relies on the covariance structure and the eigenvalue decomposition of $\Sigma$. It fails for multimodal distributions, heavy-tailed distributions, and high-dimensional data where covariance estimation becomes unstable.

Z-scores ($z = (x - \mu)/\sigma$ per feature) are the univariate special case. They ignore feature correlations entirely.
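The difference is easy to see numerically: build two strongly correlated features, then score a point whose marginals are unremarkable but whose combination is rare (a small numpy sketch with synthetic data):

```python
import numpy as np

# Each marginal is standard normal, but the joint distribution is
# concentrated near the diagonal (correlation ~0.99).
rng = np.random.default_rng(0)
z1 = rng.normal(size=5000)
z2 = 0.99 * z1 + np.sqrt(1 - 0.99**2) * rng.normal(size=5000)
X = np.column_stack([z1, z2])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Individually mild features, jointly far off the diagonal:
x = np.array([1.5, -1.5])
z_scores = (x - mu) / X.std(axis=0)   # both well under 2 sigma
print(z_scores, mahalanobis(x))      # per-feature z-scores miss it; D_M does not
```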

Local Outlier Factor (LOF)

LOF compares the local density around a point to the local density around its neighbors. A point whose neighborhood is much sparser than its neighbors' neighborhoods is a local outlier.

$$\text{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(x)}$$

where $\text{lrd}_k(x)$ is the local reachability density based on the $k$ nearest neighbors. $\text{LOF}_k(x) \approx 1$ means density similar to the neighbors'; $\text{LOF}_k(x) \gg 1$ means anomalous.

LOF detects local anomalies that global methods miss. A point can be normal in one region and anomalous in another if densities vary across the data.
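A brute-force LOF can be written directly from the definition. This is an $O(n^2)$ sketch (`lof_scores` is a helper name introduced here); production implementations use spatial indexes for the neighbor search:

```python
import numpy as np

def lof_scores(X, k=3):
    """Local Outlier Factor for every row of X (brute-force sketch)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    knn = np.argsort(dists, axis=1)[:, :k]     # indices of k nearest neighbors
    k_dist = dists[np.arange(n), knn[:, -1]]   # distance to the k-th neighbor

    # local reachability density: inverse of the mean reachability distance,
    # where reach-dist(x, o) = max(k-dist(o), d(x, o))
    lrds = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[knn[i]], dists[i, knn[i]])
        lrds[i] = k / reach.sum()

    # LOF: mean neighbor density divided by own density
    return np.array([lrds[knn[i]].mean() / lrds[i] for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),   # one tight cluster
               [[5.0, 5.0]]])                      # one isolated point
scores = lof_scores(X, k=5)
print(scores[-1])   # far above 1: much sparser than its neighborhood
```

Cluster members come out near 1 because their neighborhoods match their neighbors' neighborhoods; the isolated point's local density is a small fraction of its neighbors', so its ratio is large.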

Common Confusions

Watch Out

Anomaly detection is not binary classification

Binary classification requires labeled examples of both classes. Anomaly detection assumes you have mostly (or only) normal data. Training a classifier on a heavily imbalanced dataset with a handful of known anomalies is a different problem from unsupervised anomaly detection, and mixing the two frameworks leads to poor results.

Watch Out

Low density does not always mean anomalous

In high dimensions, most of the probability mass lies in a thin shell far from the mode. The mode of a high-dimensional Gaussian is the point of highest density, yet almost no samples fall near it: typical samples live in the shell, at density far below the mode's. Anomaly scores based on raw density estimation can therefore be badly calibrated in high dimensions, ranking perfectly typical shell points as more anomalous than regions no sample ever visits. This is why distance-based and isolation-based methods often outperform direct density estimation.
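A quick simulation makes the shell phenomenon concrete (a numpy sketch; $d$ and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
X = rng.normal(size=(2000, d))     # standard Gaussian in d dimensions

# Distances from the mode (the origin) concentrate tightly around
# sqrt(d) ~ 31.6: not a single one of 2000 samples lands anywhere
# near the highest-density point.
norms = np.linalg.norm(X, axis=1)
print(norms.mean(), norms.std(), norms.min())
```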

Watch Out

Threshold selection is part of the problem

Every anomaly detector produces a score; converting it to a binary decision requires a threshold. Choosing this threshold is often harder than building the detector itself, because you have no labeled anomalies to validate against. Reporting precision-recall curves across thresholds is more honest than reporting a single accuracy number.
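Sweeping the threshold and reporting the whole curve takes only a few lines (a self-contained sketch; the scores and hindsight labels below are toy values):

```python
def pr_curve(scores, labels, thresholds):
    """Precision and recall of the rule `score > t` for each threshold t.

    labels: 1 for true anomaly, 0 for normal. These are only available in
    hindsight; in deployment there is usually nothing to validate against.
    """
    positives = sum(labels)
    curve = []
    for t in thresholds:
        flagged = [y for s, y in zip(scores, labels) if s > t]
        tp = sum(flagged)
        precision = tp / len(flagged) if flagged else 1.0
        recall = tp / positives
        curve.append((t, precision, recall))
    return curve

# Anomalies tend to score higher, with some overlap.
scores = [0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.9]
labels = [0,   0,   0,    1,   0,    1,   1,   1]
for t, p, r in pr_curve(scores, labels, [0.25, 0.5, 0.8]):
    print(f"t={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold trades precision for recall; the curve exposes that trade-off, which a single accuracy number hides.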

Key Takeaways

  • Anomaly detection is structurally harder than classification: you lack labels for the class you care about
  • Isolation forest exploits the fact that anomalies are easy to isolate by random partitioning
  • One-class SVM finds a boundary around normal data in kernel space
  • Autoencoder reconstruction error works when the bottleneck captures the normal manifold
  • Mahalanobis distance is optimal for Gaussian data, fragile otherwise
  • LOF detects local anomalies by comparing neighborhood densities
  • Threshold selection is an unsolved problem without labeled anomalies

Exercises

ExerciseCore

Problem

You have a dataset of 10,000 normal network traffic records and want to detect intrusions. An isolation forest with 100 trees gives anomaly scores. The top 1% of scores (100 points) are flagged. You later discover that 50 of these are true intrusions and 50 are false positives. What is the precision? If there were actually 200 intrusions total, what is the recall?

ExerciseAdvanced

Problem

Explain why Mahalanobis distance fails for anomaly detection when the data distribution has two well-separated Gaussian clusters. What happens to a point that lies exactly between the two clusters?

References

Canonical:

  • Liu, Ting, Zhou, "Isolation Forest" (ICDM 2008)
  • Schölkopf et al., "Estimating the Support of a High-Dimensional Distribution" (Neural Computation, 2001)

Current:

  • Ruff et al., "A Unifying Review of Deep Anomaly Detection" (2021), Sections 2-4

  • Chandola, Banerjee, Kumar, "Anomaly Detection: A Survey" (ACM Computing Surveys, 2009)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
