
ML Methods

Anomaly Detection

Methods for identifying data points that deviate from expected patterns: isolation forests, one-class SVMs, autoencoders, statistical distances, and why the absence of anomaly labels makes this problem structurally harder than classification.


Why This Matters

Anomaly detection appears in fraud detection, manufacturing quality control, network intrusion detection, and medical diagnostics. Unlike standard classification, you typically have abundant normal data and few or zero labeled anomalies. This asymmetry changes the problem structure: you cannot simply train a binary classifier. Evaluating anomaly detectors requires careful attention to model evaluation metrics like precision-recall curves rather than simple accuracy.

The core challenge is defining "normal" precisely enough that deviations from it are meaningful, without overfitting to the training distribution so tightly that novel but legitimate data gets flagged.

Mental Model

Normal data occupies some region of feature space. Anomalies live outside that region. Every anomaly detection method is, at bottom, a way of estimating the boundary of the normal region or the density within it. Points outside the boundary or in low-density areas are declared anomalous.

Core Definitions

Definition

Anomaly Score

A function $s: \mathcal{X} \to \mathbb{R}$ assigning a score to each point $x$, where higher scores indicate greater abnormality. A point is declared anomalous if $s(x) > \tau$ for some threshold $\tau$.

Definition

Contamination Rate

The fraction of training data assumed to be anomalous. Most methods require specifying the contamination rate $\alpha$ or a threshold $\tau$. Choosing $\alpha$ without labeled anomalies is a genuine unsolved problem in practice.

Isolation Forest

The key insight: anomalies are few and different, so they are easier to isolate via random partitioning.

An isolation tree recursively selects a random feature and a random split value within the feature range. Each split isolates some points. Anomalies, being sparse and distant from the bulk of data, require fewer splits to isolate.
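The splitting process above can be sketched in a few lines of pure Python. This is a toy single-tree version (`path_length` is a helper name introduced here); real isolation forests subsample the data and average path lengths over an ensemble of trees:

```python
import random

def path_length(point, data, depth=0, max_depth=50):
    """Path length from the root to the node that isolates `point`.

    At each node: pick a random feature and a random split value within
    that feature's observed range, then recurse into the side containing
    `point`. Recursion stops once the point is alone in its partition.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                       # no split possible on this feature
        return depth
    split = random.uniform(lo, hi)
    # keep only the side of the split that contains `point`
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, side, depth + 1, max_depth)

random.seed(0)
cluster = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
outlier = [10.0, 10.0]
data = cluster + [outlier]

def avg_path(p, trees=100):
    return sum(path_length(p, data) for _ in range(trees)) / trees

print(avg_path(outlier), avg_path(cluster[0]))  # outlier isolates far sooner
```

Averaged over 100 random trees, the distant point is isolated in a handful of splits while a point inside the cluster takes roughly $\log_2 n$ splits to peel away.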

Proposition

Expected Path Length for Anomalies

Statement

Let $h(x)$ be the path length from root to the leaf containing $x$ in a single isolation tree. For a dataset of $n$ points, the expected path length of a normal point converges to $O(\log n)$, while an isolated anomaly has expected path length $O(1)$ when it is separated from the bulk of the data by at least one feature.

Intuition

Random splits are unlikely to separate a point that sits in a dense cluster, requiring many cuts to peel it away. A point sitting alone in feature space gets isolated on the first relevant split.

Proof Sketch

The expected path length of a random point in a binary search tree of $n$ items is $2H(n-1) - 2(n-1)/n$, where $H$ is the harmonic number, giving $O(\log n)$. A point far from all others in at least one feature will be split off by the first random partition that selects that feature, giving $O(d)$ expected splits at worst, which is $O(1)$ with respect to $n$.

Why It Matters

This explains why isolation forests scale well: the algorithm runs in $O(n \log n)$ per tree and does not require distance computations between all pairs of points.

Failure Mode

When anomalies are not isolated in any single feature but only anomalous in a joint distribution (e.g., individually normal features whose combination is rare), isolation forests perform poorly. Axis-aligned splits cannot detect such anomalies efficiently.

The anomaly score is normalized by the expected path length:

$$s(x) = 2^{-E[h(x)]/c(n)}$$

where $c(n) = 2H(n-1) - 2(n-1)/n$ is the average path length in a binary search tree of $n$ nodes.
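The normalization is easy to compute directly. A small sketch (`harmonic`, `c`, and `iforest_score` are helper names introduced here): a point whose mean path length equals $c(n)$ scores exactly $0.5$, and shorter paths push the score toward $1$.

```python
def harmonic(m):
    """H(m) = 1 + 1/2 + ... + 1/m, computed exactly for clarity."""
    return sum(1.0 / i for i in range(1, m + 1))

def c(n):
    """Average BST path length 2H(n-1) - 2(n-1)/n, the score normalizer."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def iforest_score(mean_path, n):
    """Isolation forest anomaly score s(x) = 2^(-E[h(x)]/c(n)), in (0, 1]."""
    return 2.0 ** (-mean_path / c(n))

# A quickly isolated point scores near 1; a deeply buried one near 0.
print(iforest_score(1, 256), iforest_score(c(256), 256), iforest_score(20, 256))
```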

One-Class SVM

A one-class SVM finds the smallest hypersphere (or halfspace in feature space) enclosing most of the training data. The primal formulation minimizes the volume of the enclosing region while allowing a fraction $\nu$ of points to fall outside.

The optimization problem in kernel feature space:

$$\min_{w, \rho, \xi}\; \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n}\sum_{i=1}^n \xi_i$$

subject to $w \cdot \Phi(x_i) \geq \rho - \xi_i$ and $\xi_i \geq 0$.

The parameter $\nu$ upper-bounds the fraction of outliers and lower-bounds the fraction of support vectors. Choosing $\nu$ requires prior knowledge of the contamination rate, which is rarely available.
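The $\nu$-property can be checked empirically, assuming scikit-learn is available (the class and attribute names below follow scikit-learn's `OneClassSVM` API; this is an illustrative sketch, not a tuned detector):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # "normal" training data only

nu = 0.1
ocsvm = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(X)

# nu (approximately) upper-bounds the fraction of training points
# predicted as outliers (-1) and lower-bounds the support-vector fraction.
outlier_frac = float(np.mean(ocsvm.predict(X) == -1))
sv_frac = len(ocsvm.support_) / len(X)
print(outlier_frac, sv_frac)
```

With no labeled anomalies, there is nothing in this picture that tells you whether $\nu = 0.1$ was the right choice; that is precisely the contamination-rate problem.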

Autoencoder-Based Detection

Train an autoencoder on normal data only, using backpropagation to minimize reconstruction loss. The reconstruction error $\|x - \hat{x}\|^2$ serves as the anomaly score. Normal data should reconstruct well; anomalous data, never seen during training, should reconstruct poorly.

This works when the autoencoder bottleneck captures the manifold of normal data. It fails when the autoencoder is too powerful (reconstructs everything, including anomalies) or when normal data has high variance (reconstruction error is noisy even for normal points).
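A linear autoencoder with a $k$-dimensional bottleneck learns the same subspace as PCA, which makes the mechanism easy to demonstrate without a deep-learning framework. A minimal sketch on synthetic data whose "normal manifold" is a 1-D line in 3-D space (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Normal data lies near a 1-D line inside 3-D space (a simple "manifold").
t = rng.normal(size=(300, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))

# Linear autoencoder = PCA: encode by projecting onto the top-k principal
# directions, decode by projecting back into the original space.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:1].T                      # bottleneck of size k = 1

def recon_error(x):
    z = (x - mu) @ V              # encode
    x_hat = mu + z @ V.T          # decode
    return float(np.sum((x - x_hat) ** 2))

normal_err = recon_error(X[0])                        # on the manifold
anomaly_err = recon_error(np.array([3.0, -3.0, 3.0])) # off the manifold
print(normal_err, anomaly_err)
```

The off-manifold point cannot be represented in the bottleneck, so its reconstruction error is orders of magnitude larger than a normal point's.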

Statistical Methods

Definition

Mahalanobis Distance

For data with mean $\mu$ and covariance $\Sigma$:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

This measures how many "standard deviations" a point is from the center, accounting for correlations between features. Under a Gaussian model, $D_M^2(x)$ follows a $\chi^2_d$ distribution.

The Mahalanobis distance is optimal when data is truly Gaussian. It relies on the covariance structure and the eigenvalue decomposition of $\Sigma$. It fails for multimodal distributions, heavy-tailed distributions, and high-dimensional data where covariance estimation becomes unstable.

Z-scores ($z = (x - \mu)/\sigma$ per feature) are the univariate special case. They ignore feature correlations entirely.
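The difference is easy to see numerically: build two strongly correlated features, then score a point whose marginals are unremarkable but whose combination is rare (a small numpy sketch with synthetic data):

```python
import numpy as np

# Each marginal is standard normal, but the joint distribution is
# concentrated near the diagonal (correlation ~0.99).
rng = np.random.default_rng(0)
z1 = rng.normal(size=5000)
z2 = 0.99 * z1 + np.sqrt(1 - 0.99**2) * rng.normal(size=5000)
X = np.column_stack([z1, z2])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Individually mild features, jointly far off the diagonal:
x = np.array([1.5, -1.5])
z_scores = (x - mu) / X.std(axis=0)   # both well under 2 sigma
print(z_scores, mahalanobis(x))      # per-feature z-scores miss it; D_M does not
```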

Local Outlier Factor (LOF)

LOF compares the local density around a point to the local density around its neighbors. A point whose neighborhood is much sparser than its neighbors' neighborhoods is a local outlier.

$$\text{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(x)}$$

where $\text{lrd}_k(x)$ is the local reachability density based on the $k$ nearest neighbors. $\text{LOF}_k(x) \approx 1$ means density similar to the neighbors'; $\text{LOF}_k(x) \gg 1$ means anomalous.

LOF detects local anomalies that global methods miss. A point can be normal in one region and anomalous in another if densities vary across the data.
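A brute-force LOF can be written directly from the definition. This is an $O(n^2)$ sketch (`lof_scores` is a helper name introduced here); production implementations use spatial indexes for the neighbor search:

```python
import numpy as np

def lof_scores(X, k=3):
    """Local Outlier Factor for every row of X (brute-force sketch)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    knn = np.argsort(dists, axis=1)[:, :k]     # indices of k nearest neighbors
    k_dist = dists[np.arange(n), knn[:, -1]]   # distance to the k-th neighbor

    # local reachability density: inverse of the mean reachability distance,
    # where reach-dist(x, o) = max(k-dist(o), d(x, o))
    lrds = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[knn[i]], dists[i, knn[i]])
        lrds[i] = k / reach.sum()

    # LOF: mean neighbor density divided by own density
    return np.array([lrds[knn[i]].mean() / lrds[i] for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),   # one tight cluster
               [[5.0, 5.0]]])                      # one isolated point
scores = lof_scores(X, k=5)
print(scores[-1])   # far above 1: much sparser than its neighborhood
```

Cluster members come out near 1 because their neighborhoods match their neighbors' neighborhoods; the isolated point's local density is a small fraction of its neighbors', so its ratio is large.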

Common Confusions

Watch Out

Anomaly detection is not binary classification

Binary classification requires labeled examples of both classes. Anomaly detection assumes you have mostly (or only) normal data. Training a classifier on a heavily imbalanced dataset with a handful of known anomalies is a different problem from unsupervised anomaly detection, and mixing the two frameworks leads to poor results.

Watch Out

Low density does not always mean anomalous

In high dimensions, most of the probability mass lies in a thin shell far from the mode. The mode of a high-dimensional Gaussian is the point of highest density, yet almost no samples fall near it: typical samples live in the shell, at density far below the mode's. Anomaly scores based on raw density estimation can therefore be badly calibrated in high dimensions, ranking perfectly typical shell points as more anomalous than regions no sample ever visits. This is why distance-based and isolation-based methods often outperform direct density estimation.
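A quick simulation makes the shell phenomenon concrete (a numpy sketch; $d$ and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
X = rng.normal(size=(2000, d))     # standard Gaussian in d dimensions

# Distances from the mode (the origin) concentrate tightly around
# sqrt(d) ~ 31.6: not a single one of 2000 samples lands anywhere
# near the highest-density point.
norms = np.linalg.norm(X, axis=1)
print(norms.mean(), norms.std(), norms.min())
```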

Watch Out

Threshold selection is part of the problem

Every anomaly detector produces a score; converting it to a binary decision requires a threshold. Choosing this threshold is often harder than building the detector itself, because you have no labeled anomalies to validate against. Reporting precision-recall curves across thresholds is more honest than reporting a single accuracy number.
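Sweeping the threshold and reporting the whole curve takes only a few lines (a self-contained sketch; the scores and hindsight labels below are toy values):

```python
def pr_curve(scores, labels, thresholds):
    """Precision and recall of the rule `score > t` for each threshold t.

    labels: 1 for true anomaly, 0 for normal. These are only available in
    hindsight; in deployment there is usually nothing to validate against.
    """
    positives = sum(labels)
    curve = []
    for t in thresholds:
        flagged = [y for s, y in zip(scores, labels) if s > t]
        tp = sum(flagged)
        precision = tp / len(flagged) if flagged else 1.0
        recall = tp / positives
        curve.append((t, precision, recall))
    return curve

# Anomalies tend to score higher, with some overlap.
scores = [0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.9]
labels = [0,   0,   0,    1,   0,    1,   1,   1]
for t, p, r in pr_curve(scores, labels, [0.25, 0.5, 0.8]):
    print(f"t={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold trades precision for recall; the curve exposes that trade-off, which a single accuracy number hides.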

Key Takeaways

  • Anomaly detection is structurally harder than classification: you lack labels for the class you care about
  • Isolation forest exploits the fact that anomalies are easy to isolate by random partitioning
  • One-class SVM finds a boundary around normal data in kernel space
  • Autoencoder reconstruction error works when the bottleneck captures the normal manifold
  • Mahalanobis distance is optimal for Gaussian data, fragile otherwise
  • LOF detects local anomalies by comparing neighborhood densities
  • Threshold selection is an unsolved problem without labeled anomalies

Exercises

ExerciseCore

Problem

You have a dataset of 10,000 normal network traffic records and want to detect intrusions. An isolation forest with 100 trees gives anomaly scores. The top 1% of scores (100 points) are flagged. You later discover that 50 of these are true intrusions and 50 are false positives. What is the precision? If there were actually 200 intrusions total, what is the recall?

ExerciseAdvanced

Problem

Explain why Mahalanobis distance fails for anomaly detection when the data distribution has two well-separated Gaussian clusters. What happens to a point that lies exactly between the two clusters?

References

Canonical:

  • Liu, Ting, Zhou, "Isolation Forest" (ICDM 2008)
  • Schölkopf et al., "Estimating the Support of a High-Dimensional Distribution" (Neural Computation, 2001)

Current:

  • Ruff et al., "A Unifying Review of Deep Anomaly Detection" (2021), Sections 2-4

  • Chandola, Banerjee, Kumar, "Anomaly Detection: A Survey" (ACM Computing Surveys, 2009)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
