
Statistical Foundations

Robust Statistics and M-Estimators

When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.


Why This Matters

Real data is messy. Sensor readings glitch. Labels get corrupted. Distributions have heavier tails than your model assumes. Classical estimators like the sample mean and ordinary least squares are exquisitely sensitive to such problems: a single outlier can drag the estimate arbitrarily far from the truth.

Robust statistics asks: can we build estimators that work reasonably well under mild assumptions, degrading gracefully when those assumptions are violated? M-estimators are the main tool for doing this. If you have ever used Huber loss, you have used an M-estimator.

Mental Model

Think of the sample mean as a tug-of-war: every data point pulls with equal force. An outlier at $10^6$ pulls just as hard as a normal observation. The median, by contrast, only cares about order, not magnitude, so outliers cannot exert unbounded influence.

M-estimators sit on a spectrum between these extremes. By choosing a loss function $\rho$ that grows less aggressively than the quadratic, you limit the pull of outliers while still using magnitude information from well-behaved observations.
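To make the tug-of-war concrete, here is a small numpy sketch (the data values are invented for illustration) showing one extreme point dominating the mean while barely moving the median:

```python
import numpy as np

# Hypothetical sample: five well-behaved observations plus one gross outlier.
clean = np.array([1.2, 0.8, 1.1, 0.9, 1.0])
contaminated = np.append(clean, 1e6)

# Every point pulls on the mean with equal force, so one outlier dominates:
print(np.mean(contaminated))    # ~166667.5
# The median responds only to order, not magnitude:
print(np.median(contaminated))  # 1.05
```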

Formal Setup and Notation

Let $X_1, \ldots, X_n$ be i.i.d. observations from some distribution $F$.

Definition

M-Estimator

An M-estimator of a location parameter $\theta$ is any value that minimizes:

$$\hat{\theta}_n = \arg\min_{\theta} \sum_{i=1}^{n} \rho(X_i - \theta)$$

where $\rho: \mathbb{R} \to \mathbb{R}$ is a loss function. Equivalently, $\hat{\theta}_n$ solves the estimating equation:

$$\sum_{i=1}^{n} \psi(X_i - \hat{\theta}_n) = 0$$

where $\psi = \rho'$ is the $\psi$-function (score function) of $\rho$; the influence function defined below turns out to be proportional to it.
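The estimating equation can be solved numerically. Below is a minimal, hypothetical numpy solver that assumes $\psi$ is nondecreasing (true for the quadratic and Huber cases), so bisection finds the root; with $\psi(r) = r$ it recovers the sample mean:

```python
import numpy as np

def m_estimate(x, psi, tol=1e-10):
    """Solve sum(psi(x_i - theta)) = 0 for theta by bisection.

    Assumes psi is nondecreasing, so the left side is nonincreasing
    in theta and changes sign on [min(x), max(x)].
    """
    lo, hi = float(np.min(x)), float(np.max(x))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(psi(x - mid)) > 0:
            lo = mid  # estimate lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.array([1.2, 0.8, 1.1, 0.9, 1.0, 50.0])
# With psi(r) = r the solution is the sample mean, as expected.
print(m_estimate(x, lambda r: r), np.mean(x))
```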

Definition

Influence Function

The influence function of a statistical functional $T$ at distribution $F$ is:

$$\mathrm{IF}(x; T, F) = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)F + \epsilon \delta_x) - T(F)}{\epsilon}$$

where $\delta_x$ is a point mass at $x$. This measures how much a single contamination point at $x$ shifts the estimate.
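The influence function has a finite-sample analogue, the sensitivity curve $n \bigl(T_n(x_1, \ldots, x_{n-1}, x) - T_{n-1}(x_1, \ldots, x_{n-1})\bigr)$, which can be computed directly. A sketch using a simulated Gaussian base sample:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=99)  # simulated base sample

def sensitivity_curve(stat, base, x):
    """Finite-sample analogue of the IF: n * (T(base + {x}) - T(base))."""
    n = len(base) + 1
    return n * (stat(np.append(base, x)) - stat(base))

for x in [2.0, 100.0, 10000.0]:
    print(x, sensitivity_curve(np.mean, base, x),
          sensitivity_curve(np.median, base, x))
# The mean's sensitivity grows linearly in x (unbounded); the median's
# is flat once x clears the bulk of the data (bounded).
```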

Definition

Breakdown Point

The breakdown point of an estimator $T$ is the largest fraction $\epsilon^*$ of the data that can be replaced by arbitrary values before the estimator becomes unbounded (or otherwise useless):

$$\epsilon^* = \sup \{\epsilon : \mathrm{bias}(T, \epsilon) < \infty\}$$

The sample mean has breakdown point $0$ (one outlier can make it infinite). The median has breakdown point $1/2$ (the best possible).
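A quick numerical illustration of these breakdown points, on an invented dataset where every clean observation equals 1:

```python
import numpy as np

x = np.ones(100)  # clean data: every observation equals 1

for k in [1, 49, 51]:
    corrupted = x.copy()
    corrupted[:k] = 1e12  # replace k points with an arbitrary huge value
    print(k, np.mean(corrupted), np.median(corrupted))
# One bad point already wrecks the mean. The median holds at 1 through
# 49 corrupted points but is captured once the corrupted points form a
# majority (k = 51), matching its breakdown point of 1/2.
```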

Core Definitions

Common $\rho$ functions and their properties:

The quadratic loss $\rho(r) = r^2/2$ yields the sample mean. Its $\psi$-function is $\psi(r) = r$, which is unbounded, so outliers have unlimited influence.

The absolute loss $\rho(r) = |r|$ yields the sample median. Its $\psi$-function is $\psi(r) = \operatorname{sign}(r)$, which is bounded, giving robustness.

The Huber loss with threshold $c > 0$ is:

$$\rho_c(r) = \begin{cases} r^2/2 & \text{if } |r| \leq c \\ c|r| - c^2/2 & \text{if } |r| > c \end{cases}$$

Its $\psi$-function is $\psi_c(r) = \max(-c, \min(r, c))$. This acts like squared loss for small residuals (efficient) and like absolute loss for large residuals (robust). The parameter $c$ controls the tradeoff: $c = 1.345$ gives 95% efficiency at the Gaussian while still being robust.
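A direct numpy transcription of the Huber $\rho_c$ and $\psi_c$ just defined, with a numerical check that $\psi_c$ really is the derivative of $\rho_c$:

```python
import numpy as np

C = 1.345  # the threshold quoted for 95% Gaussian efficiency

def huber_rho(r, c=C):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def huber_psi(r, c=C):
    """Its derivative: the identity clipped to [-c, c]."""
    return np.clip(r, -c, c)

# Check psi against a numerical derivative of rho (grid avoids the kinks).
r = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (huber_rho(r + h) - huber_rho(r - h)) / (2 * h)
print(np.max(np.abs(numeric - huber_psi(r))))  # essentially zero
```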

Tukey's bisquare (biweight) goes further: $\psi(r) = 0$ for $|r| > c$, completely ignoring extreme outliers. This gives a redescending $\psi$-function. The tradeoff is that the optimization problem becomes non-convex.
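For comparison, a sketch of the bisquare $\psi$; the tuning constant $c \approx 4.685$ is the value commonly quoted for 95% Gaussian efficiency:

```python
import numpy as np

def bisquare_psi(r, c=4.685):
    """Tukey's biweight psi: redescends to exactly zero for |r| > c."""
    inside = r * (1.0 - (r / c) ** 2) ** 2
    return np.where(np.abs(r) <= c, inside, 0.0)

print(bisquare_psi(np.array([0.5, 2.0, 10.0])))
# The gross outlier at 10 gets psi exactly 0 and is ignored entirely.
```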

Main Theorems

Theorem

Influence Function of an M-Estimator

Statement

For an M-estimator with $\psi$-function $\psi$, the influence function at distribution $F$ is:

$$\mathrm{IF}(x; T, F) = \frac{\psi(x - \theta)}{\int \psi'(y - \theta) \, dF(y)}$$

where $\theta = T(F)$ is the true parameter value under $F$.

Intuition

The numerator $\psi(x - \theta)$ is how hard a contamination point at $x$ pulls on the estimator. The denominator is a normalizing factor from the population. If $\psi$ is bounded, the influence function is bounded, and no single contamination point can move the estimator far.

Proof Sketch

Write the estimating equation at the contaminated distribution $F_\epsilon = (1 - \epsilon)F + \epsilon \delta_x$. Differentiate with respect to $\epsilon$ at $\epsilon = 0$ using implicit differentiation. The result drops out directly from the chain rule applied to $\int \psi(y - T(F_\epsilon)) \, dF_\epsilon(y) = 0$.
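The theorem can be checked numerically. The sketch below (assuming $F = N(0,1)$ and the Huber $\psi$ with $c = 1.345$) solves the population estimating equation at a contaminated distribution by quadrature and bisection, then compares the finite-$\epsilon$ difference quotient to the closed-form IF; for the Huber $\psi$, $\int \psi' \, dF = P(|Y| \leq c) = \operatorname{erf}(c/\sqrt{2})$:

```python
import numpy as np
from math import erf, sqrt

C = 1.345
def psi(r):
    return np.clip(r, -C, C)

# Quadrature grid for integrals against the standard normal density.
y = np.linspace(-12, 12, 200001)
dy = y[1] - y[0]
phi = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

def T(eps, x):
    """Root of the population estimating equation under
    F_eps = (1 - eps) N(0,1) + eps * delta_x, found by bisection."""
    def g(t):
        return (1 - eps) * np.sum(psi(y - t) * phi) * dy + eps * psi(x - t)
    lo, hi = -5.0, 5.0
    for _ in range(80):          # g is decreasing in t
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x, eps = 3.0, 1e-4
empirical = (T(eps, x) - T(0.0, x)) / eps     # finite-eps difference quotient
theory = psi(x) / erf(C / sqrt(2))            # IF(x) from the theorem
print(empirical, theory)  # the two should agree closely
```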

Why It Matters

This theorem is the main diagnostic tool for robustness. Before using an estimator in practice, compute its influence function. If the IF is unbounded, a single outlier can cause arbitrarily large bias. Bounded IF is the minimum requirement for robustness.

Failure Mode

The influence function is a local measure: it describes the effect of infinitesimal contamination. It does not tell you what happens when 10% of your data is corrupted. For that, you need the breakdown point.

Proposition

Maximum Breakdown Point

Statement

For any translation-equivariant estimator of a location parameter based on $n$ observations, the breakdown point satisfies:

$$\epsilon^* \leq \frac{\lfloor n/2 \rfloor + 1}{n}$$

The sample median achieves this bound, so the maximum possible breakdown point is approximately $1/2$.

Intuition

If more than half the data is corrupted, the corrupted points form a majority and can dictate the estimate. No estimator can survive corruption of the majority.

Proof Sketch

Replace $\lfloor n/2 \rfloor + 1$ observations with copies of a value $M$ and let $M \to \infty$. Any translation-equivariant estimator must follow these points to infinity because they now form a majority. For the median, replacing fewer than half the points leaves the median among the uncorrupted observations.

Why It Matters

This sets a fundamental limit on robustness. It also explains why the median is special: it achieves the highest possible breakdown point for location estimation.

Canonical Examples

Example

Huber M-estimator for location

Suppose you observe $X = (1.2, 0.8, 1.1, 0.9, 1.0, 50.0)$. The sample mean is $9.17$, dragged far from the bulk by the outlier at $50$.

The Huber M-estimator with $c = 1.345$ downweights the outlier. Iteratively solving $\sum_i \psi_c(X_i - \hat{\theta}) = 0$ converges to $\hat{\theta} \approx 1.0$, which reflects the bulk of the data. The outlier's contribution is clipped at the threshold instead of pulling with its full residual of roughly $49$.
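This computation can be reproduced with iteratively reweighted least squares (IRLS). A sketch that standardizes residuals by the MAD (a common convention, and the one the worked numbers above implicitly assume); it settles near 1, far from the mean's 9.17:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.1, 0.9, 1.0, 50.0])
c = 1.345
s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD scale estimate

theta = np.median(x)  # robust starting point
for _ in range(100):
    r = (x - theta) / s  # standardized residuals
    # Huber weights w(r) = psi(r)/r: 1 for |r| <= c, c/|r| beyond.
    w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
    theta = np.sum(w * x) / np.sum(w)

print(round(theta, 2))       # ~1.06: tracks the bulk of the data
print(round(np.mean(x), 2))  # 9.17: dragged by the outlier
```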

Example

Robust regression with Huber loss

In linear regression $y = X\beta + \epsilon$, replace the squared loss with Huber loss:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho_c(y_i - x_i^T \beta)$$

This is robust to outliers in $y$ (vertical outliers). For protection against leverage points (outliers in $x$), you need more sophisticated methods like MM-estimators or least trimmed squares.
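A numpy-only IRLS sketch of Huber regression on synthetic data (the data-generating setup is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-2, 2, size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
y[:20] += 15.0  # corrupt 10% of the responses: vertical outliers

def huber_irls(X, y, c=1.345, iters=50):
    """IRLS for Huber regression, standardizing residuals by the MAD."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS start
    for _ in range(iters):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))  # robust scale
        u = r / s
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))
        Xw = X * np.sqrt(w)[:, None]  # weighted least-squares step
        beta = np.linalg.lstsq(Xw, np.sqrt(w) * y, rcond=None)[0]
    return beta

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_rob = huber_irls(X, y)
print(beta_ols)  # intercept dragged upward by the outliers
print(beta_rob)  # near the true (1, 2)
```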

Common Confusions

Watch Out

Robustness is not just about outlier removal

Robust estimators do not simply remove outliers and then apply classical methods. They use all the data but reweight observations continuously based on residual size. This is more principled: you do not need to choose a hard threshold for what counts as an outlier.
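The implicit weight is $w(r) = \psi(r)/r$, which for the Huber loss equals 1 for small residuals and decays like $c/|r|$ beyond the threshold; a smooth reweighting, not a hard cut:

```python
import numpy as np

c = 1.345

def huber_weight(r):
    """Implicit observation weight w(r) = psi(r)/r for the Huber loss."""
    return np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))

# Weights shrink continuously with residual size; there is no hard
# in-or-out decision about which points count as outliers.
for r in [0.5, 1.345, 2.0, 10.0, 100.0]:
    print(r, float(huber_weight(r)))
```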

Watch Out

Efficiency and robustness are not mutually exclusive

A common misconception is that robust estimators are much less efficient at the Gaussian. The Huber estimator with $c = 1.345$ achieves 95% asymptotic efficiency at the Gaussian while having a breakdown point of roughly $1/n$. For higher breakdown, use MM-estimators, which achieve both high efficiency and $\epsilon^* = 1/2$.
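The 95% figure can be checked by simulation: under Gaussian data, where the mean is optimal, compare sampling variances. A seeded Monte Carlo sketch (scale known and equal to 1, so $c$ applies to raw residuals):

```python
import numpy as np

rng = np.random.default_rng(42)
c, n, reps = 1.345, 200, 2000

def huber_loc(x, iters=20):
    """Huber location estimate via IRLS (scale fixed at 1)."""
    theta = np.median(x)
    for _ in range(iters):
        r = x - theta
        # w(r) = psi(r)/r: 1 for |r| <= c, c/|r| beyond.
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        theta = np.sum(w * x) / np.sum(w)
    return theta

means = np.empty(reps)
hubers = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)
    means[i] = x.mean()
    hubers[i] = huber_loc(x)

# Relative efficiency: variance of the optimal estimator (the mean)
# over variance of the Huber estimate. Should be close to 0.95.
eff = means.var() / hubers.var()
print(round(eff, 2))
```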

Summary

  • M-estimators generalize MLE by minimizing $\sum_i \rho(r_i)$ for a chosen loss function $\rho$
  • The influence function measures sensitivity to contamination: a bounded IF means no single point can cause unbounded bias
  • The breakdown point measures the fraction of data that can be corrupted before the estimator fails; $1/2$ is the maximum
  • Huber loss is the workhorse: quadratic for small residuals, linear for large ones, controlled by threshold $c$
  • In ML, Huber loss and similar robust losses are standard in regression, reinforcement learning, and any setting with noisy labels

Exercises

ExerciseCore

Problem

Compute the influence function of the sample mean (i.e., the M-estimator with $\rho(r) = r^2/2$) and verify that it is unbounded.

ExerciseAdvanced

Problem

Show that the Huber $\psi$-function $\psi_c(r) = \max(-c, \min(r, c))$ yields a bounded influence function. What is the maximum influence?

ExerciseResearch

Problem

The Huber M-estimator has breakdown point approximately $1/n$ (it can survive one outlier but not two). Explain why, and describe how MM-estimators achieve breakdown point $1/2$ while maintaining high Gaussian efficiency.

References

Canonical:

  • Huber & Ronchetti, Robust Statistics (2nd ed., 2009), Chapters 1-4
  • Hampel, Ronchetti, Rousseeuw, Stahel, Robust Statistics (1986)

Current:

  • Maronna, Martin, Yohai, Salibián-Barrera, Robust Statistics: Theory and Methods (2nd ed., 2019)
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-8


Last reviewed: April 2026
