
ML Methods

Wavelet Smoothing

Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.


Why This Matters

Kernel smoothers and Fourier methods use a single bandwidth or frequency cutoff globally. This fails when the signal has varying smoothness: sharp edges in one region and smooth curves in another. Wavelets adapt to local structure because they are localized in both time and frequency. Thresholding wavelet coefficients provides a denoising method that is simultaneously simple, computationally efficient, and minimax-optimal over broad function classes.

Mental Model

Observe a noisy signal $y_i = f(x_i) + \epsilon_i$, where $f$ is unknown and the $\epsilon_i$ are i.i.d. noise drawn from a known distribution. The goal is to estimate $f$.

The wavelet approach: (1) transform $y$ into wavelet coefficients, (2) shrink or zero out small coefficients (which are mostly noise), (3) invert the transform. Large coefficients correspond to genuine signal features; small coefficients are dominated by noise.
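
A minimal sketch of this three-step pipeline, using a single level of the orthonormal Haar transform on synthetic data (the signal, noise level, and threshold choice here are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: noisy samples of a smooth signal on a dyadic grid.
n, sigma = 256, 0.3
x = np.arange(n) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

# Step 1: transform -- one level of the orthonormal Haar transform.
approx = (y[0::2] + y[1::2]) / np.sqrt(2)   # local averages (coarse part)
detail = (y[0::2] - y[1::2]) / np.sqrt(2)   # local differences (fine part)

# Step 2: zero out small detail coefficients, which are mostly noise.
lam = sigma * np.sqrt(2 * np.log(n))        # universal threshold (see below)
detail[np.abs(detail) <= lam] = 0.0

# Step 3: invert the transform.
f_hat = np.empty(n)
f_hat[0::2] = (approx + detail) / np.sqrt(2)
f_hat[1::2] = (approx - detail) / np.sqrt(2)
```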

Formal Setup

Definition

Multiresolution Analysis

A multiresolution analysis of $L^2(\mathbb{R})$ is a nested sequence of closed subspaces:

$$\{0\} \subset \cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots \subset L^2(\mathbb{R})$$

satisfying: (1) $\bigcup_j V_j$ is dense in $L^2(\mathbb{R})$, (2) $\bigcap_j V_j = \{0\}$, (3) $f(x) \in V_j \iff f(2x) \in V_{j+1}$, and (4) there exists a scaling function $\phi$ such that $\{\phi(x - k)\}_{k \in \mathbb{Z}}$ is an orthonormal basis for $V_0$.
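
For the Haar MRA, $V_j$ is the space of functions piecewise constant on dyadic intervals of length $2^{-j}$, and the nesting $V_j \subset V_{j+1}$ can be checked numerically. A small sketch (the dyadic sample grid and the function names are assumptions of this illustration):

```python
import numpy as np

def project_haar_Vj(f_samples, j):
    """Orthogonal projection onto the Haar space V_j: replace f by its
    average over each dyadic interval of length 2**-j (dyadic grid assumed)."""
    block = len(f_samples) >> j              # samples per dyadic interval
    means = f_samples.reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)

x = np.arange(1024) / 1024
f = np.sin(2 * np.pi * x)

coarse = project_haar_Vj(f, 2)               # an element of V_2
# Nesting V_2 in V_5: re-projecting the coarse approximation at the
# finer scale leaves it unchanged.
assert np.allclose(project_haar_Vj(coarse, 5), coarse)
```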

Definition

Wavelet Basis

Given an MRA, define the wavelet subspace $W_j$ as the orthogonal complement of $V_j$ in $V_{j+1}$: $V_{j+1} = V_j \oplus W_j$. The wavelet function $\psi$ generates an orthonormal basis for $W_j$ via:

$$\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k)$$

Any $f \in L^2(\mathbb{R})$ has the expansion $f = \sum_j \sum_k d_{j,k} \psi_{j,k}$, where $d_{j,k} = \langle f, \psi_{j,k} \rangle$.

Common Wavelet Families

Haar wavelets. The simplest: $\psi(x) = 1$ on $[0, 1/2)$, $\psi(x) = -1$ on $[1/2, 1)$, zero elsewhere. Discontinuous, so only one vanishing moment. Good for piecewise-constant signals.

Daubechies wavelets. A family indexed by the number of vanishing moments $N$. The Daubechies-$N$ wavelet has support length $2N - 1$ and $N$ vanishing moments: $\int x^k \psi(x)\,dx = 0$ for $k = 0, \ldots, N-1$. More vanishing moments give better approximation of smooth functions but wider support.
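
The vanishing-moment identity can be checked numerically. A sketch using PyWavelets (the `pywt` dependency and the grid resolution are assumptions of this example; `db4` denotes the Daubechies wavelet with $N = 4$):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Sample the Daubechies-4 wavelet on a fine grid and approximate its
# moments: int x^k psi(x) dx should vanish for k = 0, ..., 3.
_, psi, x = pywt.Wavelet('db4').wavefun(level=12)
dx = x[1] - x[0]
for k in range(5):
    moment = np.sum(x**k * psi) * dx
    print(f"k={k}: moment = {moment:+.2e}")  # tiny for k < 4, larger at k = 4
```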

The number of vanishing moments determines how quickly the wavelet coefficients $d_{j,k}$ decay with resolution level $j$ for smooth signals. If $f$ has $s$ continuous derivatives and $\psi$ has $N \geq s$ vanishing moments, then $|d_{j,k}| = O(2^{-j(s + 1/2)})$.
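
This decay is visible empirically. A sketch, again assuming PyWavelets, that prints the largest detail coefficient per level for a smooth periodic signal (`periodization` mode avoids boundary artifacts; the wavelet and depth are arbitrary choices):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

n = 2**12
x = np.arange(n) / n
f = np.sin(2 * np.pi * x) ** 3                    # smooth, periodic

# coeffs = [cA6, cD6 (coarsest detail), ..., cD1 (finest detail)]
coeffs = pywt.wavedec(f, 'db6', level=6, mode='periodization')
for lvl, d in zip(range(6, 0, -1), coeffs[1:]):
    print(f"detail level {lvl}: max |d| = {np.max(np.abs(d)):.2e}")
# The maxima shrink by orders of magnitude toward the finer levels.
```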

Thresholding

Given noisy observations $y_i = f(x_i) + \sigma \epsilon_i$ with $\epsilon_i \sim N(0,1)$, the empirical wavelet coefficients are $\hat{d}_{j,k} = d_{j,k} + \sigma z_{j,k}$, where $z_{j,k} \sim N(0,1)$ by orthogonality of the transform.

Definition

Hard Thresholding

$$\hat{d}_{j,k}^{\text{hard}} = \hat{d}_{j,k} \cdot \mathbf{1}(|\hat{d}_{j,k}| > \lambda)$$

Keep coefficients whose magnitude exceeds $\lambda$; set the rest to zero.

Definition

Soft Thresholding

$$\hat{d}_{j,k}^{\text{soft}} = \operatorname{sign}(\hat{d}_{j,k}) \cdot \max(|\hat{d}_{j,k}| - \lambda,\, 0)$$

Shrink all coefficients toward zero by $\lambda$.
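
Both rules are one-liners. A NumPy sketch (equivalent, up to conventions, to `pywt.threshold` with modes `'hard'` and `'soft'`):

```python
import numpy as np

def hard_threshold(d, lam):
    # Keep-or-kill: zero out every coefficient with |d| <= lam.
    return d * (np.abs(d) > lam)

def soft_threshold(d, lam):
    # Shrink toward zero by lam; anything within lam of zero becomes zero.
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

d = np.array([-3.0, -0.4, 0.2, 1.5])
print(hard_threshold(d, 1.0))   # [-3.  -0.   0.   1.5]
print(soft_threshold(d, 1.0))   # [-2.  -0.   0.   0.5]
```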

The universal threshold $\lambda = \sigma \sqrt{2 \log n}$ (where $n$ is the number of observations) has a clean justification: it is the level that the maximum of $n$ i.i.d. standard normal variables is unlikely to exceed.
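
A quick simulation of that justification (sample sizes, trial count, and seed are arbitrary): the maximum of $n$ standard normals stays below $\sqrt{2 \log n}$ most of the time, at every $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 500
for n in (2**8, 2**12, 2**16):
    lam = np.sqrt(2 * np.log(n))                 # sigma = 1
    # Fraction of trials where the max of n i.i.d. |N(0,1)| exceeds lam.
    exceed = sum(np.abs(rng.standard_normal(n)).max() > lam
                 for _ in range(trials))
    print(f"n={n:6d}  lambda={lam:.2f}  P(max > lambda) ~ {exceed/trials:.3f}")
```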

Main Theorems

Theorem

Minimax Optimality of Wavelet Thresholding

Statement

Consider the Gaussian white noise model $dY(t) = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\,dW(t)$. Soft thresholding with the universal threshold $\lambda = \sigma \sqrt{2 \log n / n}$ achieves:

$$\mathbb{E}\|\hat{f} - f\|_2^2 \leq C \left(\frac{\log n}{n}\right)^{2s/(2s+1)}$$

for $f$ in the Besov ball $B^s_{p,q}(M)$ with $s > 0$ and $p \geq 1$. This rate is minimax-optimal up to the $\log n$ factor.

Intuition

The rate $n^{-2s/(2s+1)}$ is the standard nonparametric rate for estimating functions with $s$ degrees of smoothness. Wavelets achieve this rate simultaneously over a wide range of Besov spaces without knowing $s$ in advance. The $\log n$ factor is the price of adaptation: a method that knew $s$ could avoid it, but thresholding achieves near-optimal rates without this knowledge.

Proof Sketch

Decompose the risk into bias and variance terms. For coefficients above the threshold, the variance contribution is $O(\sigma^2/n)$ per retained coefficient. For coefficients below the threshold, the bias contribution is bounded by the coefficient magnitude. The universal threshold balances these: $\lambda^2 = 2\sigma^2 \log n / n$ ensures that, with high probability, all pure-noise coefficients fall below the threshold. The Besov norm controls how many coefficients are large (signal) versus small (noise).

Why It Matters

Wavelet thresholding is one of the few nonparametric methods that is simultaneously adaptive (no tuning of smoothness parameter) and minimax-optimal (up to log factors) over a broad class of function spaces. The risk bounds rely on concentration inequalities to control the noise coefficients. It achieves what kernel methods cannot: adapting to spatially varying smoothness.

Failure Mode

The result requires the noise to be Gaussian (or sub-Gaussian). For heavy-tailed noise, the universal threshold $\sigma\sqrt{2\log n}$ may not control the noise coefficients. The result also requires $f$ to lie in a Besov space, which excludes some function classes (e.g., functions with infinitely many discontinuities of growing amplitude).

Why Wavelets Beat Fourier for Irregular Signals

Fourier truncation keeps the first $k$ frequency components and discards the rest. This is optimal for globally smooth functions but fails at discontinuities: the Gibbs phenomenon creates oscillations near jumps, the Fourier coefficients decay only like $1/k$, and the MSE rate drops from $O(n^{-2s/(2s+1)})$ for $C^s$ functions to $O(n^{-1/2})$ for piecewise smooth functions, no matter how smooth each piece is.

Wavelets avoid this because each wavelet coefficient is localized in both position and scale. A discontinuity at position $x_0$ affects only the coefficients $d_{j,k}$ with $k \approx 2^j x_0$; coefficients far from the discontinuity decay rapidly. Thresholding retains the large coefficients near the discontinuity (capturing the edge) and discards the small coefficients elsewhere (removing noise).
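
A sketch of this localization, assuming PyWavelets: for a step function, only the Haar coefficients whose support straddles the jump are nonzero, so the count of non-negligible coefficients per level stays $O(1)$. The jump position is chosen off the dyadic grid so it falls strictly inside wavelet supports.

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

n = 1024
f = (np.arange(n) >= 301).astype(float)     # jump at an off-dyadic position

coeffs = pywt.wavedec(f, 'haar', level=8)   # [cA8, cD8, ..., cD1]
for lvl, d in zip(range(8, 0, -1), coeffs[1:]):
    big = int(np.sum(np.abs(d) > 1e-10))
    print(f"level {lvl}: {big} nonzero of {len(d)} coefficients")
# Expect a handful (typically one) nonzero coefficient per level.
```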

Canonical Examples

Example

Denoising a piecewise-smooth signal

Consider $f(x) = \sin(4\pi x) \cdot \mathbf{1}(x < 0.5) + 0.5 \cdot \mathbf{1}(x \geq 0.5)$. This function is smooth except for a jump at $x = 0.5$.

Fourier truncation: retaining $k$ coefficients gives oscillations near $x = 0.5$ (Gibbs phenomenon). The MSE is dominated by the discontinuity regardless of $k$.

Wavelet thresholding (Daubechies-4, soft threshold): the large coefficients near $x = 0.5$ are retained, preserving the jump. The smooth portions are denoised by removing small coefficients. The MSE adapts to the local smoothness of each region.
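
A runnable version of this comparison, assuming PyWavelets (the noise level, decomposition depth, and Fourier cutoff are illustrative choices):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

rng = np.random.default_rng(0)
n, sigma = 1024, 0.2
x = np.arange(n) / n
f = np.where(x < 0.5, np.sin(4 * np.pi * x), 0.5)   # jump at x = 0.5
y = f + sigma * rng.standard_normal(n)

# Wavelet estimate: Daubechies-4, soft universal threshold.
lam = sigma * np.sqrt(2 * np.log(n))
coeffs = pywt.wavedec(y, 'db4', level=6)
coeffs[1:] = [pywt.threshold(d, lam, mode='soft') for d in coeffs[1:]]
f_wavelet = pywt.waverec(coeffs, 'db4')[:n]

# Fourier estimate: keep the k lowest frequencies, discard the rest.
k = 32
F = np.fft.rfft(y)
F[k:] = 0.0
f_fourier = np.fft.irfft(F, n)

print(f"wavelet MSE: {np.mean((f_wavelet - f) ** 2):.5f}")
print(f"fourier MSE: {np.mean((f_fourier - f) ** 2):.5f}")
```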

Common Confusions

Watch Out

Wavelets are not always better than Fourier

For globally smooth periodic functions (e.g., trigonometric series), Fourier methods achieve the optimal rate without the $\log n$ penalty. Wavelets pay a logarithmic price for adaptation. If you know your signal is globally smooth, Fourier truncation is simpler and slightly better.

Watch Out

The universal threshold is conservative

The threshold $\sigma\sqrt{2\log n}$ controls the maximum of $n$ noise coefficients simultaneously. For any individual coefficient, a much smaller threshold would suffice. This conservatism is why wavelet thresholding sometimes over-smooths. Level-dependent thresholds or cross-validation can improve practical performance at the cost of simplicity.
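
One simple level-dependent variant among many, sketched with PyWavelets: estimate $\sigma$ robustly from the finest-level details (the median-absolute-deviation rule) and give each level its own threshold. The function name and defaults here are illustrative, not a standard API:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

def denoise_level_dependent(y, wavelet='db4', level=6):
    """Soft-threshold each detail level with its own universal threshold,
    using a MAD estimate of the noise scale from the finest level."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    sigma_hat = np.median(np.abs(coeffs[-1])) / 0.6745   # robust sigma
    coeffs[1:] = [
        pywt.threshold(d, sigma_hat * np.sqrt(2 * np.log(len(d))), mode='soft')
        for d in coeffs[1:]
    ]
    return pywt.waverec(coeffs, wavelet)[:len(y)]
```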

Exercises

Exercise (Core)

Problem

For $n = 1024$ observations with noise level $\sigma = 1$, compute the universal threshold $\lambda$. If the true function has a wavelet coefficient $d_{j,k} = 0.5$, would hard thresholding retain or discard it?

Exercise (Advanced)

Problem

Explain why the Haar wavelet is minimax-optimal over the Besov space $B^1_{1,1}$ (functions of bounded variation) but not over $B^2_{2,2}$ (the Sobolev space $H^2$). What wavelet property determines the range of optimality?


