
ML Methods

Wavelet Smoothing

Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.


Why This Matters

Kernel smoothers and Fourier methods use a single bandwidth or frequency cutoff globally. This fails when the signal has varying smoothness: sharp edges in one region and smooth curves in another. Wavelets adapt to local structure because they are localized in both time and frequency. Thresholding wavelet coefficients provides a denoising method that is simultaneously simple, computationally efficient, and minimax-optimal over broad function classes.

Mental Model

Observe a noisy signal $y_i = f(x_i) + \epsilon_i$, where $f$ is unknown and the $\epsilon_i$ are i.i.d. noise drawn from a known distribution. The goal is to estimate $f$.

The wavelet approach: (1) transform $y$ into wavelet coefficients, (2) shrink or zero out small coefficients (which are mostly noise), (3) invert the transform. Large coefficients correspond to genuine signal features; small coefficients are dominated by noise.
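
A minimal sketch of this three-step pipeline, using a single level of the orthonormal Haar transform on synthetic data (the signal, noise level, and threshold choice here are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: noisy samples of a smooth signal on a dyadic grid.
n, sigma = 256, 0.3
x = np.arange(n) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

# Step 1: transform -- one level of the orthonormal Haar transform.
approx = (y[0::2] + y[1::2]) / np.sqrt(2)   # local averages (coarse part)
detail = (y[0::2] - y[1::2]) / np.sqrt(2)   # local differences (fine part)

# Step 2: zero out small detail coefficients, which are mostly noise.
lam = sigma * np.sqrt(2 * np.log(n))        # universal threshold (see below)
detail[np.abs(detail) <= lam] = 0.0

# Step 3: invert the transform.
f_hat = np.empty(n)
f_hat[0::2] = (approx + detail) / np.sqrt(2)
f_hat[1::2] = (approx - detail) / np.sqrt(2)
```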

Formal Setup

Definition

Multiresolution Analysis

A multiresolution analysis of $L^2(\mathbb{R})$ is a nested sequence of closed subspaces:

$$\{0\} \subset \cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots \subset L^2(\mathbb{R})$$

satisfying: (1) $\bigcup_j V_j$ is dense in $L^2(\mathbb{R})$, (2) $\bigcap_j V_j = \{0\}$, (3) $f(x) \in V_j \iff f(2x) \in V_{j+1}$, and (4) there exists a scaling function $\phi$ such that $\{\phi(x - k)\}_{k \in \mathbb{Z}}$ is an orthonormal basis for $V_0$.
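
For the Haar MRA, $V_j$ is the space of functions piecewise constant on dyadic intervals of length $2^{-j}$, and the nesting $V_j \subset V_{j+1}$ can be checked numerically. A small sketch (the dyadic sample grid and the function names are assumptions of this illustration):

```python
import numpy as np

def project_haar_Vj(f_samples, j):
    """Orthogonal projection onto the Haar space V_j: replace f by its
    average over each dyadic interval of length 2**-j (dyadic grid assumed)."""
    block = len(f_samples) >> j              # samples per dyadic interval
    means = f_samples.reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)

x = np.arange(1024) / 1024
f = np.sin(2 * np.pi * x)

coarse = project_haar_Vj(f, 2)               # an element of V_2
# Nesting V_2 in V_5: re-projecting the coarse approximation at the
# finer scale leaves it unchanged.
assert np.allclose(project_haar_Vj(coarse, 5), coarse)
```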

Definition

Wavelet Basis

Given an MRA, define the wavelet subspace $W_j$ as the orthogonal complement of $V_j$ in $V_{j+1}$: $V_{j+1} = V_j \oplus W_j$. The wavelet function $\psi$ generates an orthonormal basis for $W_j$ via:

$$\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k)$$

Any $f \in L^2(\mathbb{R})$ has the expansion $f = \sum_j \sum_k d_{j,k} \psi_{j,k}$, where $d_{j,k} = \langle f, \psi_{j,k} \rangle$.

Common Wavelet Families

Haar wavelets. The simplest: $\psi(x) = 1$ on $[0, 1/2)$, $\psi(x) = -1$ on $[1/2, 1)$, zero elsewhere. Discontinuous, so only one vanishing moment. Good for piecewise-constant signals.

Daubechies wavelets. A family indexed by the number of vanishing moments $N$. The Daubechies-$N$ wavelet has support length $2N - 1$ and $N$ vanishing moments: $\int x^k \psi(x)\,dx = 0$ for $k = 0, \ldots, N-1$. More vanishing moments give better approximation of smooth functions but wider support.
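
The vanishing-moment identity can be checked numerically. A sketch using PyWavelets (the `pywt` dependency and the grid resolution are assumptions of this example; `db4` denotes the Daubechies wavelet with $N = 4$):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Sample the Daubechies-4 wavelet on a fine grid and approximate its
# moments: int x^k psi(x) dx should vanish for k = 0, ..., 3.
_, psi, x = pywt.Wavelet('db4').wavefun(level=12)
dx = x[1] - x[0]
for k in range(5):
    moment = np.sum(x**k * psi) * dx
    print(f"k={k}: moment = {moment:+.2e}")  # tiny for k < 4, larger at k = 4
```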

The number of vanishing moments determines how quickly the wavelet coefficients $d_{j,k}$ decay with resolution level $j$ for smooth signals. If $f$ has $s$ continuous derivatives and $\psi$ has $N \geq s$ vanishing moments, then $|d_{j,k}| = O(2^{-j(s + 1/2)})$.
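
This decay is visible empirically. A sketch, again assuming PyWavelets, that prints the largest detail coefficient per level for a smooth periodic signal (`periodization` mode avoids boundary artifacts; the wavelet and depth are arbitrary choices):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

n = 2**12
x = np.arange(n) / n
f = np.sin(2 * np.pi * x) ** 3                    # smooth, periodic

# coeffs = [cA6, cD6 (coarsest detail), ..., cD1 (finest detail)]
coeffs = pywt.wavedec(f, 'db6', level=6, mode='periodization')
for lvl, d in zip(range(6, 0, -1), coeffs[1:]):
    print(f"detail level {lvl}: max |d| = {np.max(np.abs(d)):.2e}")
# The maxima shrink by orders of magnitude toward the finer levels.
```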

Thresholding

Given noisy observations $y_i = f(x_i) + \sigma \epsilon_i$ with $\epsilon_i \sim N(0,1)$, the empirical wavelet coefficients are $\hat{d}_{j,k} = d_{j,k} + \sigma z_{j,k}$, where $z_{j,k} \sim N(0,1)$ by orthogonality of the transform.

Definition

Hard Thresholding

$$\hat{d}_{j,k}^{\text{hard}} = \hat{d}_{j,k} \cdot \mathbf{1}(|\hat{d}_{j,k}| > \lambda)$$

Keep coefficients whose magnitude exceeds $\lambda$; set the rest to zero.

Definition

Soft Thresholding

$$\hat{d}_{j,k}^{\text{soft}} = \operatorname{sign}(\hat{d}_{j,k}) \cdot \max(|\hat{d}_{j,k}| - \lambda,\, 0)$$

Shrink all coefficients toward zero by $\lambda$.
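
Both rules are one-liners. A NumPy sketch (equivalent, up to conventions, to `pywt.threshold` with modes `'hard'` and `'soft'`):

```python
import numpy as np

def hard_threshold(d, lam):
    # Keep-or-kill: zero out every coefficient with |d| <= lam.
    return d * (np.abs(d) > lam)

def soft_threshold(d, lam):
    # Shrink toward zero by lam; anything within lam of zero becomes zero.
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

d = np.array([-3.0, -0.4, 0.2, 1.5])
print(hard_threshold(d, 1.0))   # [-3.  -0.   0.   1.5]
print(soft_threshold(d, 1.0))   # [-2.  -0.   0.   0.5]
```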

The universal threshold $\lambda = \sigma \sqrt{2 \log n}$ (where $n$ is the number of observations) has a clean justification: it is the level that the maximum of $n$ i.i.d. standard normal variables is unlikely to exceed.
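
A quick simulation of that justification (sample sizes, trial count, and seed are arbitrary): the maximum of $n$ standard normals stays below $\sqrt{2 \log n}$ most of the time, at every $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 500
for n in (2**8, 2**12, 2**16):
    lam = np.sqrt(2 * np.log(n))                 # sigma = 1
    # Fraction of trials where the max of n i.i.d. |N(0,1)| exceeds lam.
    exceed = sum(np.abs(rng.standard_normal(n)).max() > lam
                 for _ in range(trials))
    print(f"n={n:6d}  lambda={lam:.2f}  P(max > lambda) ~ {exceed/trials:.3f}")
```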

Main Theorems

Theorem

Minimax Optimality of Wavelet Thresholding

Statement

Consider the Gaussian white noise model $dY(t) = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\,dW(t)$. Soft thresholding with the universal threshold $\lambda = \sigma \sqrt{2 \log n / n}$ achieves:

$$\mathbb{E}\|\hat{f} - f\|_2^2 \leq C \left(\frac{\log n}{n}\right)^{2s/(2s+1)}$$

for $f$ in the Besov ball $B^s_{p,q}(M)$ with $s > 0$ and $p \geq 1$. This rate is minimax-optimal up to the $\log n$ factor.

Intuition

The rate $n^{-2s/(2s+1)}$ is the standard nonparametric rate for estimating functions with $s$ degrees of smoothness. Wavelets achieve this rate simultaneously over a wide range of Besov spaces without knowing $s$ in advance. The $\log n$ factor is the price of adaptation: a method that knew $s$ could avoid it, but thresholding achieves near-optimal rates without this knowledge.

Proof Sketch

Decompose the risk into bias and variance terms. For coefficients above the threshold, the variance contribution is $O(\sigma^2/n)$ per retained coefficient. For coefficients below the threshold, the bias contribution is bounded by the coefficient magnitude. The universal threshold balances these: $\lambda^2 = 2\sigma^2 \log n / n$ ensures that, with high probability, all pure-noise coefficients fall below the threshold. The Besov norm controls how many coefficients are large (signal) versus small (noise).

Why It Matters

Wavelet thresholding is one of the few nonparametric methods that is simultaneously adaptive (no tuning of smoothness parameter) and minimax-optimal (up to log factors) over a broad class of function spaces. The risk bounds rely on concentration inequalities to control the noise coefficients. It achieves what kernel methods cannot: adapting to spatially varying smoothness.

Failure Mode

The result requires the noise to be Gaussian (or sub-Gaussian). For heavy-tailed noise, the universal threshold $\sigma\sqrt{2\log n}$ may not control the noise coefficients. The result also requires $f$ to lie in a Besov space, which excludes some function classes (e.g., functions with infinitely many discontinuities of growing amplitude).

Why Wavelets Beat Fourier for Irregular Signals

Fourier truncation keeps the first $k$ frequency components and discards the rest. This is optimal for globally smooth functions but fails at discontinuities: the Gibbs phenomenon creates oscillations near jumps, the Fourier coefficients decay only like $1/k$, and the MSE rate drops from $O(n^{-2s/(2s+1)})$ for $C^s$ functions to $O(n^{-1/2})$ for piecewise smooth functions, no matter how smooth each piece is.

Wavelets avoid this because each wavelet coefficient is localized in both position and scale. A discontinuity at position $x_0$ affects only the coefficients $d_{j,k}$ with $k \approx 2^j x_0$; coefficients far from the discontinuity decay rapidly. Thresholding retains the large coefficients near the discontinuity (capturing the edge) and discards the small coefficients elsewhere (removing noise).
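
A sketch of this localization, assuming PyWavelets: for a step function, only the Haar coefficients whose support straddles the jump are nonzero, so the count of non-negligible coefficients per level stays $O(1)$. The jump position is chosen off the dyadic grid so it falls strictly inside wavelet supports.

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

n = 1024
f = (np.arange(n) >= 301).astype(float)     # jump at an off-dyadic position

coeffs = pywt.wavedec(f, 'haar', level=8)   # [cA8, cD8, ..., cD1]
for lvl, d in zip(range(8, 0, -1), coeffs[1:]):
    big = int(np.sum(np.abs(d) > 1e-10))
    print(f"level {lvl}: {big} nonzero of {len(d)} coefficients")
# Expect a handful (typically one) nonzero coefficient per level.
```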

Canonical Examples

Example

Denoising a piecewise-smooth signal

Consider $f(x) = \sin(4\pi x) \cdot \mathbf{1}(x < 0.5) + 0.5 \cdot \mathbf{1}(x \geq 0.5)$. This function is smooth except for a jump at $x = 0.5$.

Fourier truncation: retaining $k$ coefficients gives oscillations near $x = 0.5$ (Gibbs phenomenon). The MSE is dominated by the discontinuity regardless of $k$.

Wavelet thresholding (Daubechies-4, soft threshold): the large coefficients near $x = 0.5$ are retained, preserving the jump. The smooth portions are denoised by removing small coefficients. The MSE adapts to the local smoothness of each region.
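
A runnable version of this comparison, assuming PyWavelets (the noise level, decomposition depth, and Fourier cutoff are illustrative choices):

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

rng = np.random.default_rng(0)
n, sigma = 1024, 0.2
x = np.arange(n) / n
f = np.where(x < 0.5, np.sin(4 * np.pi * x), 0.5)   # jump at x = 0.5
y = f + sigma * rng.standard_normal(n)

# Wavelet estimate: Daubechies-4, soft universal threshold.
lam = sigma * np.sqrt(2 * np.log(n))
coeffs = pywt.wavedec(y, 'db4', level=6)
coeffs[1:] = [pywt.threshold(d, lam, mode='soft') for d in coeffs[1:]]
f_wavelet = pywt.waverec(coeffs, 'db4')[:n]

# Fourier estimate: keep the k lowest frequencies, discard the rest.
k = 32
F = np.fft.rfft(y)
F[k:] = 0.0
f_fourier = np.fft.irfft(F, n)

print(f"wavelet MSE: {np.mean((f_wavelet - f) ** 2):.5f}")
print(f"fourier MSE: {np.mean((f_fourier - f) ** 2):.5f}")
```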

Common Confusions

Watch Out

Wavelets are not always better than Fourier

For globally smooth periodic functions (e.g., trigonometric series), Fourier methods achieve the optimal rate without the $\log n$ penalty. Wavelets pay a logarithmic price for adaptation. If you know your signal is globally smooth, Fourier truncation is simpler and slightly better.

Watch Out

The universal threshold is conservative

The threshold $\sigma\sqrt{2\log n}$ controls the maximum of $n$ noise coefficients simultaneously. For any individual coefficient, a much smaller threshold would suffice. This conservatism is why wavelet thresholding sometimes over-smooths. Level-dependent thresholds or cross-validation can improve practical performance at the cost of simplicity.
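
One simple level-dependent variant among many, sketched with PyWavelets: estimate $\sigma$ robustly from the finest-level details (the median-absolute-deviation rule) and give each level its own threshold. The function name and defaults here are illustrative, not a standard API:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

def denoise_level_dependent(y, wavelet='db4', level=6):
    """Soft-threshold each detail level with its own universal threshold,
    using a MAD estimate of the noise scale from the finest level."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    sigma_hat = np.median(np.abs(coeffs[-1])) / 0.6745   # robust sigma
    coeffs[1:] = [
        pywt.threshold(d, sigma_hat * np.sqrt(2 * np.log(len(d))), mode='soft')
        for d in coeffs[1:]
    ]
    return pywt.waverec(coeffs, wavelet)[:len(y)]
```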

Exercises

Exercise (Core)

Problem

For $n = 1024$ observations with noise level $\sigma = 1$, compute the universal threshold $\lambda$. If the true function has a wavelet coefficient $d_{j,k} = 0.5$, would hard thresholding retain or discard it?

Exercise (Advanced)

Problem

Explain why the Haar wavelet is minimax-optimal over the Besov space $B^1_{1,1}$ (functions of bounded variation) but not over $B^2_{2,2}$ (the Sobolev space $H^2$). What wavelet property determines the range of optimality?


