Wavelet Smoothing
Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.
Why This Matters
Kernel smoothers and Fourier methods use a single bandwidth or frequency cutoff globally. This fails when the signal has varying smoothness: sharp edges in one region and smooth curves in another. Wavelets adapt to local structure because they are localized in both time and frequency. Thresholding wavelet coefficients provides a denoising method that is simultaneously simple, computationally efficient, and minimax-optimal over broad function classes.
Mental Model
Observe a noisy signal $y_i = f(x_i) + \varepsilon_i$, $i = 1, \dots, n$, where $f$ is unknown and the $\varepsilon_i$ are i.i.d. noise drawn from a known distribution. The goal is to estimate $f$.
The wavelet approach: (1) transform the observations into wavelet coefficients, (2) shrink or zero out small coefficients (which are mostly noise), (3) invert the transform. Large coefficients correspond to genuine signal features. Small coefficients are dominated by noise.
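As a concrete sketch of this three-step pipeline, here is a minimal NumPy implementation using the orthonormal Haar transform. The function names, test signal, and noise level are illustrative choices, not part of the text:

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal Haar transform: returns coarse approximation + detail bands."""
    a, details = x.astype(float), []
    while a.size > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return a, details  # details[0] is the finest scale

def haar_idwt(a, details):
    """Invert haar_dwt exactly (the transform is orthonormal)."""
    for d in reversed(details):
        x = np.empty(2 * a.size)
        x[0::2], x[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        a = x
    return a

def denoise(y, sigma):
    lam = sigma * np.sqrt(2 * np.log(y.size))            # universal threshold
    a, details = haar_dwt(y)                             # (1) transform
    details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0)
               for d in details]                         # (2) soft-threshold
    return haar_idwt(a, details)                         # (3) invert

rng = np.random.default_rng(0)
f = np.where(np.arange(256) < 128, 0.0, 4.0)             # piecewise-constant signal
y = f + rng.normal(0, 1, 256)
fhat = denoise(y, sigma=1.0)
print(np.mean((fhat - f) ** 2), np.mean((y - f) ** 2))   # denoised MSE vs raw MSE
```

Soft thresholding zeroes nearly all detail coefficients (pure noise) while keeping the few large ones that encode the jump, so the denoised MSE is far below the raw noise level.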
Formal Setup
Multiresolution Analysis
A multiresolution analysis of $L^2(\mathbb{R})$ is a nested sequence of closed subspaces:
$$\cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots$$
satisfying: (1) $\bigcup_j V_j$ is dense in $L^2(\mathbb{R})$, (2) $\bigcap_j V_j = \{0\}$, (3) $f(x) \in V_j \iff f(2x) \in V_{j+1}$, and (4) there exists a scaling function $\phi$ such that $\{\phi(x - k)\}_{k \in \mathbb{Z}}$ is an orthonormal basis for $V_0$.
Wavelet Basis
Given an MRA, define the wavelet subspace $W_j$ as the orthogonal complement of $V_j$ in $V_{j+1}$: $V_{j+1} = V_j \oplus W_j$. The wavelet function $\psi$ generates an orthonormal basis for $W_j$ via:
$$\psi_{jk}(x) = 2^{j/2}\,\psi(2^j x - k), \qquad k \in \mathbb{Z}.$$
Any $f \in L^2(\mathbb{R})$ has the expansion
$$f = \sum_k \alpha_{j_0 k}\,\phi_{j_0 k} + \sum_{j \ge j_0} \sum_k \beta_{jk}\,\psi_{jk},$$
where $\alpha_{j_0 k} = \langle f, \phi_{j_0 k}\rangle$ and $\beta_{jk} = \langle f, \psi_{jk}\rangle$.
Common Wavelet Families
Haar wavelets. The simplest: $\psi(x) = 1$ on $[0, 1/2)$, $\psi(x) = -1$ on $[1/2, 1)$, zero elsewhere. Discontinuous, so only one vanishing moment. Good for piecewise-constant signals.
Daubechies wavelets. A family indexed by the number of vanishing moments $N$. The Daubechies-$N$ wavelet has support length $2N - 1$ and $N$ vanishing moments: $\int x^m \psi(x)\,dx = 0$ for $m = 0, \dots, N - 1$. More vanishing moments give better approximation of smooth functions but wider support.
The number of vanishing moments determines how quickly wavelet coefficients decay with resolution level for smooth signals. If $f$ has $m$ continuous derivatives and $\psi$ has $N \ge m$ vanishing moments, then $|\beta_{jk}| = O(2^{-j(m + 1/2)})$.
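This decay is easy to check numerically. The NumPy sketch below (with a hypothetical helper `haar_details`) computes the largest Haar coefficient at each level for a smooth signal; since Haar has one vanishing moment, coefficients one level coarser should be larger by a factor of roughly $2^{3/2} \approx 2.83$:

```python
import numpy as np

def haar_details(x):
    """Detail bands of the orthonormal Haar transform, finest scale first."""
    a, details = x.astype(float), []
    while a.size > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return details

t = np.arange(1024) / 1024
details = haar_details(np.sin(2 * np.pi * t))          # smooth test signal
peaks = [np.abs(d).max() for d in details]
ratios = [peaks[i + 1] / peaks[i] for i in range(4)]   # coarse-to-fine peak ratios
print(ratios)  # each close to 2**1.5 ~ 2.83 at the fine scales
```

At scales comparable to the signal's period the linear approximation behind the decay bound breaks down, so only the fine-scale ratios are checked.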
Thresholding
Given noisy observations $y_i = f(i/n) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, the empirical wavelet coefficients are $\hat\beta_{jk} = \beta_{jk} + \sigma z_{jk}/\sqrt{n}$, where the $z_{jk} \sim N(0,1)$ are i.i.d. by orthogonality.
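Because the transform is orthonormal, i.i.d. Gaussian noise maps to i.i.d. Gaussian coefficients with the same variance. A quick NumPy check (the helper name `haar_dwt` is an illustrative choice):

```python
import numpy as np

def haar_dwt(x):
    """Full orthonormal Haar transform as a flat coefficient vector."""
    a, details = x.astype(float), []
    while a.size > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return np.concatenate([a] + details[::-1])

rng = np.random.default_rng(0)
z = rng.normal(0, 1, (500, 1024))               # pure-noise signals, sigma = 1
coeffs = np.apply_along_axis(haar_dwt, 1, z)
print(coeffs.std())                              # close to 1: orthogonality preserves the noise level
```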
Hard Thresholding
Keep coefficients above the threshold, set the rest to zero: $\hat\beta^{\mathrm{hard}}_{jk} = \hat\beta_{jk}\,\mathbf{1}\{|\hat\beta_{jk}| > \lambda\}$.
Soft Thresholding
Shrink all coefficients toward zero by $\lambda$: $\hat\beta^{\mathrm{soft}}_{jk} = \operatorname{sign}(\hat\beta_{jk})\,(|\hat\beta_{jk}| - \lambda)_+$.
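Both rules are one-liners in NumPy (function names are illustrative):

```python
import numpy as np

def hard_threshold(b, lam):
    # keep coefficients with |b| > lam, zero out the rest
    return np.where(np.abs(b) > lam, b, 0.0)

def soft_threshold(b, lam):
    # shrink every coefficient toward zero by lam; small ones hit zero
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([3.0, -0.5, 1.2])
print(hard_threshold(b, 1.0))   # large entries kept exactly, small ones zeroed
print(soft_threshold(b, 1.0))   # every surviving entry shrunk by 1.0
```

Hard thresholding is unbiased on retained coefficients but discontinuous; soft thresholding is continuous but biases large coefficients downward by $\lambda$.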
The universal threshold $\lambda_n = \sigma\sqrt{2\log n}$ (where $n$ is the number of observations) has a clean justification: it is the level that the maximum of $n$ i.i.d. $N(0, \sigma^2)$ variables is unlikely to exceed.
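A small simulation illustrates the justification: with $\sigma = 1$, the maximum absolute value of $n$ standard normal draws stays below $\sqrt{2\log n}$ in the large majority of trials (the sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4096
lam = np.sqrt(2 * np.log(n))                       # universal threshold, sigma = 1
maxima = np.abs(rng.normal(size=(200, n))).max(axis=1)
frac = np.mean(maxima <= lam)                      # fraction of trials with all n draws below lam
print(lam, frac)
```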
Main Theorems
Minimax Optimality of Wavelet Thresholding
Statement
Consider the Gaussian white noise model $y_i = f(i/n) + \sigma\varepsilon_i$ with $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$. Soft thresholding with the universal threshold $\lambda_n = \sigma\sqrt{2\log n}$ achieves:
$$\sup_{f \in B^s_{p,q}(C)} \mathbb{E}\,\|\hat f - f\|_2^2 \;\le\; C' \left(\frac{\log n}{n}\right)^{2s/(2s+1)}$$
for $f$ in the Besov ball $B^s_{p,q}(C)$ with $p, q \ge 1$ and $s > 1/p$. This rate is minimax-optimal up to the $\log n$ factor.
Intuition
The rate $n^{-2s/(2s+1)}$ is the standard nonparametric rate for estimating functions with $s$ degrees of smoothness. Wavelets achieve this rate simultaneously over a wide range of Besov spaces without knowing $s$ in advance. The $\log n$ factor is the price of adaptation: a method that knew $s$ could avoid it, but thresholding achieves near-optimal rates without this knowledge.
Proof Sketch
Decompose the risk into bias and variance. For coefficients above the threshold, the variance contribution is $O(\sigma^2/n)$ per retained coefficient. For coefficients below the threshold, the bias contribution is bounded by the coefficient magnitude. The universal threshold balances these: $\lambda_n = \sigma\sqrt{2\log n}$ ensures that, with high probability, all pure-noise coefficients fall below the threshold. The Besov norm controls how many coefficients are large (signal) versus small (noise).
Why It Matters
Wavelet thresholding is one of the few nonparametric methods that is simultaneously adaptive (no tuning of smoothness parameter) and minimax-optimal (up to log factors) over a broad class of function spaces. The risk bounds rely on concentration inequalities to control the noise coefficients. It achieves what kernel methods cannot: adapting to spatially varying smoothness.
Failure Mode
The result requires the noise to be Gaussian (or sub-Gaussian). For heavy-tailed noise, the universal threshold may not control the noise coefficients. The result also requires $f$ to lie in a Besov space, which excludes some function classes (e.g., functions with infinitely many discontinuities of growing amplitude).
Why Wavelets Beat Fourier for Irregular Signals
Fourier truncation keeps the first $m$ frequency components and discards the rest. This is optimal for globally smooth functions but fails at discontinuities: the Gibbs phenomenon creates oscillations near jumps, and the MSE rate drops from $n^{-2s/(2s+1)}$ for $s$-times differentiable functions to $n^{-1/2}$ for piecewise smooth functions, regardless of how smooth each piece is.
Wavelets avoid this because each wavelet coefficient is localized in both position and scale. A discontinuity at position $x_0$ affects only the coefficients whose support contains $x_0$ — $O(1)$ coefficients per resolution level. The coefficients far from the discontinuity decay rapidly. Thresholding retains the large coefficients near the discontinuity (capturing the edge) and discards the small coefficients elsewhere (removing noise).
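This locality is easy to verify for the Haar transform: a single jump produces at most one nonzero detail coefficient per level (NumPy sketch; the jump location and helper name are arbitrary choices):

```python
import numpy as np

def haar_details(x):
    """Detail bands of the orthonormal Haar transform, finest scale first."""
    a, details = x.astype(float), []
    while a.size > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return details

step = (np.arange(1024) >= 300).astype(float)        # single jump at sample 300
counts = [np.count_nonzero(d) for d in haar_details(step)]
print(counts)  # at most one nonzero coefficient per level; zero where the jump
               # falls on a block boundary at that scale
```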
Canonical Examples
Denoising a piecewise-smooth signal
Consider a function $f$ on $[0, 1]$ that is smooth except for a single jump at a point $x_0$.
Fourier truncation: retaining $m$ coefficients gives oscillations near $x_0$ (Gibbs phenomenon). The MSE is dominated by the discontinuity regardless of $m$.
Wavelet thresholding (Daubechies-4, soft threshold): the large coefficients near $x_0$ are retained, preserving the jump. The smooth portions are denoised by removing small coefficients. The MSE adapts to the local smoothness of each region.
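A minimal NumPy comparison of the two estimators on a piecewise-constant signal. The signal, noise level, and Fourier cutoffs are illustrative choices, and Haar with hard thresholding stands in for Daubechies-4 to keep the code self-contained:

```python
import numpy as np

def haar_dwt(x):
    a, details = x.astype(float), []
    while a.size > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return a, details

def haar_idwt(a, details):
    for d in reversed(details):
        x = np.empty(2 * a.size)
        x[0::2], x[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        a = x
    return a

rng = np.random.default_rng(3)
n, sigma = 1024, 0.3
f = np.zeros(n)
f[300:650], f[650:] = 2.0, -1.0             # piecewise constant, jumps at 300 and 650
y = f + rng.normal(0, sigma, n)
mse = lambda g: np.mean((g - f) ** 2)

# Fourier truncation: best MSE over several cutoffs m (favoring Fourier)
fourier_mse = min(
    mse(np.fft.irfft(np.where(np.arange(n // 2 + 1) < m, np.fft.rfft(y), 0), n))
    for m in (16, 32, 64, 128, 256)
)

# Haar hard thresholding with the universal threshold
lam = sigma * np.sqrt(2 * np.log(n))
a, details = haar_dwt(y)
details = [np.where(np.abs(d) > lam, d, 0.0) for d in details]
wavelet_mse = mse(haar_idwt(a, details))

print(fourier_mse, wavelet_mse)  # wavelet MSE is much smaller on this signal
```

Even with the cutoff $m$ tuned with oracle knowledge of $f$, Fourier truncation pays for the jumps at every frequency, while the wavelet estimator isolates them in a handful of coefficients.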
Common Confusions
Wavelets are not always better than Fourier
For globally smooth periodic functions (e.g., trigonometric series), Fourier methods achieve the optimal rate without the $\log n$ penalty. Wavelets pay a logarithmic price for adaptation. If you know your signal is globally smooth, Fourier truncation is simpler and slightly better.
The universal threshold is conservative
The threshold $\lambda_n = \sigma\sqrt{2\log n}$ controls the maximum of all $O(n)$ noise coefficients simultaneously. For any individual coefficient, a much smaller threshold would suffice. This conservatism is why wavelet thresholding sometimes over-smooths. Level-dependent thresholds or cross-validation can improve practical performance at the cost of simplicity.
Exercises
Problem
For $n$ observations with noise level $\sigma$, write down the universal threshold $\lambda_n = \sigma\sqrt{2\log n}$. If the true function has a wavelet coefficient whose empirical estimate has magnitude below $\lambda_n$, would hard thresholding retain or discard it?
Problem
Explain why the Haar wavelet is minimax-optimal over the Besov space $B^1_{1,1}$ (closely related to functions of bounded variation) but not over $B^2_{2,2}$ (the Sobolev space $H^2$). What wavelet property determines the range of optimality?
References
Canonical:
- Donoho & Johnstone, Ideal Spatial Adaptation by Wavelet Shrinkage, Biometrika (1994)
- Mallat, A Wavelet Tour of Signal Processing (2009), Chapters 7, 11
Current:
- Tsybakov, Introduction to Nonparametric Estimation (2009), Chapter 8
- Johnstone, Gaussian Estimation: Sequence and Wavelet Models (2019)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Signals and Systems for ML (Layer 1)