Foundations
Signals and Systems for ML
Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.
Why This Matters
Convolution is everywhere in machine learning, and it comes directly from signal processing. When you use a CNN, you are applying a bank of linear time-invariant filters to your input. When you compute attention efficiently using FFT, you are exploiting the convolution theorem. When you work with audio, speech, or time-series data, you are doing signal processing whether you realize it or not.
Understanding signals and systems gives you the language to reason about:
- Why CNNs work (translation equivariance from convolution)
- How to make attention $O(n \log n)$ instead of $O(n^2)$ (via the convolution theorem)
- What happens when you discretize continuous data (aliasing, Nyquist)
- Frequency-domain analysis of neural network training dynamics
Mental Model
A signal is a function that carries information. A system transforms signals. The key restriction that makes everything tractable is linearity and time invariance: if the system treats all time steps the same way and obeys superposition, then its entire behavior is characterized by a single function (the impulse response), and applying the system to any input reduces to convolution.
The Fourier transform converts convolution (expensive in time domain) into multiplication (cheap in frequency domain). This is the single most important computational trick in signal processing.
Formal Setup and Notation
We work with both continuous-time signals $x(t)$, where $t \in \mathbb{R}$, and discrete-time signals $x[n]$, where $n \in \mathbb{Z}$.
Linear Time-Invariant (LTI) System
A system $T$ is linear if $T\{a x_1 + b x_2\} = a\, T\{x_1\} + b\, T\{x_2\}$ for all signals $x_1, x_2$ and scalars $a, b$.
A system is time-invariant if shifting the input shifts the output by the same amount: if $y(t) = T\{x(t)\}$, then $T\{x(t - t_0)\} = y(t - t_0)$.
An LTI system is completely characterized by its impulse response $h(t) = T\{\delta(t)\}$, the output produced by a unit impulse.
Convolution (Continuous)
The output of an LTI system with impulse response $h$ and input $x$ is:

$$y(t) = (x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau$$
Convolution is commutative ($x * h = h * x$), associative, and distributive over addition.
Convolution (Discrete)
For discrete signals:

$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k]$$
This is exactly what a 1D convolutional layer computes (with finite support for the kernel $h$).
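As a minimal sketch (with a toy signal and kernel chosen for illustration), the discrete convolution sum can be evaluated directly and checked against NumPy's built-in implementation:

```python
import numpy as np

# Toy signal and kernel to illustrate the sum y[n] = sum_k x[k] h[n - k].
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([0.5, 0.25, 0.25])

# Direct evaluation of the convolution sum ("full" output, length len(x)+len(h)-1).
N = len(x) + len(h) - 1
y_direct = np.zeros(N)
for n in range(N):
    for k in range(len(x)):
        if 0 <= n - k < len(h):
            y_direct[n] += x[k] * h[n - k]

# NumPy's np.convolve implements the same sum.
y_np = np.convolve(x, h, mode="full")
print(np.allclose(y_direct, y_np))  # True
```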
The Fourier Transform
Continuous Fourier Transform
The Fourier transform decomposes a signal into its frequency components:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-i 2\pi f t}\, dt$$
The inverse transform reconstructs the signal:

$$x(t) = \int_{-\infty}^{\infty} X(f)\, e^{i 2\pi f t}\, df$$
Here $f$ is frequency in Hz and $i = \sqrt{-1}$.
Discrete Fourier Transform (DFT)
For a finite sequence $x[0], \ldots, x[N-1]$:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1$$
The DFT can be computed in $O(N \log N)$ time using the Fast Fourier Transform (FFT) algorithm, compared to $O(N^2)$ for the naive computation.
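A naive $O(N^2)$ implementation straight from the definition agrees with NumPy's FFT, which computes the same transform in $O(N \log N)$:

```python
import numpy as np

# Naive O(N^2) DFT from the definition X[k] = sum_n x[n] exp(-i 2 pi k n / N),
# built as an N x N matrix-vector product.
def dft_naive(x):
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)  # DFT matrix
    return W @ x

x = np.random.default_rng(0).standard_normal(64)
print(np.allclose(dft_naive(x), np.fft.fft(x)))  # True: same transform
```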
Key properties of the Fourier transform:
- Linearity: $\mathcal{F}\{a x + b y\} = a X + b Y$
- Time shift: shifting $x(t)$ by $t_0$ multiplies $X(f)$ by $e^{-i 2\pi f t_0}$
- Parseval's theorem: energy in the time domain equals energy in the frequency domain, $\int |x(t)|^2\, dt = \int |X(f)|^2\, df$
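Both the time-shift and Parseval properties have exact DFT analogs, which can be checked numerically (a sketch on random data; for the DFT the shift is circular and Parseval carries a $1/N$ factor):

```python
import numpy as np

# DFT analogs of two properties above:
# - circular shift by m multiplies X[k] by exp(-i 2 pi k m / N)
# - Parseval: sum |x[n]|^2 = (1/N) sum |X[k]|^2
rng = np.random.default_rng(1)
N, m = 128, 5
x = rng.standard_normal(N)
X = np.fft.fft(x)

shifted = np.fft.fft(np.roll(x, m))
phase = np.exp(-2j * np.pi * np.arange(N) * m / N)
print(np.allclose(shifted, X * phase))                      # time-shift property
print(np.isclose(np.sum(x**2), np.sum(np.abs(X)**2) / N))   # Parseval
```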
Main Theorems
Convolution Theorem
Statement
Convolution in the time domain corresponds to pointwise multiplication in the frequency domain:

$$\mathcal{F}\{x * h\} = X(f)\, H(f)$$
Equivalently, multiplication in the time domain corresponds to convolution in the frequency domain:

$$\mathcal{F}\{x \cdot h\} = (X * H)(f)$$
Intuition
An LTI system acts independently on each frequency component of the input. The frequency response $H(f)$ tells you how much each frequency is amplified or attenuated. This decomposition works because complex exponentials $e^{i 2\pi f t}$ are eigenfunctions of LTI systems.
Proof Sketch
Substitute the convolution integral into the Fourier transform definition. Exchange the order of integration (justified by Fubini's theorem for absolutely integrable functions). Recognize the inner integral as $H(f)\, e^{-i 2\pi f \tau}$ after the change of variables $u = t - \tau$. The result factors as $X(f)\, H(f)$.
Why It Matters
This theorem is why the FFT is so useful. To convolve two length-$n$ sequences: (1) FFT both ($O(n \log n)$ each), (2) multiply pointwise ($O(n)$), (3) inverse FFT ($O(n \log n)$). Total: $O(n \log n)$ instead of $O(n^2)$ for direct convolution. This is used in efficient attention mechanisms, large-kernel CNNs, and signal processing pipelines.
Failure Mode
The convolution theorem applies to linear convolution, but the DFT computes circular convolution. To get linear convolution from the DFT, you must zero-pad the sequences to length at least $N_1 + N_2 - 1$.
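A short sketch of both behaviors: padding the FFTs recovers linear convolution, while the unpadded DFT wraps the tail of the linear result back onto the head.

```python
import numpy as np

# FFT-based convolution. Without zero-padding, the DFT computes *circular*
# convolution; padding both sequences to length N1 + N2 - 1 recovers the linear result.
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0, 0.5])
L = len(x) + len(h) - 1  # required padded length (here 6)

linear = np.real(np.fft.ifft(np.fft.fft(x, n=L) * np.fft.fft(h, n=L)))
circular = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, n=len(x))))

full = np.convolve(x, h)            # direct linear convolution, length L
print(np.allclose(linear, full))    # True: padded FFT matches linear convolution

# Circular convolution is the linear result with its tail wrapped onto the head:
wrapped = full[:len(x)].copy()
wrapped[:len(h) - 1] += full[len(x):]
print(np.allclose(circular, wrapped))  # True
```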
Nyquist-Shannon Sampling Theorem
Statement
A continuous-time signal that is bandlimited to $B$ Hz (i.e., $X(f) = 0$ for $|f| > B$) can be perfectly reconstructed from its samples $x(nT)$ if the sampling rate $f_s = 1/T$ satisfies:

$$f_s > 2B$$
The minimum sampling rate $2B$ is called the Nyquist rate. Reconstruction is given by:

$$x(t) = \sum_{n=-\infty}^{\infty} x(nT)\, \operatorname{sinc}\!\left(\frac{t - nT}{T}\right), \qquad \operatorname{sinc}(u) = \frac{\sin(\pi u)}{\pi u}$$
Intuition
A bandlimited signal has a finite amount of information per unit time (determined by its bandwidth). If you sample fast enough, you capture all that information. If you sample too slowly, high frequencies masquerade as low frequencies (aliasing), and you lose information irreversibly.
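Aliasing is easy to demonstrate numerically (the specific frequencies here are illustrative): a 7 kHz tone sampled at 10 kHz (Nyquist limit 5 kHz) produces exactly the same samples as a 3 kHz tone, because 7 kHz folds down to $|7 - 10| = 3$ kHz.

```python
import numpy as np

# Aliasing demo: a tone above the Nyquist frequency is indistinguishable,
# after sampling, from a tone folded below it.
fs = 10_000.0                 # sampling rate: 10 kHz, so Nyquist limit is 5 kHz
n = np.arange(50)
t = n / fs
high = np.cos(2 * np.pi * 7_000 * t)  # 7 kHz tone: above Nyquist
low = np.cos(2 * np.pi * 3_000 * t)   # 3 kHz tone: its alias

print(np.allclose(high, low))  # True: identical samples, information lost
```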
Proof Sketch
Sampling in the time domain corresponds to periodizing the spectrum with period $f_s$. If $f_s > 2B$, the copies of the spectrum do not overlap, and you can recover the original spectrum with an ideal low-pass filter. If $f_s < 2B$, the copies overlap (alias), and perfect reconstruction is impossible.
Why It Matters
This theorem governs every analog-to-digital conversion: audio recording (44.1 kHz for 20 kHz bandwidth), medical imaging, sensor networks. In ML, it explains aliasing artifacts in CNNs when downsampling without proper anti-aliasing filters, and why strided convolutions can lose information.
Failure Mode
Real signals are never truly bandlimited (a bandlimited signal cannot have finite duration in time, and vice versa). In practice, we anti-alias by low-pass filtering before sampling, which removes high-frequency content rather than perfectly preserving it.
Proof Ideas and Templates Used
The two main proof techniques are:
- Eigenfunction analysis: complex exponentials are eigenfunctions of LTI systems, which is why the Fourier basis diagonalizes convolution
- Duality between domains: operations that are hard in one domain become easy in the other (convolution becomes multiplication, sampling becomes periodization)
Canonical Examples
CNN as a bank of LTI filters
A 1D convolutional layer with $C$ filters of kernel size $K$ applies $C$ discrete convolutions to the input. Each filter has impulse response $h[k] = w[k]$ for $k = 0, \ldots, K-1$, where $w$ is its learned weight vector. The output is translation equivariant: shifting the input shifts the output by the same amount. This is exactly the LTI property, and it is why CNNs are effective for signals with spatial or temporal structure.
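The equivariance claim can be verified directly (a sketch with a random stand-in for a learned kernel): delaying the zero-padded input by $s$ samples delays the convolution output by exactly $s$.

```python
import numpy as np

# Translation equivariance of convolution: shift in, shift out.
rng = np.random.default_rng(2)
x = rng.standard_normal(64)
w = rng.standard_normal(5)    # stand-in for a learned filter's weights
s = 3

x_delayed = np.concatenate([np.zeros(s), x])  # input delayed by s samples
y = np.convolve(x, w)
y_delayed = np.convolve(x_delayed, w)

print(np.allclose(y_delayed[s:], y))   # True: output delayed by the same s
print(np.allclose(y_delayed[:s], 0))   # True: nothing appears before the shift
```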
FFT-based efficient attention
Standard self-attention computes in $O(n^2)$ time for a length-$n$ sequence. Some efficient attention variants reformulate the computation as a convolution (e.g., when attention weights depend only on relative position). Using the convolution theorem, the convolution can be computed via FFT in $O(n \log n)$ time. This is the basis of approaches like FNet and certain linear attention mechanisms.
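A minimal sketch of FNet-style token mixing (an illustration of the idea, not the paper's full architecture): the attention sublayer is replaced by a parameter-free Fourier transform over the sequence and hidden dimensions, keeping the real part.

```python
import numpy as np

# FNet-style mixing sketch: a 2D FFT couples every output position to every
# input token, with no learned parameters in the mixing step.
def fourier_mixing(x):
    # x: (seq_len, hidden) array of token embeddings
    return np.real(np.fft.fft2(x))

tokens = np.random.default_rng(3).standard_normal((8, 16))
mixed = fourier_mixing(tokens)
print(mixed.shape)  # (8, 16): same shape, globally mixed
```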
Common Confusions
Convolution vs cross-correlation
In signal processing, convolution flips the kernel: $(x * h)[n] = \sum_k x[k]\, h[n - k]$. In most deep learning frameworks, the "convolution" layer actually computes cross-correlation (no flip): $(x \star h)[n] = \sum_k x[n + k]\, h[k]$. Since the kernel weights are learned, the flip is absorbed into the learned parameters. The distinction matters when analyzing pre-defined filters but not when training.
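The flip relation is easy to check: cross-correlation with a kernel equals convolution with the reversed kernel, which is exactly why the distinction vanishes once the kernel is learned.

```python
import numpy as np

# Cross-correlation with w == convolution with the reversed kernel w[::-1].
x = np.arange(6, dtype=float)
w = np.array([1.0, 2.0, 3.0])

xcorr = np.correlate(x, w, mode="full")              # no flip
conv_flipped = np.convolve(x, w[::-1], mode="full")  # flip, then convolve
print(np.allclose(xcorr, conv_flipped))  # True
```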
DFT vs continuous Fourier transform
The DFT operates on finite, discrete sequences and produces finite, discrete frequency representations. The continuous Fourier transform operates on continuous functions and produces continuous spectra. The DFT is a sampled version of the DTFT (Discrete-Time Fourier Transform), which itself is the discrete-time analog of the continuous Fourier transform.
Aliasing in CNNs
When a CNN uses strided convolution or max pooling to downsample, it can violate the sampling theorem if the feature maps contain frequencies above the new Nyquist rate. This causes aliasing: small shifts in the input can cause large changes in the output, hurting translation invariance. Anti-aliased CNNs (Zhang 2019) add a low-pass filter before downsampling.
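A 1D sketch of anti-aliased downsampling in the spirit of BlurPool (Zhang 2019): low-pass filter with a small binomial kernel before taking every second sample. The `[1, 2, 1]/4` taps are one standard choice; the paper evaluates several filter sizes.

```python
import numpy as np

# Blur-then-subsample: attenuate frequencies above the new Nyquist rate
# before the stride discards samples.
def blur_downsample(x, stride=2):
    blur = np.array([1.0, 2.0, 1.0]) / 4.0        # binomial low-pass filter
    smoothed = np.convolve(x, blur, mode="same")  # suppress high frequencies
    return smoothed[::stride]                     # then subsample safely

x = np.random.default_rng(4).standard_normal(32)
print(blur_downsample(x).shape)  # (16,)
```

The extra cost is one depthwise low-pass convolution per downsampling layer, which is small compared to the surrounding learned convolutions.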
Summary
- An LTI system is completely characterized by its impulse response
- Convolution computes the output of an LTI system given its impulse response and input
- The Fourier transform converts convolution into multiplication: $O(n^2)$ direct convolution becomes $O(n \log n)$ via FFT
- The sampling theorem: sample at $f_s > 2B$ to avoid aliasing
- CNNs are banks of discrete convolutional filters
- Many efficiency gains in ML come from the convolution theorem
Exercises
Problem
A continuous signal has frequency content up to 8 kHz. What is the minimum sampling rate required to avoid aliasing? If you sample at 12 kHz instead, what happens?
Problem
You want to convolve two sequences of lengths 100 and 50. Compare the computational cost of direct convolution vs FFT-based convolution.
Problem
Explain why strided convolution in a CNN can violate the sampling theorem, and describe how anti-aliased downsampling fixes this. What is the computational cost of the fix?
References
Canonical:
- Oppenheim & Willsky, Signals and Systems (1997), Chapters 2-5
- Haykin & Van Veen, Signals and Systems (2003), Chapters 3-4
Current:
- Zhang, "Making Convolutional Networks Shift-Invariant Again" (ICML 2019)
- Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" (2022)
Next Topics
The natural next step from signals and systems:
- Convolutional neural networks: applying convolution to learn features
Last reviewed: April 2026
Builds on This
- Speech and Audio ML (Layer 3)
- Wavelet Smoothing (Layer 2)