Beta. Content is under active construction and has not been peer-reviewed.

Foundations

Signals and Systems for ML

Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.


Why This Matters

Convolution is everywhere in machine learning, and it comes directly from signal processing. When you use a CNN, you are applying a bank of linear time-invariant filters to your input. When you compute attention efficiently using FFT, you are exploiting the convolution theorem. When you work with audio, speech, or time-series data, you are doing signal processing whether you realize it or not.

Understanding signals and systems gives you the language to reason about:

  • Why CNNs work (translation equivariance from convolution)
  • How to make attention $O(n \log n)$ instead of $O(n^2)$
  • What happens when you discretize continuous data (aliasing, Nyquist)
  • Frequency-domain analysis of neural network training dynamics

Mental Model

A signal is a function that carries information. A system transforms signals. The key restriction that makes everything tractable is linearity and time invariance: if the system treats all time steps the same way and obeys superposition, then its entire behavior is characterized by a single function (the impulse response), and applying the system to any input reduces to convolution.

The Fourier transform converts convolution (expensive in time domain) into multiplication (cheap in frequency domain). This is the single most important computational trick in signal processing.

Formal Setup and Notation

We work with both continuous-time signals $x(t)$ where $t \in \mathbb{R}$ and discrete-time signals $x[n]$ where $n \in \mathbb{Z}$.

Definition

Linear Time-Invariant (LTI) System

A system $T$ is linear if $T\{ax_1 + bx_2\} = aT\{x_1\} + bT\{x_2\}$ for all signals $x_1, x_2$ and scalars $a, b$.

A system is time-invariant if shifting the input shifts the output by the same amount: if $y(t) = T\{x(t)\}$, then $T\{x(t - t_0)\} = y(t - t_0)$.

An LTI system is completely characterized by its impulse response $h(t)$.

Definition

Convolution (Continuous)

The output of an LTI system with impulse response $h$ and input $x$ is:

$$(x * h)(t) = \int_{-\infty}^{\infty} x(\tau) \, h(t - \tau) \, d\tau$$

Convolution is commutative ($x * h = h * x$), associative, and distributive over addition.

Definition

Convolution (Discrete)

For discrete signals:

$$(x * h)[n] = \sum_{k=-\infty}^{\infty} x[k] \, h[n - k]$$

This is exactly what a 1D convolutional layer computes (with finite support for the kernel $h$).
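The sum above can be sketched directly in NumPy. This is a minimal illustration of the definition, checked against `np.convolve` (the `conv_direct` helper name is chosen here for illustration):

```python
import numpy as np

# Discrete convolution by the definition: (x * h)[n] = sum_k x[k] h[n - k].
# For finite sequences of lengths N and K, the full output has length N + K - 1.
def conv_direct(x, h):
    N, K = len(x), len(h)
    y = np.zeros(N + K - 1)
    for n in range(N + K - 1):
        for k in range(N):
            if 0 <= n - k < K:          # kernel has finite support
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0])               # first-difference filter
print(conv_direct(x, h))                # matches np.convolve(x, h)
```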

The Fourier Transform

Definition

Continuous Fourier Transform

The Fourier transform decomposes a signal into its frequency components:

$$X(f) = \int_{-\infty}^{\infty} x(t) \, e^{-j2\pi ft} \, dt$$

The inverse transform reconstructs the signal:

$$x(t) = \int_{-\infty}^{\infty} X(f) \, e^{j2\pi ft} \, df$$

Here $f$ is frequency in Hz and $j = \sqrt{-1}$.

Definition

Discrete Fourier Transform (DFT)

For a finite sequence $x[0], x[1], \ldots, x[N-1]$:

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi kn/N}, \quad k = 0, 1, \ldots, N-1$$

The DFT can be computed in $O(N \log N)$ time using the Fast Fourier Transform (FFT) algorithm, compared to $O(N^2)$ for the naive computation.

Key properties of the Fourier transform:

  • Linearity: $\mathcal{F}\{ax + by\} = aX + bY$
  • Time shift: shifting $x(t)$ by $t_0$ multiplies $X(f)$ by $e^{-j2\pi f t_0}$
  • Parseval's theorem: energy in the time domain equals energy in the frequency domain
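Both the DFT definition and Parseval's theorem are easy to check numerically against NumPy's FFT. The `dft_naive` helper below is illustrative (a direct $O(N^2)$ transcription of the sum), not a production implementation:

```python
import numpy as np

# Naive O(N^2) DFT straight from the definition, via the DFT matrix
# W[k, n] = exp(-2*pi*j*k*n/N).
def dft_naive(x):
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)
    return W @ x

x = np.random.default_rng(0).standard_normal(64)
X = dft_naive(x)
assert np.allclose(X, np.fft.fft(x))                 # agrees with the FFT

# Parseval in the DFT convention: sum |x[n]|^2 == (1/N) sum |X[k]|^2
assert np.isclose(np.sum(np.abs(x)**2), np.sum(np.abs(X)**2) / len(x))
```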

Main Theorems

Theorem

Convolution Theorem

Statement

Convolution in the time domain corresponds to pointwise multiplication in the frequency domain:

$$\mathcal{F}\{x * h\} = X(f) \cdot H(f)$$

Equivalently, multiplication in the time domain corresponds to convolution in the frequency domain:

$$\mathcal{F}\{x \cdot h\} = X * H$$

Intuition

An LTI system acts independently on each frequency component of the input. The frequency response $H(f)$ tells you how much each frequency is amplified or attenuated. This decomposition works because complex exponentials $e^{j2\pi ft}$ are eigenfunctions of LTI systems.

Proof Sketch

Substitute the convolution integral into the Fourier transform definition and exchange the order of integration (justified by Fubini's theorem for absolutely integrable functions). After the change of variables $u = t - \tau$, the inner integral becomes $H(f) \, e^{-j2\pi f\tau}$; the remaining integral over $\tau$ is then $X(f)$, so the result factors as $X(f) \cdot H(f)$.

Why It Matters

This theorem is why the FFT is so useful. To convolve two length-$N$ sequences: (1) FFT both ($O(N \log N)$ each), (2) multiply pointwise ($O(N)$), (3) inverse FFT ($O(N \log N)$). Total: $O(N \log N)$ instead of $O(N^2)$ for direct convolution. This is used in efficient attention mechanisms, large-kernel CNNs, and signal processing pipelines.

Failure Mode

The convolution theorem applies to linear convolution, but the DFT computes circular convolution. To get linear convolution from the DFT, you must zero-pad the sequences to length $\geq N_1 + N_2 - 1$.
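A sketch of FFT-based linear convolution with exactly the zero-padding this failure mode requires (the `fft_convolve` name is chosen here for illustration):

```python
import numpy as np

# Linear convolution via the FFT: zero-pad both sequences to length
# N1 + N2 - 1, multiply spectra pointwise, inverse-transform. Without the
# padding, the DFT would give circular convolution instead.
def fft_convolve(x, h):
    n_out = len(x) + len(h) - 1
    X = np.fft.rfft(x, n=n_out)      # n= pads with zeros to length n_out
    H = np.fft.rfft(h, n=n_out)
    return np.fft.irfft(X * H, n=n_out)

x = np.random.default_rng(1).standard_normal(100)
h = np.random.default_rng(2).standard_normal(50)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```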

Theorem

Nyquist-Shannon Sampling Theorem

Statement

A continuous-time signal $x(t)$ that is bandlimited to $B$ Hz (i.e., $X(f) = 0$ for $|f| > B$) can be perfectly reconstructed from its samples $x[n] = x(nT_s)$ if the sampling rate satisfies:

$$f_s = \frac{1}{T_s} \geq 2B$$

The minimum sampling rate $2B$ is called the Nyquist rate. Reconstruction is given by:

$$x(t) = \sum_{n=-\infty}^{\infty} x[n] \, \mathrm{sinc}\!\left(\frac{t - nT_s}{T_s}\right)$$

Intuition

A bandlimited signal has a finite amount of information per unit time (determined by its bandwidth). If you sample fast enough, you capture all that information. If you sample too slowly, high frequencies masquerade as low frequencies (aliasing), and you lose information irreversibly.

Proof Sketch

Sampling in the time domain corresponds to periodizing the spectrum with period $f_s$. If $f_s \geq 2B$, the copies of the spectrum do not overlap, and you can recover the original spectrum with an ideal low-pass filter. If $f_s < 2B$, the copies overlap (alias), and perfect reconstruction is impossible.

Why It Matters

This theorem governs every analog-to-digital conversion: audio recording (44.1 kHz for 20 kHz bandwidth), medical imaging, sensor networks. In ML, it explains aliasing artifacts in CNNs when downsampling without proper anti-aliasing filters, and why strided convolutions can lose information.

Failure Mode

Real signals are never truly bandlimited (a bandlimited signal cannot have finite duration in time, and vice versa). In practice, we anti-alias by low-pass filtering before sampling, which removes high-frequency content rather than perfectly preserving it.
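A quick numerical illustration of aliasing, assuming NumPy: a 7 kHz sine sampled at 10 kHz (Nyquist frequency 5 kHz) is indistinguishable from a 3 kHz tone, because $|7 - 10| = 3$ kHz:

```python
import numpy as np

fs = 10_000            # sampling rate in Hz (below the 14 kHz a 7 kHz tone needs)
f_true = 7_000         # actual tone frequency
N = 1_000
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f_true * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N, d=1/fs)
print(freqs[np.argmax(spectrum)])   # 3000.0 -- the 7 kHz tone aliased to 3 kHz
```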

Proof Ideas and Templates Used

The two main proof techniques are:

  1. Eigenfunction analysis: complex exponentials are eigenfunctions of LTI systems, which is why the Fourier basis diagonalizes convolution
  2. Duality between domains: operations that are hard in one domain become easy in the other (convolution becomes multiplication, sampling becomes periodization)

Canonical Examples

Example

CNN as a bank of LTI filters

A 1D convolutional layer with $C_{\text{out}}$ filters of kernel size $K$ applies $C_{\text{out}}$ discrete convolutions to the input. Each filter has impulse response $h[n]$ for $n = 0, \ldots, K-1$. The output is translation equivariant: shifting the input shifts the output by the same amount. This is exactly the LTI property, and it is why CNNs are effective for signals with spatial or temporal structure.
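The translation-equivariance claim can be checked numerically. The sketch below uses circular convolution (via the FFT) and circular shifts to sidestep boundary effects; the `circ_conv` helper is illustrative:

```python
import numpy as np

# Circular convolution of x with kernel h, same output length as x.
def circ_conv(x, h):
    n = len(x)
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h, n=n), n=n)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
h = rng.standard_normal(5)
shift = 7

# Convolve then shift == shift then convolve: the LTI property.
y_then_shift = np.roll(circ_conv(x, h), shift)
shift_then_y = circ_conv(np.roll(x, shift), h)
assert np.allclose(y_then_shift, shift_then_y)
```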

Example

FFT-based efficient attention

Standard self-attention computes $\text{Attention}(Q, K, V)$ in $O(n^2 d)$ time. Some efficient attention variants reformulate the computation as a convolution (e.g., when attention weights depend only on relative position). Using the convolution theorem, the convolution can be computed via FFT in $O(n \log n \cdot d)$ time. This is the basis of approaches like FNet and certain linear attention mechanisms.
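As a minimal sketch of the Fourier-mixing idea, assuming NumPy: FNet replaces the attention sublayer with the real part of a 2D FFT over the sequence and hidden dimensions. This is only the unparameterized mixing step, not a full attention replacement:

```python
import numpy as np

# FNet-style token mixing: O(n log n) mixing across tokens with no learned
# attention weights, keeping only the real part of the transform.
def fnet_mixing(x):
    # x has shape (seq_len, d_model)
    return np.fft.fft2(x).real

x = np.random.default_rng(0).standard_normal((128, 64))
mixed = fnet_mixing(x)
assert mixed.shape == x.shape   # shape-preserving, like an attention sublayer
```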

Common Confusions

Watch Out

Convolution vs cross-correlation

In signal processing, convolution flips the kernel: $(x * h)[n] = \sum_k x[k] \, h[n-k]$. In most deep learning frameworks, the "convolution" layer actually computes cross-correlation (no flip): $\sum_k x[k] \, h[n+k]$. Since the kernel weights are learned, the flip is absorbed into the learned parameters. The distinction matters when analyzing pre-defined filters but not when training.
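NumPy makes the flip explicit: cross-correlation with a kernel equals convolution with the reversed kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])

# Correlation slides the kernel without flipping; convolution flips it.
# So correlating with h is the same as convolving with h[::-1].
corr = np.correlate(x, h, mode='full')
conv = np.convolve(x, h[::-1], mode='full')
assert np.allclose(corr, conv)
```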

Watch Out

DFT vs continuous Fourier transform

The DFT operates on finite, discrete sequences and produces finite, discrete frequency representations. The continuous Fourier transform operates on continuous functions and produces continuous spectra. The DFT is a sampled version of the DTFT (Discrete-Time Fourier Transform), which itself is the discrete-time analog of the continuous Fourier transform.

Watch Out

Aliasing in CNNs

When a CNN uses strided convolution or max pooling to downsample, it can violate the sampling theorem if the feature maps contain frequencies above the new Nyquist rate. This causes aliasing: small shifts in the input can cause large changes in the output, hurting translation invariance. Anti-aliased CNNs (Zhang 2019) add a low-pass filter before downsampling.
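A 1D sketch of the idea, assuming NumPy. The binomial blur kernel follows the BlurPool recipe from Zhang (2019); the function names are illustrative. The test signal alternates $+1/-1$ (a frequency at the Nyquist limit), which naive stride-2 subsampling aliases to a constant:

```python
import numpy as np

def downsample_naive(x, stride=2):
    return x[::stride]

def downsample_antialiased(x, stride=2):
    blur = np.array([1.0, 2.0, 1.0]) / 4.0        # binomial low-pass kernel
    smoothed = np.convolve(x, blur, mode='same')  # remove high frequencies first
    return smoothed[::stride]

x = np.array([1.0, -1.0] * 8)       # oscillation at the Nyquist frequency
print(downsample_naive(x))          # all +1: the oscillation aliased to DC
print(downsample_antialiased(x))    # near zero (small boundary effects)
```

The extra cost of the fix is one small convolution per downsampling layer, which is cheap relative to the surrounding conv layers.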

Summary

  • An LTI system is completely characterized by its impulse response
  • Convolution computes the output of an LTI system given its impulse response and input
  • The Fourier transform converts convolution to multiplication: $O(N^2)$ becomes $O(N \log N)$ via FFT
  • The sampling theorem: sample at $\geq 2B$ to avoid aliasing
  • CNNs are banks of discrete convolutional filters
  • Many efficiency gains in ML come from the convolution theorem

Exercises

Exercise (Core)

Problem

A continuous signal has frequency content up to 8 kHz. What is the minimum sampling rate required to avoid aliasing? If you sample at 12 kHz instead, what happens?

Exercise (Core)

Problem

You want to convolve two sequences of lengths 100 and 50. Compare the computational cost of direct convolution vs FFT-based convolution.

Exercise (Advanced)

Problem

Explain why strided convolution in a CNN can violate the sampling theorem, and describe how anti-aliased downsampling fixes this. What is the computational cost of the fix?

References

Canonical:

  • Oppenheim & Willsky, Signals and Systems (1997), Chapters 2-5
  • Haykin & Van Veen, Signals and Systems (2003), Chapters 3-4

Current:

  • Zhang, "Making Convolutional Networks Shift-Invariant Again" (ICML 2019)
  • Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" (2022)

Last reviewed: April 2026
