Beta. Content is under active construction and has not been peer-reviewed.

Foundations

Signals and Systems for ML

Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.


Why This Matters

Convolution is everywhere in machine learning, and it comes directly from signal processing. When you use a CNN, you are applying a bank of linear time-invariant filters to your input. When you compute attention efficiently using FFT, you are exploiting the convolution theorem. When you work with audio, speech, or time-series data, you are doing signal processing whether you realize it or not.

Understanding signals and systems gives you the language to reason about:

  • Why CNNs work (translation equivariance from convolution)
  • How to make attention $O(n \log n)$ instead of $O(n^2)$
  • What happens when you discretize continuous data (aliasing, Nyquist)
  • Frequency-domain analysis of neural network training dynamics

Mental Model

A signal is a function that carries information. A system transforms signals. The key restriction that makes everything tractable is linearity and time invariance: if the system treats all time steps the same way and obeys superposition, then its entire behavior is characterized by a single function (the impulse response), and applying the system to any input reduces to convolution.

The Fourier transform converts convolution (expensive in time domain) into multiplication (cheap in frequency domain). This is the single most important computational trick in signal processing.

Formal Setup and Notation

We work with both continuous-time signals $x(t)$ where $t \in \mathbb{R}$ and discrete-time signals $x[n]$ where $n \in \mathbb{Z}$.

Definition

Linear Time-Invariant (LTI) System

A system $T$ is linear if $T\{ax_1 + bx_2\} = aT\{x_1\} + bT\{x_2\}$ for all signals $x_1, x_2$ and scalars $a, b$.

A system is time-invariant if shifting the input shifts the output by the same amount: if $y(t) = T\{x(t)\}$, then $T\{x(t - t_0)\} = y(t - t_0)$.

An LTI system is completely characterized by its impulse response $h(t)$.

Definition

Convolution (Continuous)

The output of an LTI system with impulse response $h$ and input $x$ is:

$$(x * h)(t) = \int_{-\infty}^{\infty} x(\tau) \, h(t - \tau) \, d\tau$$

Convolution is commutative ($x * h = h * x$), associative, and distributive over addition.

Definition

Convolution (Discrete)

For discrete signals:

$$(x * h)[n] = \sum_{k=-\infty}^{\infty} x[k] \, h[n - k]$$

This is exactly what a 1D convolutional layer computes (with finite support for the kernel $h$).
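The sum above can be sketched directly in NumPy. This is a minimal illustration of the definition, checked against `np.convolve` (the `conv_direct` helper name is chosen here for illustration):

```python
import numpy as np

# Discrete convolution by the definition: (x * h)[n] = sum_k x[k] h[n - k].
# For finite sequences of lengths N and K, the full output has length N + K - 1.
def conv_direct(x, h):
    N, K = len(x), len(h)
    y = np.zeros(N + K - 1)
    for n in range(N + K - 1):
        for k in range(N):
            if 0 <= n - k < K:          # kernel has finite support
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0])               # first-difference filter
print(conv_direct(x, h))                # matches np.convolve(x, h)
```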

The Fourier Transform

Definition

Continuous Fourier Transform

The Fourier transform decomposes a signal into its frequency components:

$$X(f) = \int_{-\infty}^{\infty} x(t) \, e^{-j2\pi ft} \, dt$$

The inverse transform reconstructs the signal:

$$x(t) = \int_{-\infty}^{\infty} X(f) \, e^{j2\pi ft} \, df$$

Here $f$ is frequency in Hz and $j = \sqrt{-1}$.

Definition

Discrete Fourier Transform (DFT)

For a finite sequence $x[0], x[1], \ldots, x[N-1]$:

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi kn/N}, \quad k = 0, 1, \ldots, N-1$$

The DFT can be computed in $O(N \log N)$ time using the Fast Fourier Transform (FFT) algorithm, compared to $O(N^2)$ for the naive computation.

Key properties of the Fourier transform:

  • Linearity: $\mathcal{F}\{ax + by\} = aX + bY$
  • Time shift: shifting $x(t)$ by $t_0$ multiplies $X(f)$ by $e^{-j2\pi f t_0}$
  • Parseval's theorem: energy in the time domain equals energy in the frequency domain
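Both the DFT definition and Parseval's theorem are easy to check numerically against NumPy's FFT. The `dft_naive` helper below is illustrative (a direct $O(N^2)$ transcription of the sum), not a production implementation:

```python
import numpy as np

# Naive O(N^2) DFT straight from the definition, via the DFT matrix
# W[k, n] = exp(-2*pi*j*k*n/N).
def dft_naive(x):
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)
    return W @ x

x = np.random.default_rng(0).standard_normal(64)
X = dft_naive(x)
assert np.allclose(X, np.fft.fft(x))                 # agrees with the FFT

# Parseval in the DFT convention: sum |x[n]|^2 == (1/N) sum |X[k]|^2
assert np.isclose(np.sum(np.abs(x)**2), np.sum(np.abs(X)**2) / len(x))
```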

Main Theorems

Theorem

Convolution Theorem

Statement

Convolution in the time domain corresponds to pointwise multiplication in the frequency domain:

$$\mathcal{F}\{x * h\} = X(f) \cdot H(f)$$

Equivalently, multiplication in the time domain corresponds to convolution in the frequency domain:

$$\mathcal{F}\{x \cdot h\} = X * H$$

Intuition

An LTI system acts independently on each frequency component of the input. The frequency response $H(f)$ tells you how much each frequency is amplified or attenuated. This decomposition works because complex exponentials $e^{j2\pi ft}$ are eigenfunctions of LTI systems.

Proof Sketch

Substitute the convolution integral into the Fourier transform definition and exchange the order of integration (justified by Fubini's theorem for absolutely integrable functions). After the change of variables $u = t - \tau$, the inner integral becomes $H(f) \, e^{-j2\pi f\tau}$; the remaining integral over $\tau$ is then $X(f)$, so the result factors as $X(f) \cdot H(f)$.

Why It Matters

This theorem is why the FFT is so useful. To convolve two length-$N$ sequences: (1) FFT both ($O(N \log N)$ each), (2) multiply pointwise ($O(N)$), (3) inverse FFT ($O(N \log N)$). Total: $O(N \log N)$ instead of $O(N^2)$ for direct convolution. This is used in efficient attention mechanisms, large-kernel CNNs, and signal processing pipelines.

Failure Mode

The convolution theorem applies to linear convolution, but the DFT computes circular convolution. To get linear convolution from the DFT, you must zero-pad the sequences to length $\geq N_1 + N_2 - 1$.
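A sketch of FFT-based linear convolution with exactly the zero-padding this failure mode requires (the `fft_convolve` name is chosen here for illustration):

```python
import numpy as np

# Linear convolution via the FFT: zero-pad both sequences to length
# N1 + N2 - 1, multiply spectra pointwise, inverse-transform. Without the
# padding, the DFT would give circular convolution instead.
def fft_convolve(x, h):
    n_out = len(x) + len(h) - 1
    X = np.fft.rfft(x, n=n_out)      # n= pads with zeros to length n_out
    H = np.fft.rfft(h, n=n_out)
    return np.fft.irfft(X * H, n=n_out)

x = np.random.default_rng(1).standard_normal(100)
h = np.random.default_rng(2).standard_normal(50)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```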

Theorem

Nyquist-Shannon Sampling Theorem

Statement

A continuous-time signal $x(t)$ that is bandlimited to $B$ Hz (i.e., $X(f) = 0$ for $|f| > B$) can be perfectly reconstructed from its samples $x[n] = x(nT_s)$ if the sampling rate satisfies:

$$f_s = \frac{1}{T_s} \geq 2B$$

The minimum sampling rate $2B$ is called the Nyquist rate. Reconstruction is given by:

$$x(t) = \sum_{n=-\infty}^{\infty} x[n] \, \mathrm{sinc}\!\left(\frac{t - nT_s}{T_s}\right)$$

Intuition

A bandlimited signal has a finite amount of information per unit time (determined by its bandwidth). If you sample fast enough, you capture all that information. If you sample too slowly, high frequencies masquerade as low frequencies (aliasing), and you lose information irreversibly.

Proof Sketch

Sampling in the time domain corresponds to periodizing the spectrum with period $f_s$. If $f_s \geq 2B$, the copies of the spectrum do not overlap, and you can recover the original spectrum with an ideal low-pass filter. If $f_s < 2B$, the copies overlap (alias), and perfect reconstruction is impossible.

Why It Matters

This theorem governs every analog-to-digital conversion: audio recording (44.1 kHz for 20 kHz bandwidth), medical imaging, sensor networks. In ML, it explains aliasing artifacts in CNNs when downsampling without proper anti-aliasing filters, and why strided convolutions can lose information.

Failure Mode

Real signals are never truly bandlimited (a bandlimited signal cannot have finite duration in time, and vice versa). In practice, we anti-alias by low-pass filtering before sampling, which removes high-frequency content rather than perfectly preserving it.
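A quick numerical illustration of aliasing, assuming NumPy: a 7 kHz sine sampled at 10 kHz (Nyquist frequency 5 kHz) is indistinguishable from a 3 kHz tone, because $|7 - 10| = 3$ kHz:

```python
import numpy as np

fs = 10_000            # sampling rate in Hz (below the 14 kHz a 7 kHz tone needs)
f_true = 7_000         # actual tone frequency
N = 1_000
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f_true * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N, d=1/fs)
print(freqs[np.argmax(spectrum)])   # 3000.0 -- the 7 kHz tone aliased to 3 kHz
```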

Proof Ideas and Templates Used

The two main proof techniques are:

  1. Eigenfunction analysis: complex exponentials are eigenfunctions of LTI systems, which is why the Fourier basis diagonalizes convolution
  2. Duality between domains: operations that are hard in one domain become easy in the other (convolution becomes multiplication, sampling becomes periodization)

Canonical Examples

Example

CNN as a bank of LTI filters

A 1D convolutional layer with $C_{\text{out}}$ filters of kernel size $K$ applies $C_{\text{out}}$ discrete convolutions to the input. Each filter has impulse response $h[n]$ for $n = 0, \ldots, K-1$. The output is translation equivariant: shifting the input shifts the output by the same amount. This is exactly the LTI property, and it is why CNNs are effective for signals with spatial or temporal structure.
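The translation-equivariance claim can be checked numerically. The sketch below uses circular convolution (via the FFT) and circular shifts to sidestep boundary effects; the `circ_conv` helper is illustrative:

```python
import numpy as np

# Circular convolution of x with kernel h, same output length as x.
def circ_conv(x, h):
    n = len(x)
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h, n=n), n=n)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
h = rng.standard_normal(5)
shift = 7

# Convolve then shift == shift then convolve: the LTI property.
y_then_shift = np.roll(circ_conv(x, h), shift)
shift_then_y = circ_conv(np.roll(x, shift), h)
assert np.allclose(y_then_shift, shift_then_y)
```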

Example

FFT-based efficient attention

Standard self-attention computes $\text{Attention}(Q, K, V)$ in $O(n^2 d)$ time. Some efficient attention variants reformulate the computation as a convolution (e.g., when attention weights depend only on relative position). Using the convolution theorem, the convolution can be computed via FFT in $O(n \log n \cdot d)$ time. This is the basis of approaches like FNet and certain linear attention mechanisms.
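As a minimal sketch of the Fourier-mixing idea, assuming NumPy: FNet replaces the attention sublayer with the real part of a 2D FFT over the sequence and hidden dimensions. This is only the unparameterized mixing step, not a full attention replacement:

```python
import numpy as np

# FNet-style token mixing: O(n log n) mixing across tokens with no learned
# attention weights, keeping only the real part of the transform.
def fnet_mixing(x):
    # x has shape (seq_len, d_model)
    return np.fft.fft2(x).real

x = np.random.default_rng(0).standard_normal((128, 64))
mixed = fnet_mixing(x)
assert mixed.shape == x.shape   # shape-preserving, like an attention sublayer
```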

Common Confusions

Watch Out

Convolution vs cross-correlation

In signal processing, convolution flips the kernel: $(x * h)[n] = \sum_k x[k] \, h[n-k]$. In most deep learning frameworks, the "convolution" layer actually computes cross-correlation (no flip): $\sum_k x[k] \, h[n+k]$. Since the kernel weights are learned, the flip is absorbed into the learned parameters. The distinction matters when analyzing pre-defined filters but not when training.
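NumPy makes the flip explicit: cross-correlation with a kernel equals convolution with the reversed kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])

# Correlation slides the kernel without flipping; convolution flips it.
# So correlating with h is the same as convolving with h[::-1].
corr = np.correlate(x, h, mode='full')
conv = np.convolve(x, h[::-1], mode='full')
assert np.allclose(corr, conv)
```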

Watch Out

DFT vs continuous Fourier transform

The DFT operates on finite, discrete sequences and produces finite, discrete frequency representations. The continuous Fourier transform operates on continuous functions and produces continuous spectra. The DFT is a sampled version of the DTFT (Discrete-Time Fourier Transform), which itself is the discrete-time analog of the continuous Fourier transform.

Watch Out

Aliasing in CNNs

When a CNN uses strided convolution or max pooling to downsample, it can violate the sampling theorem if the feature maps contain frequencies above the new Nyquist rate. This causes aliasing: small shifts in the input can cause large changes in the output, hurting translation invariance. Anti-aliased CNNs (Zhang 2019) add a low-pass filter before downsampling.
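A 1D sketch of the idea, assuming NumPy. The binomial blur kernel follows the BlurPool recipe from Zhang (2019); the function names are illustrative. The test signal alternates $+1/-1$ (a frequency at the Nyquist limit), which naive stride-2 subsampling aliases to a constant:

```python
import numpy as np

def downsample_naive(x, stride=2):
    return x[::stride]

def downsample_antialiased(x, stride=2):
    blur = np.array([1.0, 2.0, 1.0]) / 4.0        # binomial low-pass kernel
    smoothed = np.convolve(x, blur, mode='same')  # remove high frequencies first
    return smoothed[::stride]

x = np.array([1.0, -1.0] * 8)       # oscillation at the Nyquist frequency
print(downsample_naive(x))          # all +1: the oscillation aliased to DC
print(downsample_antialiased(x))    # near zero (small boundary effects)
```

The extra cost of the fix is one small convolution per downsampling layer, which is cheap relative to the surrounding conv layers.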

Summary

  • An LTI system is completely characterized by its impulse response
  • Convolution computes the output of an LTI system given its impulse response and input
  • The Fourier transform converts convolution to multiplication: $O(N^2)$ becomes $O(N \log N)$ via FFT
  • The sampling theorem: sample at $\geq 2B$ to avoid aliasing
  • CNNs are banks of discrete convolutional filters
  • Many efficiency gains in ML come from the convolution theorem

Exercises

Exercise (Core)

Problem

A continuous signal has frequency content up to 8 kHz. What is the minimum sampling rate required to avoid aliasing? If you sample at 12 kHz instead, what happens?

Exercise (Core)

Problem

You want to convolve two sequences of lengths 100 and 50. Compare the computational cost of direct convolution vs FFT-based convolution.

Exercise (Advanced)

Problem

Explain why strided convolution in a CNN can violate the sampling theorem, and describe how anti-aliased downsampling fixes this. What is the computational cost of the fix?

References

Canonical:

  • Oppenheim & Willsky, Signals and Systems (1997), Chapters 2-5
  • Haykin & Van Veen, Signals and Systems (2003), Chapters 3-4

Current:

  • Zhang, "Making Convolutional Networks Shift-Invariant Again" (ICML 2019)
  • Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" (2022)

Last reviewed: April 2026
