
Algorithms Foundations

Fast Fourier Transform

The Cooley-Tukey FFT reduces the discrete Fourier transform from O(n^2) to O(n log n), enabling efficient convolution, spectral methods, and Fourier features for kernel approximation.


Why This Matters

The FFT is one of the most important algorithms in all of computational science. For ML specifically: convolutions in CNNs can be computed via FFT, spectral clustering uses eigenvalues of matrices constructed with Fourier-like tools, random Fourier features approximate kernel functions by sampling from the Fourier transform of the kernel, and some efficient attention mechanisms replace quadratic dot-product attention with FFT-based operations.

The FFT takes O(n^2) and makes it O(n \log n). Any time you see convolution, correlation, or spectral analysis, the FFT is doing the work underneath.

The Discrete Fourier Transform

Definition

Discrete Fourier Transform

Given a sequence x_0, x_1, \ldots, x_{n-1} \in \mathbb{C}, the DFT produces a sequence X_0, X_1, \ldots, X_{n-1} defined by:

X_k = \sum_{j=0}^{n-1} x_j \, e^{-2\pi i \, jk/n}, \quad k = 0, 1, \ldots, n-1

The inverse DFT recovers x from X:

x_j = \frac{1}{n} \sum_{k=0}^{n-1} X_k \, e^{2\pi i \, jk/n}

The DFT decomposes a signal into its frequency components. Each X_k measures how much the signal oscillates at frequency k/n. The naive computation of all n values of X_k requires n multiplications each, totaling O(n^2).
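The definition translates directly into code. Here is a minimal NumPy sketch (the function name naive_dft is my own), useful as a reference implementation to test any FFT against:

```python
import numpy as np

def naive_dft(x):
    """Direct O(n^2) DFT from the definition: X_k = sum_j x_j e^{-2pi i jk/n}."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    j = np.arange(n)
    k = j.reshape(-1, 1)                   # output indices as a column vector
    W = np.exp(-2j * np.pi * j * k / n)    # DFT matrix: W[k, j] = e^{-2pi i jk/n}
    return W @ x

X = naive_dft([1.0, 0.0, 1.0, 0.0])        # agrees with np.fft.fft
```

Building the full n-by-n matrix W makes the O(n^2) cost explicit: one multiplication per (j, k) pair.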

Definition

Roots of Unity

A primitive n-th root of unity is \omega_n = e^{-2\pi i/n}. The DFT can be written as X_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}. The key algebraic properties: \omega_n^n = 1 and \omega_n^{n/2} = -1 (when n is even). These symmetries are what make the FFT possible.
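These identities are easy to confirm numerically; a quick NumPy sanity check:

```python
import numpy as np

n = 8
w = np.exp(-2j * np.pi / n)            # omega_n, a primitive n-th root of unity
assert np.isclose(w ** n, 1)           # omega_n^n = 1: a full turn around the circle
assert np.isclose(w ** (n // 2), -1)   # omega_n^{n/2} = -1: half a turn
assert np.isclose(w ** 2, np.exp(-2j * np.pi / (n // 2)))  # omega_n^2 = omega_{n/2}
```

The last assertion is the identity the FFT derivation below leans on: squaring the root of unity for size n gives the root of unity for size n/2.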

The Cooley-Tukey FFT

The idea: split the DFT into two half-sized DFTs by separating even-indexed and odd-indexed terms.

Write x_j with even indices j = 2m and odd indices j = 2m + 1:

X_k = \sum_{m=0}^{n/2-1} x_{2m} \, \omega_n^{2mk} + \omega_n^k \sum_{m=0}^{n/2-1} x_{2m+1} \, \omega_n^{2mk}

Since \omega_n^2 = \omega_{n/2}, each sum is a DFT of size n/2. Let E_k be the DFT of the even-indexed elements and O_k the DFT of the odd-indexed elements. Then, for k = 0, \ldots, n/2 - 1:

X_k = E_k + \omega_n^k O_k, \quad X_{k+n/2} = E_k - \omega_n^k O_k

This is the butterfly operation. Two DFTs of size n/2 plus O(n) combining work.
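The recursion fits in a few lines. A sketch of the radix-2 algorithm in NumPy (fft_recursive is an illustrative name; the input length is assumed to be a power of 2):

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT. Requires len(x) to be a power of 2."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x                           # DFT of a single point is itself
    E = fft_recursive(x[0::2])             # DFT of even-indexed elements
    O = fft_recursive(x[1::2])             # DFT of odd-indexed elements
    k = np.arange(n // 2)
    w = np.exp(-2j * np.pi * k / n)        # twiddle factors omega_n^k
    return np.concatenate([E + w * O,      # X_k           for 0 <= k < n/2
                           E - w * O])     # X_{k + n/2}   (same E, O reused)
```

Production FFTs (FFTW, numpy.fft) are iterative and work in place, but the butterfly structure is the same.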

Theorem

FFT Complexity

Statement

The Cooley-Tukey FFT computes the DFT of n complex numbers in O(n \log n) arithmetic operations.

Intuition

Each level of recursion halves the problem size and does O(n) work for the butterfly combines. There are \log_2 n levels, so the total work is O(n \log n).

Proof Sketch

The recurrence is T(n) = 2T(n/2) + O(n). By the Master theorem with a = 2, b = 2, d = 1: since \log_2 2 = 1 = d, we get T(n) = O(n \log n).

Why It Matters

Reducing O(n^2) to O(n \log n) is the difference between feasible and infeasible for large signals. For n = 10^6, this is a speedup factor of roughly 50,000. Without the FFT, digital signal processing, medical imaging, and spectral methods in ML would be impractical.

Failure Mode

The standard radix-2 Cooley-Tukey requires n to be a power of 2. For other sizes, you can pad with zeros (which changes the implicit periodicity) or use mixed-radix FFT variants. Numerical precision also degrades slightly for very large n due to accumulated floating-point error in the twiddle factors \omega_n^k.

The Convolution Theorem

This is the reason the FFT is so important for ML.

Definition

Circular Convolution

The circular convolution of two length-n sequences x and y is:

(x * y)_k = \sum_{j=0}^{n-1} x_j \, y_{(k-j) \bmod n}

Naive computation: O(n^2).
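For concreteness, a direct transcription of the definition (the name circular_conv is my own):

```python
import numpy as np

def circular_conv(x, y):
    """Direct circular convolution from the definition: O(n^2).

    Indices into y wrap around modulo n, which is what makes
    the convolution 'circular'.
    """
    n = len(x)
    return np.array([sum(x[j] * y[(k - j) % n] for j in range(n))
                     for k in range(n)])
```

The two nested loops (one explicit, one inside sum) are the O(n^2) cost that the convolution theorem eliminates.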

Theorem

Convolution Theorem

Statement

Let X = \text{DFT}(x) and Y = \text{DFT}(y). Then:

\text{DFT}(x * y) = X \odot Y

where \odot denotes pointwise multiplication. Equivalently, x * y = \text{IDFT}(X \odot Y).

Intuition

Convolution in the time domain becomes pointwise multiplication in the frequency domain. This converts an O(n^2) operation into three O(n \log n) FFTs plus O(n) pointwise multiplications.

Proof Sketch

Direct computation. Write out \text{DFT}(x * y)_k, substitute the convolution definition, swap the order of summation, reindex the inner sum using periodicity to obtain Y_k, and factor out X_k.
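Spelled out: substituting the convolution definition and swapping sums, with \omega_n^{tk} = \omega_n^{jk} \, \omega_n^{(t-j)k},

\text{DFT}(x * y)_k = \sum_{j=0}^{n-1} x_j \, \omega_n^{jk} \sum_{t=0}^{n-1} y_{(t-j) \bmod n} \, \omega_n^{(t-j)k}

Reindexing the inner sum with s = (t - j) \bmod n (valid because \omega_n^{nk} = 1, so the summand depends only on t - j modulo n) gives

\text{DFT}(x * y)_k = \Big( \sum_{j=0}^{n-1} x_j \, \omega_n^{jk} \Big) \Big( \sum_{s=0}^{n-1} y_s \, \omega_n^{sk} \Big) = X_k \, Y_k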

Why It Matters

This is the workhorse behind efficient convolution in signal processing and CNNs. Any convolution of a filter of length m with a signal of length n can be done in O(n \log n) instead of O(nm) by zero-padding to length n + m - 1 and using FFTs.

Failure Mode

The theorem applies to circular convolution. For linear convolution (which is what CNNs actually compute), you must zero-pad the inputs to avoid wraparound artifacts. Also, for small filter sizes m \ll n, direct convolution at O(nm) may beat FFT-based convolution at O(n \log n) due to FFT overhead.
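Putting the theorem and the padding caveat together, a sketch of linear convolution via FFT (fft_linear_conv is an illustrative name; np.fft.fft(a, size) zero-pads the input to the requested length):

```python
import numpy as np

def fft_linear_conv(x, y):
    """Linear convolution via the convolution theorem.

    Zero-padding both inputs to length n + m - 1 makes the circular
    convolution computed by the FFT equal to the linear one.
    """
    n, m = len(x), len(y)
    size = n + m - 1
    X = np.fft.fft(x, size)                # zero-pads x to `size`
    Y = np.fft.fft(y, size)
    return np.real(np.fft.ifft(X * Y))     # real inputs -> real result

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.5])
# matches np.convolve(x, y): [0., 1., 2.5, 4., 1.5]
```

Skipping the padding (using size = n) would wrap the tail of the result around to the front, which is exactly the wraparound artifact described above.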

Connections to ML

CNNs: Convolution layers apply filters to input feature maps. For large spatial dimensions, FFT-based convolution is faster than direct convolution. Libraries like cuDNN choose between direct and FFT implementations based on filter and input size.

Fourier features: The random Fourier features method (Rahimi and Recht, 2007) approximates a shift-invariant kernel k(x - y) by sampling from the Fourier transform of k. This relies on Bochner's theorem: a continuous shift-invariant positive definite kernel is the Fourier transform of a non-negative measure.
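A minimal sketch of this construction for the Gaussian (RBF) kernel, assuming the standard cosine feature map from Rahimi and Recht; the function name and parameters are illustrative:

```python
import numpy as np

def rff_features(X, n_features=500, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2).

    By Bochner's theorem this kernel is the Fourier transform of a
    Gaussian measure, so we sample frequencies w ~ N(0, 2*gamma*I)
    and phases b ~ Uniform[0, 2pi].
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# With Z = rff_features(X), the inner product Z @ Z.T approximates the
# exact kernel matrix; the error shrinks like O(1/sqrt(n_features)).
```

The payoff: a kernel method on the explicit features Z costs O(n \cdot n_features) instead of the O(n^2) needed to form the full kernel matrix.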

Efficient attention: Several methods (FNet, for example) replace the self-attention mechanism with Fourier transforms along the sequence dimension, achieving O(n \log n) complexity instead of O(n^2) for sequence length n.

Spectral methods: Graph neural networks and spectral clustering use eigendecompositions and Fourier transforms on graphs. The graph Fourier transform generalizes the classical DFT to non-Euclidean domains.

Common Confusions

Watch Out

The FFT computes the DFT; it is not a different transform

The FFT is an algorithm for computing the DFT. The DFT is the mathematical transform. The FFT and naive DFT produce exactly the same output. The FFT just does it faster.

Watch Out

Circular vs linear convolution

The convolution theorem gives circular convolution, not linear. In CNNs and signal processing, we typically want linear convolution. You must zero-pad to at least length n + m - 1 before applying the FFT to get the correct linear convolution result.

Watch Out

FFT is not always faster for small filters

For a 3x3 convolution filter on a 224x224 image (common in CNNs), direct convolution is faster than FFT. The FFT advantage kicks in for larger filters or when computing many convolutions simultaneously.

Key Takeaways

  • DFT: transforms the time domain to the frequency domain. Naive cost: O(n^2)
  • FFT (Cooley-Tukey): computes the DFT in O(n \log n) by divide and conquer on even/odd indices
  • Convolution theorem: convolution becomes pointwise multiply in frequency domain
  • For ML: efficient convolution, random Fourier features, efficient attention, spectral methods
  • Practical consideration: FFT beats direct convolution only when filter or signal size is large enough

Exercises

ExerciseCore

Problem

Compute the DFT of the sequence x = (1, 0, 1, 0) by hand, using X_k = \sum_{j=0}^{3} x_j e^{-2\pi i \, jk/4} for k = 0, 1, 2, 3.

ExerciseAdvanced

Problem

You need to convolve a signal of length n = 2^{20} \approx 10^6 with a filter of length m = 2^{10} \approx 10^3. Compare the operation counts of direct convolution vs FFT-based convolution.

References

Canonical:

  • Cooley & Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series" (1965)
  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS), Chapter 30

Current:

  • Oppenheim & Willsky, Signals and Systems, Chapter 8
  • Rahimi & Recht, "Random Features for Large-Scale Kernel Machines" (2007)

Next Topics

  • Efficient attention variants: some use FFT to avoid quadratic attention cost
  • Convolutional neural networks: where convolution meets deep learning

Last reviewed: April 2026
