Foundations
Tensors and Tensor Operations
What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
Why This Matters
Every neural network operates on tensors. A batch of images is an order-4 tensor (batch, channels, height, width). Attention scores form an order-3 tensor. Weight matrices are order-2 tensors. Understanding what tensors are, both the mathematical object and the computational object, is prerequisite to reading any ML paper or writing any training loop. The prerequisite linear algebra is covered in eigenvalues and eigenvectors.
Mental Model
A scalar is a single number. A vector is a list of numbers. A matrix is a grid of numbers. A tensor extends this pattern to arbitrary dimensions. But that description misses the point. A tensor is not just an array with more indices. It is a multilinear map that transforms in specific ways under changes of basis.
In ML practice, the distinction rarely matters because we work in fixed coordinate systems. But when you encounter differential geometry, general relativity, or continuum mechanics, the transformation rules are the entire point.
Formal Setup and Notation
Tensor (algebraic)
A tensor of order $(p, q)$ over a vector space $V$ is a multilinear map:

$$T: \underbrace{V^* \times \cdots \times V^*}_{p} \times \underbrace{V \times \cdots \times V}_{q} \to \mathbb{R},$$

where $V^*$ denotes the dual space of $V$. The tensor has $p$ contravariant indices and $q$ covariant indices. In coordinates, $T$ is represented by a $(p+q)$-dimensional array of components $T^{i_1 \cdots i_p}{}_{j_1 \cdots j_q}$.
Tensor order (rank)
The order of a tensor is the number of indices needed to specify a component. A scalar has order 0, a vector has order 1, a matrix has order 2, and so on. Some authors call this "rank," but that conflicts with the linear algebra notion of rank (dimension of the column space). To avoid confusion: "order" means number of indices, "rank" means something different (see tensor rank below).
Tensor contraction
Contraction sums over one upper and one lower index:

$$C^i{}_k = T^{ij}{}_{jk} = \sum_j T^{ij}{}_{jk}.$$

Matrix multiplication is contraction: $(AB)^i{}_k = A^i{}_j B^j{}_k$. The trace is contraction of both indices: $\operatorname{tr}(A) = A^i{}_i$.
Einstein Summation Convention
Repeated indices (one upper, one lower) are implicitly summed: $A^i{}_j x^j$ means $\sum_j A^i{}_j x^j$.

NumPy and PyTorch implement this with `np.einsum` and `torch.einsum`. The string notation maps directly: `torch.einsum('ik,kj->ij', A, B)` performs matrix multiplication.
Common einsum patterns:
- Matrix multiply: `ik,kj->ij`
- Batch matrix multiply: `bik,bkj->bij`
- Trace: `ii->`
- Outer product: `i,j->ij`
- Attention scores: `bqd,bkd->bqk`
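These patterns can be checked directly with NumPy's `einsum` (the same strings work with `torch.einsum`); all shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 6))
X = rng.standard_normal((2, 4, 5))   # batch of matrices
Y = rng.standard_normal((2, 5, 6))

# Matrix multiply: contract over the shared index k
assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)

# Batch matrix multiply: the batch index b is carried through unsummed
assert np.allclose(np.einsum('bik,bkj->bij', X, Y), X @ Y)

# Trace: contract both indices of a square matrix
M = rng.standard_normal((5, 5))
assert np.isclose(np.einsum('ii->', M), np.trace(M))

# Outer product: no repeated index, so nothing is summed
u, v = rng.standard_normal(3), rng.standard_normal(4)
assert np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v))

# Attention scores: contract queries and keys over the embedding dim d
Q = rng.standard_normal((2, 7, 8))   # (batch, query_len, d_model)
K = rng.standard_normal((2, 9, 8))   # (batch, key_len,   d_model)
S = np.einsum('bqd,bkd->bqk', Q, K)
assert S.shape == (2, 7, 9)          # one score per (query, key) pair
```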
Tensor Decompositions
CP decomposition
The Canonical Polyadic (CP) decomposition writes an order-3 tensor as a sum of $R$ rank-1 tensors (outer products of vectors):

$$\mathcal{T} = \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r, \qquad \mathcal{T}_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}.$$

The minimal $R$ for exact decomposition is the tensor rank. Unlike matrix rank, tensor rank is NP-hard to compute, and the best rank-$R$ approximation may not exist (the set of tensors of rank at most $R$ is not closed).
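A minimal NumPy sketch of the CP form (the shapes and rank here are illustrative): build an order-3 tensor from factor matrices and confirm it equals the explicit sum of outer products. Fitting the factors to a given tensor is the hard part; libraries such as TensorLy implement algorithms like alternating least squares for that.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A = rng.standard_normal((I, R))   # column r of A, B, C holds a_r, b_r, c_r
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# T_ijk = sum_r A_ir * B_jr * C_kr  (a tensor of CP rank at most R)
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# The same tensor written as an explicit sum of R rank-1 outer products
T_explicit = sum(
    np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r]) for r in range(R)
)
assert np.allclose(T, T_explicit)
```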
Tucker decomposition
The Tucker decomposition writes an order-3 tensor as:

$$\mathcal{T}_{ijk} = \sum_{p,q,r} \mathcal{G}_{pqr}\, A_{ip}\, B_{jq}\, C_{kr},$$

where $\mathcal{G}$ is a smaller core tensor and $A$, $B$, $C$ are factor matrices. This generalizes the SVD to higher orders. Unlike CP, Tucker allows interactions between components via the core tensor $\mathcal{G}$.
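The Tucker form is one einsum away in NumPy (dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 6, 7, 8        # tensor dimensions
P, Q, R = 2, 3, 4        # core dimensions (smaller than I, J, K)
G = rng.standard_normal((P, Q, R))   # core tensor: mixes all component triples
A = rng.standard_normal((I, P))      # factor matrices
B = rng.standard_normal((J, Q))
C = rng.standard_normal((K, R))

# T_ijk = sum_{p,q,r} G_pqr * A_ip * B_jq * C_kr
T = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
assert T.shape == (I, J, K)
```

CP is the special case where the core is superdiagonal ($\mathcal{G}_{pqr}$ nonzero only when $p = q = r$), which is why Tucker admits richer cross-component interactions.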
The Computational Tensor: ML Frameworks
In PyTorch and NumPy, a "tensor" is a multidimensional array with:
- Shape: tuple of dimensions, e.g.,
(32, 3, 224, 224)for a batch of images - dtype: data type (
float32,float16,bfloat16, etc.) - device: where the data lives (
cpu,cuda:0) - Strides: number of elements to skip in memory per index increment
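These attributes are directly inspectable. A NumPy sketch (note one caveat: NumPy reports strides in bytes, PyTorch in elements, and plain NumPy arrays have no `device` attribute):

```python
import numpy as np

x = np.zeros((32, 3, 224, 224), dtype=np.float32)  # a batch of images

print(x.shape)    # (32, 3, 224, 224)
print(x.dtype)    # float32

# Strides in BYTES per index step for this contiguous float32 array:
# innermost axis moves 4 bytes, next 224*4, next 224*224*4, next 3*224*224*4
print(x.strides)  # (602112, 200704, 896, 4)
```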
This is the physicist/mathematician definition stripped of transformation rules. The ML tensor does not transform covariantly or contravariantly. It is an array with named dimensions and a compute device.
Broadcasting Rules
When operating on tensors of different shapes, broadcasting aligns dimensions from the right and expands size-1 dimensions:
- If tensors differ in number of dimensions, prepend size-1 dimensions to the smaller tensor.
- Dimensions of size 1 are stretched to match the other tensor.
- Dimensions must be equal or one of them must be 1.
Adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the
vector across all 4 rows. Broadcasting avoids explicit copies, saving memory.
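The (4, 3) + (3,) example from the text, sketched in NumPy, along with a shape combination that broadcasting rejects:

```python
import numpy as np

M = np.arange(12.0).reshape(4, 3)   # shape (4, 3)
v = np.array([10.0, 20.0, 30.0])    # shape (3,)

# v is treated as shape (1, 3), then stretched across the 4 rows
out = M + v
assert out.shape == (4, 3)
assert np.allclose(out[0], M[0] + v)
assert np.allclose(out[3], M[3] + v)

# (4, 3) vs (4,) fails: aligned from the right, 3 != 4 and neither is 1
try:
    M + np.zeros(4)
except ValueError:
    pass
```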
Main Theorems
Universality of Tensor Product
Statement
For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $\tilde{f}(v \otimes w) = f(v, w)$ for all $v \in V$, $w \in W$.
Intuition
The tensor product is the "freest" bilinear construction. Every bilinear map factors through it. This is why tensors appear everywhere: they are the canonical way to combine vector spaces while preserving linearity in each argument separately.
Proof Sketch
Define $\tilde{f}$ on elementary tensors by $\tilde{f}(v \otimes w) = f(v, w)$ and extend by linearity. Bilinearity of $f$ ensures this is well-defined (independent of how a tensor is written as a sum of elementary tensors). Uniqueness follows because elementary tensors span $V \otimes W$.
Why It Matters
This universal property is why tensors appear in physics (stress tensors, electromagnetic field tensor) and ML (multilinear maps between feature spaces). The tensor product is the right algebraic structure for anything multilinear.
Failure Mode
The subtlety arrives in infinite dimensions: there the algebraic tensor product is too small for analytic purposes, and you need a completed tensor product (Hilbert-Schmidt operators, nuclear spaces). This matters for kernel methods in an RKHS but not for standard neural networks.
Canonical Examples
Attention as tensor contraction
In self-attention, queries $Q$, keys $K$, and values $V$ are order-3 tensors of shape (batch, seq_len, d_model). The attention score computation is:

$$S_{bqk} = \sum_{d} Q_{bqd}\, K_{bkd}.$$

In einsum: `bqd,bkd->bqk`. This is a batched contraction over the embedding dimension. The output is an order-3 tensor of attention logits.
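A minimal NumPy sketch of the score contraction (unscaled logits only; a full attention layer would also divide by $\sqrt{d}$, apply softmax, and contract with the values):

```python
import numpy as np

rng = np.random.default_rng(3)
batch, seq_len, d_model = 2, 5, 16
Q = rng.standard_normal((batch, seq_len, d_model))
K = rng.standard_normal((batch, seq_len, d_model))

# Batched contraction over the embedding dimension d
scores = np.einsum('bqd,bkd->bqk', Q, K)
assert scores.shape == (batch, seq_len, seq_len)

# Equivalent batched-matmul formulation: Q @ K^T per batch element
assert np.allclose(scores, Q @ K.transpose(0, 2, 1))
```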
CP decomposition for weight compression
A weight tensor of shape $d_1 \times d_2 \times d_3$ with $d_1 d_2 d_3$ parameters can be approximated by a rank-$R$ CP decomposition with $R(d_1 + d_2 + d_3)$ parameters. When $R \ll d_i$ the savings are large: for instance (illustrative values), $d_1 = d_2 = d_3 = 256$ and $R = 32$ reduce the count from $256^3 \approx 1.7 \times 10^7$ to $32 \cdot 768 = 24{,}576$, a roughly $680\times$ compression.
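The parameter counting is easy to check in a few lines (the dimensions and rank below are illustrative choices, not values from the text):

```python
d1 = d2 = d3 = 256   # illustrative weight-tensor dimensions
R = 32               # illustrative CP rank

dense_params = d1 * d2 * d3        # storing the tensor directly
cp_params = R * (d1 + d2 + d3)     # R columns in each of the 3 factor matrices

print(dense_params)                # 16777216
print(cp_params)                   # 24576
print(dense_params / cp_params)    # ~682.7
```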
Common Confusions
Tensor order vs tensor rank
The order (number of indices) of a tensor is not the same as its rank (minimum number of rank-1 terms in a CP decomposition). A $3 \times 3$ matrix has order 2 but could have rank anywhere from 0 to 3. The ML community often says "rank" when they mean "order." Watch for this.
ML tensors vs math tensors
A PyTorch tensor does not transform under change of basis. It is a multidimensional array, full stop. The mathematical tensor carries transformation rules (covariant or contravariant). In ML, this distinction almost never matters because we work in a fixed computational basis. If you move to differential geometry or physics, it matters a great deal.
Broadcasting is not free
Broadcasting creates no physical copies, but it does expand the computation.
Adding a shape (1, 1000) tensor to a shape (1000, 1000) tensor performs $10^6$ additions, not $10^3$. Memory is saved, compute is not.
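A NumPy sketch of this asymmetry: the broadcast is a zero-stride view (no copy), but the elementwise operation still touches every output element.

```python
import numpy as np

row = np.ones((1, 1000))
M = np.ones((1000, 1000))

# broadcast_to produces a VIEW: stride 0 along the stretched axis, no copy
view = np.broadcast_to(row, (1000, 1000))
assert view.strides[0] == 0        # rows revisit the same 1000 numbers
assert view.base is not None       # shares memory with `row`

# But the addition still performs 1000 * 1000 = 10**6 scalar additions
out = M + row
assert out.shape == (1000, 1000)
```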
Summary
- A tensor is a multilinear map. In coordinates, it is a multidimensional array that transforms according to specific rules under change of basis.
- Order = number of indices. Rank = minimum CP decomposition terms.
- Contraction generalizes matrix multiplication and trace.
- Einstein summation (`einsum`) is the lingua franca for tensor operations.
- CP and Tucker decompositions compress tensors, with applications in model compression and data analysis.
- In ML frameworks, "tensor" means multidimensional array with shape, dtype, and device. No transformation rules are tracked.
Exercises
Problem
Write the einsum string for computing the Frobenius norm squared of a matrix $A$, i.e., $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$.
Problem
An order-3 tensor $\mathcal{T}$ of shape $n \times n \times n$ has $n^3$ entries. Suppose its CP rank is $R$. How many parameters does the CP decomposition use? By what factor is this compressed compared to storing $\mathcal{T}$ directly?
References
Canonical:
- Kolda & Bader, "Tensor Decompositions and Applications," SIAM Review 51(3), 2009
- Lim, "Tensors and Hypermatrices," Chapter 15 in Handbook of Linear Algebra (2013)
Current:
- Novikov et al., "Tensorizing Neural Networks," NeurIPS 2015
- PyTorch documentation on `torch.einsum`
Next Topics
From tensors, the natural continuations are:
- Feedforward networks and backpropagation: neural networks as compositions of tensor operations
- Principal component analysis: the SVD is the order-2 analog of tensor decomposition
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)