
Foundations

Tensors and Tensor Operations

What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.

Core · Tier 1 · Stable · ~55 min

Why This Matters

Every neural network operates on tensors. A batch of images is an order-4 tensor (batch, channels, height, width). Attention scores are order-3 tensors. Weight matrices are order-2 tensors. Understanding what tensors are, both as mathematical objects and as computational objects, is a prerequisite to reading any ML paper or writing any training loop. The linear algebra background is covered in eigenvalues and eigenvectors.

Mental Model

A scalar is a single number. A vector is a list of numbers. A matrix is a grid of numbers. A tensor extends this pattern to arbitrary dimensions. But that description misses the point. A tensor is not just an array with more indices. It is a multilinear map that transforms in specific ways under changes of basis.

In ML practice, the distinction rarely matters because we work in fixed coordinate systems. But when you encounter differential geometry, general relativity, or continuum mechanics, the transformation rules are the entire point.

Formal Setup and Notation

Definition

Tensor (algebraic)

A tensor of order $p$ over vector spaces $V_1, \ldots, V_p$ is a multilinear map:

$$T: V_1^* \times \cdots \times V_k^* \times V_{k+1} \times \cdots \times V_p \to \mathbb{R}$$

where $V_i^*$ denotes the dual space of $V_i$. The tensor has $k$ contravariant indices and $p - k$ covariant indices. In coordinates, $T$ is represented by a $p$-dimensional array of components $T^{i_1 \ldots i_k}_{j_1 \ldots j_{p-k}}$.

Definition

Tensor order (rank)

The order of a tensor is the number of indices needed to specify a component. A scalar has order 0, a vector has order 1, a matrix has order 2, and so on. Some authors call this "rank," but that conflicts with the linear algebra notion of rank (dimension of the column space). To avoid confusion: "order" means number of indices, "rank" means something different (see tensor rank below).

Definition

Tensor contraction

Contraction sums over one upper and one lower index:

$$C(T)^{i_1 \ldots \hat{i}_k \ldots}_{j_1 \ldots \hat{j}_m \ldots} = \sum_{s} T^{i_1 \ldots s \ldots}_{j_1 \ldots s \ldots}$$

Matrix multiplication is contraction: $(AB)^i_j = \sum_k A^i_k B^k_j$. The trace is contraction of both indices: $\text{tr}(A) = \sum_i A^i_i$.

Einstein Summation Convention

Repeated indices (one upper, one lower) are implicitly summed:

$$C^i_j = A^i_k B^k_j \quad \text{means} \quad C^i_j = \sum_k A^i_k B^k_j$$

NumPy and PyTorch implement this with np.einsum and torch.einsum. The string notation maps directly: torch.einsum('ik,kj->ij', A, B) performs matrix multiplication.

Common einsum patterns:

  • Matrix multiply: ik,kj->ij
  • Batch matrix multiply: bik,bkj->bij
  • Trace: ii->
  • Outer product: i,j->ij
  • Attention scores: bqd,bkd->bqk
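These patterns can be checked directly in NumPy; a minimal sketch with arbitrary illustrative shapes, verifying each einsum string against its conventional equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Matrix multiply: contract over the shared index k
assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)

# Trace: a repeated index with empty output sums the diagonal
M = rng.standard_normal((4, 4))
assert np.isclose(np.einsum('ii->', M), np.trace(M))

# Outer product: no repeated indices, so both survive in the output
u, v = rng.standard_normal(3), rng.standard_normal(5)
assert np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v))

# Batch matrix multiply: the batch index b is carried through unsummed
X = rng.standard_normal((8, 3, 4))
Y = rng.standard_normal((8, 4, 5))
assert np.allclose(np.einsum('bik,bkj->bij', X, Y), X @ Y)
```

The same strings work unchanged with torch.einsum.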

Tensor Decompositions

Definition

CP decomposition

The Canonical Polyadic (CP) decomposition writes an order-3 tensor as a sum of RR rank-1 tensors (outer products of vectors):

$$T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \, b_{jr} \, c_{kr}$$

The minimal $R$ for exact decomposition is the tensor rank. Unlike matrix rank, tensor rank is NP-hard to compute, and the best rank-$R$ approximation may not exist (the set of tensors of rank at most $R$ is not closed).
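A small NumPy sketch of the CP form, building an order-3 tensor that has rank at most $R$ by construction (the factor matrices here are random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 4                              # illustrative CP rank
A = rng.standard_normal((5, R))
B = rng.standard_normal((6, R))
C = rng.standard_normal((7, R))

# T_ijk = sum_r A_ir * B_jr * C_kr: a sum of R rank-1 (outer-product) terms
T = np.einsum('ir,jr,kr->ijk', A, B, C)
assert T.shape == (5, 6, 7)

# Equivalent explicit sum over the R rank-1 terms
T_explicit = sum(np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
                 for r in range(R))
assert np.allclose(T, T_explicit)
```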

Definition

Tucker decomposition

The Tucker decomposition writes an order-3 tensor as:

$$T \approx G \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where $G$ is a smaller core tensor and the $U_i$ are factor matrices. This generalizes the SVD to higher orders. Unlike CP, Tucker allows interactions between components via the core tensor $G$.
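All three mode-$n$ products can be written as a single einsum; a minimal NumPy sketch with arbitrary illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 4))   # small core tensor
U1 = rng.standard_normal((10, 2))    # factor matrices, one per mode
U2 = rng.standard_normal((12, 3))
U3 = rng.standard_normal((14, 4))

# T_ijk = sum_{abc} G_abc * U1_ia * U2_jb * U3_kc
T = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
assert T.shape == (10, 12, 14)
```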

The Computational Tensor: ML Frameworks

In PyTorch and NumPy, a "tensor" is a multidimensional array with:

  • Shape: tuple of dimensions, e.g., (32, 3, 224, 224) for a batch of images
  • dtype: data type (float32, float16, bfloat16, etc.)
  • device: where the data lives (cpu, cuda:0)
  • Strides: number of elements to skip in memory per index increment

This is the physicist/mathematician definition stripped of transformation rules. The ML tensor does not transform covariantly or contravariantly. It is an array with named dimensions and a compute device.
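A quick NumPy illustration of these attributes. Note two framework differences: NumPy arrays have no device attribute (that is PyTorch-specific), and NumPy reports strides in bytes whereas torch.Tensor.stride() reports elements.

```python
import numpy as np

x = np.zeros((32, 3, 224, 224), dtype=np.float32)
print(x.shape)    # (32, 3, 224, 224)
print(x.dtype)    # float32
# Strides in bytes: last axis moves 4 bytes (one float32), each row 224*4, etc.
print(x.strides)  # (602112, 200704, 896, 4) for this contiguous array
```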

Broadcasting Rules

When operating on tensors of different shapes, broadcasting aligns dimensions from the right and expands size-1 dimensions:

  1. If tensors differ in number of dimensions, prepend size-1 dimensions to the smaller tensor.
  2. Dimensions of size 1 are stretched to match the other tensor.
  3. Dimensions must be equal or one of them must be 1.

Adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the vector across all 4 rows. Broadcasting avoids explicit copies, saving memory.
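A minimal NumPy check of the (4, 3) + (3,) example above:

```python
import numpy as np

M = np.arange(12, dtype=np.float64).reshape(4, 3)  # shape (4, 3)
v = np.array([10.0, 20.0, 30.0])                   # shape (3,)

# v is treated as shape (1, 3), then stretched across the 4 rows
out = M + v
assert out.shape == (4, 3)
assert np.allclose(out[0], [10.0, 21.0, 32.0])  # row 0 was [0, 1, 2]
```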

Main Theorems

Theorem

Universality of Tensor Product

Statement

For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f(v, w) = \tilde{f}(v \otimes w)$ for all $v \in V, w \in W$.

Intuition

The tensor product is the "freest" bilinear construction. Every bilinear map factors through it. This is why tensors appear everywhere: they are the canonical way to combine vector spaces while preserving linearity in each argument separately.

Proof Sketch

Define $\tilde{f}$ on elementary tensors by $\tilde{f}(v \otimes w) = f(v, w)$ and extend by linearity. Bilinearity of $f$ ensures this is well-defined (independent of how the tensor is written as a sum of elementary tensors). Uniqueness follows because elementary tensors span $V \otimes W$.

Why It Matters

This universal property is why tensors appear in physics (stress tensors, electromagnetic field tensor) and ML (multilinear maps between feature spaces). The tensor product is the right algebraic structure for anything multilinear.

Failure Mode

The universal property itself holds in any dimension, but in infinite dimensions the algebraic tensor product ignores topology and is too small for analysis; you need a completed tensor product (Hilbert-Schmidt operators, nuclear spaces). This matters for kernel methods in RKHS but not for standard neural networks.

Canonical Examples

Example

Attention as tensor contraction

In self-attention, the queries $Q$, keys $K$, and values $V$ are order-3 tensors of shape (batch, seq_len, d_model). The attention score computation is:

$$\text{scores}_{bqk} = \sum_d Q_{bqd} \, K_{bkd} / \sqrt{d}$$

In einsum: bqd,bkd->bqk. This is a batched contraction over the embedding dimension. The output is an order-3 tensor of attention logits.
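This contraction can be sketched in NumPy with arbitrary small shapes; the einsum is equivalent to a batched matmul against the transposed keys:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 8
Q = rng.standard_normal((batch, seq_len, d_model))
K = rng.standard_normal((batch, seq_len, d_model))

# Contract over the embedding dimension d, batched over b; scale by sqrt(d)
scores = np.einsum('bqd,bkd->bqk', Q, K) / np.sqrt(d_model)
assert scores.shape == (batch, seq_len, seq_len)

# Same result as a batched matmul with K transposed on its last two axes
assert np.allclose(scores, Q @ K.transpose(0, 2, 1) / np.sqrt(d_model))
```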

Example

CP decomposition for weight compression

A weight tensor $W$ of shape $(d_1, d_2, d_3)$ with $d_1 d_2 d_3$ parameters can be approximated by a rank-$R$ CP decomposition with $R(d_1 + d_2 + d_3)$ parameters. For $d_i = 100$ and $R = 10$, this reduces parameters from $10^6$ to $3{,}000$, a $333\times$ compression.
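The parameter counts are easy to verify:

```python
# Dense storage vs CP factors for d1 = d2 = d3 = 100, R = 10
d1 = d2 = d3 = 100
R = 10

dense_params = d1 * d2 * d3       # one entry per component of W
cp_params = R * (d1 + d2 + d3)    # R columns in each of the 3 factor matrices

assert dense_params == 10**6
assert cp_params == 3_000
assert round(dense_params / cp_params) == 333
```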

Common Confusions

Watch Out

Tensor order vs tensor rank

The order (number of indices) of a tensor is not the same as its rank (minimum number of rank-1 terms in a CP decomposition). A $3 \times 3$ matrix has order 2 but could have rank anywhere from 0 to 3. The ML community often says "rank" when it means "order." Watch for this.

Watch Out

ML tensors vs math tensors

A PyTorch tensor does not transform under change of basis. It is a multidimensional array, full stop. The mathematical tensor carries transformation rules (covariant or contravariant). In ML, this distinction almost never matters because we work in a fixed computational basis. If you move to differential geometry or physics, it matters a great deal.

Watch Out

Broadcasting is not free

Broadcasting creates no physical copies, but it does expand the computation. Adding a shape (1, 1000) tensor to a shape (1000, 1000) tensor performs $10^6$ additions, not $10^3$. Memory is saved; compute is not.
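One way to see that broadcasting is a view rather than a copy is to inspect the strides NumPy assigns (np.broadcast_to makes the expanded view explicit):

```python
import numpy as np

row = np.ones((1, 1000))
big = np.broadcast_to(row, (1000, 1000))  # read-only view, no copy made

# Stride 0 on the first axis: every "row" of big aliases the same 1000 numbers
assert big.strides[0] == 0
assert big.shape == (1000, 1000)

# But an elementwise op over the broadcast result still does 10**6 additions
total = big + np.zeros((1000, 1000))
assert total.shape == (1000, 1000)
```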

Summary

  • A tensor is a multilinear map. In coordinates, it is a multidimensional array that transforms according to specific rules under change of basis.
  • Order = number of indices. Rank = minimum CP decomposition terms.
  • Contraction generalizes matrix multiplication and trace.
  • Einstein summation (einsum) is the lingua franca for tensor operations.
  • CP and Tucker decompositions compress tensors, with applications in model compression and data analysis.
  • In ML frameworks, "tensor" means multidimensional array with shape, dtype, and device. No transformation rules are tracked.

Exercises

ExerciseCore

Problem

Write the einsum string for computing the squared Frobenius norm of a matrix $A$, i.e., $\|A\|_F^2 = \sum_{ij} A_{ij}^2$.

ExerciseAdvanced

Problem

An order-3 tensor $T \in \mathbb{R}^{100 \times 100 \times 100}$ has $10^6$ entries. Suppose its CP rank is $R = 5$. How many parameters does the CP decomposition use? By what factor is this compressed compared to storing $T$ directly?

References

Canonical:

  • Kolda & Bader, "Tensor Decompositions and Applications," SIAM Review 51(3), 2009
  • Lim, "Tensors and Hypermatrices," Chapter 15 in Handbook of Linear Algebra (2013)

Current:

  • Novikov et al., "Tensorizing Neural Networks," NeurIPS 2015
  • PyTorch documentation on torch.einsum


Last reviewed: April 2026
