
Foundations

Tensors and Tensor Operations

What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.

Core · Tier 1 · Stable · ~55 min

Why This Matters

Every neural network operates on tensors. A batch of images is an order-4 tensor (batch, channels, height, width). Attention scores are order-3 tensors. Weight matrices are order-2 tensors. Understanding what tensors are, both as mathematical objects and as computational objects, is a prerequisite to reading any ML paper or writing any training loop. The linear algebra background is covered in eigenvalues and eigenvectors.

Mental Model

A scalar is a single number. A vector is a list of numbers. A matrix is a grid of numbers. A tensor extends this pattern to arbitrary dimensions. But that description misses the point. A tensor is not just an array with more indices. It is a multilinear map that transforms in specific ways under changes of basis.

In ML practice, the distinction rarely matters because we work in fixed coordinate systems. But when you encounter differential geometry, general relativity, or continuum mechanics, the transformation rules are the entire point.

Formal Setup and Notation

Definition

Tensor (algebraic)

A tensor of order $p$ over vector spaces $V_1, \ldots, V_p$ is a multilinear map:

$$T: V_1^* \times \cdots \times V_k^* \times V_{k+1} \times \cdots \times V_p \to \mathbb{R}$$

where $V_i^*$ denotes the dual space of $V_i$. The tensor has $k$ contravariant indices and $p - k$ covariant indices. In coordinates, $T$ is represented by a $p$-dimensional array of components $T^{i_1 \ldots i_k}_{j_1 \ldots j_{p-k}}$.

Definition

Tensor order (rank)

The order of a tensor is the number of indices needed to specify a component. A scalar has order 0, a vector has order 1, a matrix has order 2, and so on. Some authors call this "rank," but that conflicts with the linear algebra notion of rank (dimension of the column space). To avoid confusion: "order" means number of indices, "rank" means something different (see tensor rank below).

Definition

Tensor contraction

Contraction sums over one upper and one lower index:

$$C(T)^{i_1 \ldots \hat{i}_k \ldots}_{j_1 \ldots \hat{j}_m \ldots} = \sum_{s} T^{i_1 \ldots s \ldots}_{j_1 \ldots s \ldots}$$

Matrix multiplication is contraction: $(AB)^i_j = \sum_k A^i_k B^k_j$. The trace is contraction of both indices: $\text{tr}(A) = \sum_i A^i_i$.

Einstein Summation Convention

Repeated indices (one upper, one lower) are implicitly summed:

$$C^i_j = A^i_k B^k_j \quad \text{means} \quad C^i_j = \sum_k A^i_k B^k_j$$

NumPy and PyTorch implement this with np.einsum and torch.einsum. The string notation maps directly: torch.einsum('ik,kj->ij', A, B) performs matrix multiplication.

Common einsum patterns:

  • Matrix multiply: ik,kj->ij
  • Batch matrix multiply: bik,bkj->bij
  • Trace: ii->
  • Outer product: i,j->ij
  • Attention scores: bqd,bkd->bqk
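These patterns can be checked directly in NumPy; a minimal sketch with arbitrary illustrative shapes, verifying each einsum string against its conventional equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Matrix multiply: contract over the shared index k
assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)

# Trace: a repeated index with empty output sums the diagonal
M = rng.standard_normal((4, 4))
assert np.isclose(np.einsum('ii->', M), np.trace(M))

# Outer product: no repeated indices, so both survive in the output
u, v = rng.standard_normal(3), rng.standard_normal(5)
assert np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v))

# Batch matrix multiply: the batch index b is carried through unsummed
X = rng.standard_normal((8, 3, 4))
Y = rng.standard_normal((8, 4, 5))
assert np.allclose(np.einsum('bik,bkj->bij', X, Y), X @ Y)
```

The same strings work unchanged with torch.einsum.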

Tensor Decompositions

Definition

CP decomposition

The Canonical Polyadic (CP) decomposition writes an order-3 tensor as a sum of RR rank-1 tensors (outer products of vectors):

$$T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \, b_{jr} \, c_{kr}$$

The minimal $R$ for exact decomposition is the tensor rank. Unlike matrix rank, tensor rank is NP-hard to compute, and the best rank-$R$ approximation may not exist (the set of tensors of rank at most $R$ is not closed).
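A small NumPy sketch of the CP form, building an order-3 tensor that has rank at most $R$ by construction (the factor matrices here are random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 4                              # illustrative CP rank
A = rng.standard_normal((5, R))
B = rng.standard_normal((6, R))
C = rng.standard_normal((7, R))

# T_ijk = sum_r A_ir * B_jr * C_kr: a sum of R rank-1 (outer-product) terms
T = np.einsum('ir,jr,kr->ijk', A, B, C)
assert T.shape == (5, 6, 7)

# Equivalent explicit sum over the R rank-1 terms
T_explicit = sum(np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
                 for r in range(R))
assert np.allclose(T, T_explicit)
```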

Definition

Tucker decomposition

The Tucker decomposition writes an order-3 tensor as:

$$T \approx G \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where $G$ is a smaller core tensor and the $U_i$ are factor matrices. This generalizes the SVD to higher orders. Unlike CP, Tucker allows interactions between components via the core tensor $G$.
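All three mode-$n$ products can be written as a single einsum; a minimal NumPy sketch with arbitrary illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 4))   # small core tensor
U1 = rng.standard_normal((10, 2))    # factor matrices, one per mode
U2 = rng.standard_normal((12, 3))
U3 = rng.standard_normal((14, 4))

# T_ijk = sum_{abc} G_abc * U1_ia * U2_jb * U3_kc
T = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
assert T.shape == (10, 12, 14)
```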

The Computational Tensor: ML Frameworks

In PyTorch and NumPy, a "tensor" is a multidimensional array with:

  • Shape: tuple of dimensions, e.g., (32, 3, 224, 224) for a batch of images
  • dtype: data type (float32, float16, bfloat16, etc.)
  • device: where the data lives (cpu, cuda:0)
  • Strides: number of elements to skip in memory per index increment

This is the physicist/mathematician definition stripped of transformation rules. The ML tensor does not transform covariantly or contravariantly. It is an array with named dimensions and a compute device.
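A quick NumPy illustration of these attributes. Note two framework differences: NumPy arrays have no device attribute (that is PyTorch-specific), and NumPy reports strides in bytes whereas torch.Tensor.stride() reports elements.

```python
import numpy as np

x = np.zeros((32, 3, 224, 224), dtype=np.float32)
print(x.shape)    # (32, 3, 224, 224)
print(x.dtype)    # float32
# Strides in bytes: last axis moves 4 bytes (one float32), each row 224*4, etc.
print(x.strides)  # (602112, 200704, 896, 4) for this contiguous array
```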

Broadcasting Rules

When operating on tensors of different shapes, broadcasting aligns dimensions from the right and expands size-1 dimensions:

  1. If tensors differ in number of dimensions, prepend size-1 dimensions to the smaller tensor.
  2. Dimensions of size 1 are stretched to match the other tensor.
  3. Dimensions must be equal or one of them must be 1.

Adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the vector across all 4 rows. Broadcasting avoids explicit copies, saving memory.
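A minimal NumPy check of the (4, 3) + (3,) example above:

```python
import numpy as np

M = np.arange(12, dtype=np.float64).reshape(4, 3)  # shape (4, 3)
v = np.array([10.0, 20.0, 30.0])                   # shape (3,)

# v is treated as shape (1, 3), then stretched across the 4 rows
out = M + v
assert out.shape == (4, 3)
assert np.allclose(out[0], [10.0, 21.0, 32.0])  # row 0 was [0, 1, 2]
```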

Main Theorems

Theorem

Universality of Tensor Product

Statement

For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f(v, w) = \tilde{f}(v \otimes w)$ for all $v \in V, w \in W$.

Intuition

The tensor product is the "freest" bilinear construction. Every bilinear map factors through it. This is why tensors appear everywhere: they are the canonical way to combine vector spaces while preserving linearity in each argument separately.

Proof Sketch

Define $\tilde{f}$ on elementary tensors by $\tilde{f}(v \otimes w) = f(v, w)$ and extend by linearity. Bilinearity of $f$ ensures this is well-defined (independent of how the tensor is written as a sum of elementary tensors). Uniqueness follows because elementary tensors span $V \otimes W$.

Why It Matters

This universal property is why tensors appear in physics (stress tensors, electromagnetic field tensor) and ML (multilinear maps between feature spaces). The tensor product is the right algebraic structure for anything multilinear.

Failure Mode

The universal property itself holds in any dimension, but in infinite dimensions the algebraic tensor product ignores topology and is too small for analysis; you need a completed tensor product (Hilbert-Schmidt operators, nuclear spaces). This matters for kernel methods in RKHS but not for standard neural networks.

Canonical Examples

Example

Attention as tensor contraction

In self-attention, the queries $Q$, keys $K$, and values $V$ are order-3 tensors of shape (batch, seq_len, d_model). The attention score computation is:

$$\text{scores}_{bqk} = \sum_d Q_{bqd} \, K_{bkd} / \sqrt{d}$$

In einsum: bqd,bkd->bqk. This is a batched contraction over the embedding dimension. The output is an order-3 tensor of attention logits.
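This contraction can be sketched in NumPy with arbitrary small shapes; the einsum is equivalent to a batched matmul against the transposed keys:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 8
Q = rng.standard_normal((batch, seq_len, d_model))
K = rng.standard_normal((batch, seq_len, d_model))

# Contract over the embedding dimension d, batched over b; scale by sqrt(d)
scores = np.einsum('bqd,bkd->bqk', Q, K) / np.sqrt(d_model)
assert scores.shape == (batch, seq_len, seq_len)

# Same result as a batched matmul with K transposed on its last two axes
assert np.allclose(scores, Q @ K.transpose(0, 2, 1) / np.sqrt(d_model))
```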

Example

CP decomposition for weight compression

A weight tensor $W$ of shape $(d_1, d_2, d_3)$ with $d_1 d_2 d_3$ parameters can be approximated by a rank-$R$ CP decomposition with $R(d_1 + d_2 + d_3)$ parameters. For $d_i = 100$ and $R = 10$, this reduces parameters from $10^6$ to $3{,}000$, a $333\times$ compression.
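The parameter counts are easy to verify:

```python
# Dense storage vs CP factors for d1 = d2 = d3 = 100, R = 10
d1 = d2 = d3 = 100
R = 10

dense_params = d1 * d2 * d3       # one entry per component of W
cp_params = R * (d1 + d2 + d3)    # R columns in each of the 3 factor matrices

assert dense_params == 10**6
assert cp_params == 3_000
assert round(dense_params / cp_params) == 333
```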

Common Confusions

Watch Out

Tensor order vs tensor rank

The order (number of indices) of a tensor is not the same as its rank (minimum number of rank-1 terms in a CP decomposition). A $3 \times 3$ matrix has order 2 but could have rank anywhere from 0 to 3. The ML community often says "rank" when it means "order." Watch for this.

Watch Out

ML tensors vs math tensors

A PyTorch tensor does not transform under change of basis. It is a multidimensional array, full stop. The mathematical tensor carries transformation rules (covariant or contravariant). In ML, this distinction almost never matters because we work in a fixed computational basis. If you move to differential geometry or physics, it matters a great deal.

Watch Out

Broadcasting is not free

Broadcasting creates no physical copies, but it does expand the computation. Adding a shape (1, 1000) tensor to a shape (1000, 1000) tensor performs $10^6$ additions, not $10^3$. Memory is saved; compute is not.
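One way to see that broadcasting is a view rather than a copy is to inspect the strides NumPy assigns (np.broadcast_to makes the expanded view explicit):

```python
import numpy as np

row = np.ones((1, 1000))
big = np.broadcast_to(row, (1000, 1000))  # read-only view, no copy made

# Stride 0 on the first axis: every "row" of big aliases the same 1000 numbers
assert big.strides[0] == 0
assert big.shape == (1000, 1000)

# But an elementwise op over the broadcast result still does 10**6 additions
total = big + np.zeros((1000, 1000))
assert total.shape == (1000, 1000)
```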

Summary

  • A tensor is a multilinear map. In coordinates, it is a multidimensional array that transforms according to specific rules under change of basis.
  • Order = number of indices. Rank = minimum CP decomposition terms.
  • Contraction generalizes matrix multiplication and trace.
  • Einstein summation (einsum) is the lingua franca for tensor operations.
  • CP and Tucker decompositions compress tensors, with applications in model compression and data analysis.
  • In ML frameworks, "tensor" means multidimensional array with shape, dtype, and device. No transformation rules are tracked.

Exercises

ExerciseCore

Problem

Write the einsum string for computing the squared Frobenius norm of a matrix $A$, i.e., $\|A\|_F^2 = \sum_{ij} A_{ij}^2$.

ExerciseAdvanced

Problem

An order-3 tensor $T \in \mathbb{R}^{100 \times 100 \times 100}$ has $10^6$ entries. Suppose its CP rank is $R = 5$. How many parameters does the CP decomposition use? By what factor is this compressed compared to storing $T$ directly?

References

Canonical:

  • Kolda & Bader, "Tensor Decompositions and Applications," SIAM Review 51(3), 2009
  • Lim, "Tensors and Hypermatrices," Chapter 15 in Handbook of Linear Algebra (2013)

Current:

  • Novikov et al., "Tensorizing Neural Networks," NeurIPS 2015
  • PyTorch documentation on torch.einsum


Last reviewed: April 2026
