Beyond LLMs
Equivariant Deep Learning
Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.
Why This Matters
A CNN detects a cat whether it appears on the left or right of the image. This is translation equivariance: shifting the input shifts the feature maps by the same amount. The CNN does not need to learn the cat pattern separately for each position because weight sharing enforces the symmetry.
Equivariant deep learning generalizes this idea to arbitrary symmetries. If your data has rotational symmetry (molecular structures, satellite imagery), permutation symmetry (sets, graphs, point clouds), or gauge symmetry (physical fields), you can build networks that respect these symmetries by construction. The payoff: fewer parameters, less training data, better generalization.
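The translation-equivariance claim can be checked directly. A minimal numpy sketch (circular 1-D convolution with a made-up filter, for illustration only): shifting the input and then convolving gives the same result as convolving and then shifting the output.

```python
import numpy as np

# Circular 1-D convolution: y[i] = sum_k w[k] * x[(i + k) % n].
def circ_conv(x, w):
    n = len(x)
    return np.array([sum(w[k] * x[(i + k) % n] for k in range(len(w)))
                     for i in range(n)])

x = np.random.randn(8)
w = np.array([0.5, -1.0, 0.25])   # hypothetical filter weights

shift = 3
shifted_then_conv = circ_conv(np.roll(x, shift), w)
conv_then_shifted = np.roll(circ_conv(x, w), shift)

# Equivariance: the two orders of operations agree exactly.
assert np.allclose(shifted_then_conv, conv_then_shifted)
```

The same weight vector `w` is applied at every position; that weight sharing is what makes the identity hold for every input, not just on average.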
This is the core idea of geometric deep learning (Bronstein et al., 2021): most successful architectures can be understood as equivariant networks for specific symmetry groups.
Core Definitions
Group Action
A group G acts on a space X through a map (g, x) ↦ g · x satisfying e · x = x (identity) and g · (h · x) = (gh) · x (composition). Examples: the translation group acts on images by shifting. The rotation group SO(3) acts on 3D point clouds by rotating. The permutation group S_n acts on sets by reordering.
Equivariance
Statement
A function f : X → Y is equivariant with respect to a group G acting on X and Y if:

f(g · x) = g · f(x) for all g ∈ G and x ∈ X.

Transforming the input, then applying f, gives the same result as applying f, then transforming the output. The function "commutes" with the group action.
Invariance is the special case where the action on the output is trivial: f(g · x) = f(x) for all g ∈ G. The output does not change at all.
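The definitions can be verified numerically. A small sketch for the permutation group S_5 acting on vectors by reordering coordinates: the sum is invariant (ignores ordering), while an elementwise ReLU is equivariant (commutes with the reordering).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
perm = rng.permutation(5)          # a random element of S_5

# Invariance: f(g . x) = f(x). The sum ignores ordering entirely.
assert np.isclose(np.sum(x[perm]), np.sum(x))

# Equivariance: f(g . x) = g . f(x). An elementwise map commutes
# with any permutation of the coordinates.
relu = lambda v: np.maximum(v, 0.0)
assert np.allclose(relu(x[perm]), relu(x)[perm])
```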
Intuition
An equivariant function preserves the structure of transformations. If you rotate a molecule 90 degrees and then predict its energy, you should get the same energy as if you first predict and then (conceptually) rotate. If you rotate it and predict its dipole moment, the dipole should rotate by the same 90 degrees.
Invariance (energy does not change under rotation) and equivariance (dipole rotates with the molecule) are both useful, and which one you want depends on what you are predicting.
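The energy/dipole distinction can be checked with toy numbers. A sketch using an assumed random "molecule" (positions and charges are made up for illustration): a sum of pairwise distances plays the role of an invariant energy, and a charge-weighted position sum plays the role of an equivariant dipole.

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.standard_normal((4, 3))    # toy "molecule": 4 atoms in 3-D
charge = rng.standard_normal(4)      # toy per-atom charges

# Random rotation R in SO(3) via QR decomposition (flip sign if det < 0).
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q

# Invariant quantity: sum of pairwise distances ("energy"-like).
def energy(p):
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    return d.sum()

# Equivariant quantity: charge-weighted position sum ("dipole"-like).
def dipole(p):
    return charge @ p

assert np.isclose(energy(pos @ R.T), energy(pos))            # invariant
assert np.allclose(dipole(pos @ R.T), dipole(pos) @ R.T)     # rotates along
```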
Why It Matters
Equivariance is a hard constraint, not a soft regularizer. A network that is equivariant by construction will respect the symmetry perfectly on all inputs, not just approximately on training data. This is a strict generalization guarantee: the network cannot learn to violate the symmetry, even with adversarial data. This is why equivariant networks need dramatically less data than unconstrained networks for tasks with known symmetries.
Failure Mode
The symmetry must be exact. If your data has approximate symmetry (e.g., images are roughly but not exactly rotation-invariant because of gravity), enforcing exact equivariance can hurt. The network cannot learn that "up" and "down" are different if you force rotational invariance. In such cases, data augmentation (soft symmetry) may outperform equivariant architectures (hard symmetry).
Why Equivariance Reduces Parameters
Equivariance Implies Weight Sharing
Statement
A linear map W that is equivariant with respect to representations ρ_in and ρ_out of G satisfies:

W ρ_in(g) = ρ_out(g) W for all g ∈ G.

The set of matrices satisfying this constraint is a linear subspace of R^{m×n}. The dimension of this subspace (the number of free parameters) is at most mn, and for a finite group G it can be computed by character orthogonality; it is typically far smaller than mn.
Intuition
The equivariance constraint forces parameter sharing. In a CNN, translation equivariance forces the same filter weights at every position, reducing parameters from one independent weight per pair of input and output positions to the k² entries of a single shared k × k filter. For rotation equivariance, the constraint forces the filter to be "steerable" (a linear combination of a fixed set of basis filters), further reducing parameters.
Fewer free parameters means the function class is smaller, which improves generalization via the bias-variance tradeoff. The bias is increased (you cannot represent symmetry-breaking functions), but the variance decreases (less overfitting) by exactly the right amount when the symmetry holds.
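The weight-sharing claim can be made concrete for a small group. A sketch, assuming the cyclic group C_4 acting on R^4 by circular shift (the same representation on input and output): averaging conjugates of an arbitrary matrix over the group projects it onto the equivariant subspace, which here is the circulant matrices, and projecting a full basis shows 4 free parameters instead of 16.

```python
import numpy as np

n = 4
# Cyclic-shift representation of C_4 on R^4: P @ x == np.roll(x, 1).
P = np.roll(np.eye(n), 1, axis=0)
reps = [np.linalg.matrix_power(P, k) for k in range(n)]

# Project an arbitrary W onto the equivariant subspace by averaging
# the conjugates rho(g)^{-1} W rho(g) over the group (Reynolds operator).
def project(W):
    return sum(r.T @ W @ r for r in reps) / n   # r.T == r^{-1} for permutations

W = np.random.randn(n, n)
Wbar = project(W)
assert np.allclose(Wbar @ P, P @ Wbar)          # Wbar satisfies W P = P W

# Dimension of the subspace: project each of the n*n basis matrices
# and take the rank of the flattened images.
basis_images = np.stack([project(np.eye(n * n)[i].reshape(n, n)).ravel()
                         for i in range(n * n)])
print(np.linalg.matrix_rank(basis_images))      # prints 4: 4 free parameters, not 16
```

Group averaging works for any finite group once the representation matrices are known, but it scales with |G|; for continuous groups one instead solves for a steerable basis analytically.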
Why It Matters
This is why equivariant networks work with less data: the parameter sharing from equivariance is not arbitrary compression, it is compression that perfectly matches the data symmetry. The number of parameters scales inversely with the group size, so larger symmetry groups give more compression. The discrete rotation group C4 (90-degree rotations of a square) gives roughly 4x fewer parameters. The continuous rotation group SO(2) gives, in a sense, infinite compression: filter parameters depend only on the radial profile, not the angle.
Failure Mode
Computing the equivariant subspace requires solving the intertwiner condition W ρ_in(g) = ρ_out(g) W for all g ∈ G, which requires knowledge of the group representations. For simple groups (translations, rotations, permutations), the representations are well-known. For complex or non-standard symmetries, finding the representations is a research problem in itself.
Architectures as Equivariant Networks
| Architecture | Symmetry group | Equivariance type | Domain |
|---|---|---|---|
| CNN | Translation | Feature maps shift with input | Images |
| GNN | Permutation | Output permutes with node reordering | Graphs |
| Transformer | Permutation (on tokens) | Equivariant without positional encoding; positional encodings deliberately break the symmetry | Sequences |
| Steerable CNN | Rotation (C_n or SO(2)) | Feature maps rotate with input | Oriented images |
| SE(3)-Transformer | SE(3) (rotation + translation) | Equivariant on 3D coordinates | Molecules, proteins |
| SchNet / DimeNet | E(3) (Euclidean group) | Invariant predictions, equivariant internal features | Molecular dynamics |
| DeepSets | Permutation | Invariant to set element ordering | Point clouds, sets |
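The DeepSets row can be illustrated in a few lines. A minimal sketch with made-up, untrained weights (`W_phi` and `W_rho` are hypothetical): a per-element encoder followed by sum pooling and a readout yields an output that is invariant to the ordering of set elements.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((3, 8))    # hypothetical per-element encoder weights
W_rho = rng.standard_normal((8, 1))    # hypothetical readout weights

def deepsets(points):                  # points: (n_elements, 3)
    h = np.maximum(points @ W_phi, 0)  # phi applied independently to each element
    pooled = h.sum(axis=0)             # sum pooling: the invariant step
    return pooled @ W_rho              # rho on the pooled summary

x = rng.standard_normal((5, 3))
perm = rng.permutation(5)
assert np.allclose(deepsets(x[perm]), deepsets(x))   # order does not matter
```

Any symmetric pooling (sum, mean, max) would give invariance; sum pooling is the choice analyzed in the Deep Sets paper cited below.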
Common Confusions
Equivariance and invariance are different
Invariance means the output does not change under the group action (f(g · x) = f(x)). Equivariance means the output transforms predictably (f(g · x) = g · f(x)). Predicting molecular energy should be invariant to rotation. Predicting molecular forces should be equivariant (forces rotate with the molecule). Using the wrong one is a modeling error, not just a terminology issue.
Data augmentation is not the same as equivariance
Data augmentation (training on rotated/flipped copies of the data) encourages the network to learn approximate equivariance from data. An equivariant architecture enforces exact equivariance by construction. Augmentation needs more data and may not generalize to unseen transformations. Equivariance guarantees the symmetry holds everywhere. The tradeoff: augmentation is more flexible (works with approximate symmetries), equivariance is more efficient (works with exact symmetries).
Exercises
Problem
A function f : R^n → R is invariant to the permutation group S_n (any reordering of the input coordinates gives the same output). Give three examples of such functions and one example of a function that is not permutation-invariant.
Problem
Explain why a standard MLP (fully connected network) is not equivariant to any non-trivial group action on its inputs, while a CNN is equivariant to translations. What structural property of the CNN enforces this?
References
Canonical:
- Cohen & Welling, "Group Equivariant Convolutional Networks" (ICML 2016). The foundational paper.
- Bronstein et al., "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021). The unifying survey.
Current:
- Weiler & Cesa, "General E(2)-Equivariant Steerable CNNs" (NeurIPS 2019)
- Batzner et al., "E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials" (Nature Communications, 2022)
- Zaheer et al., "Deep Sets" (NeurIPS 2017). Permutation invariance.
Next Topics
- Riemannian optimization: optimization on manifolds where equivariance constraints define the geometry
- Representation learning: how learned representations encode (or fail to encode) data symmetries
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convolutional Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Graph Neural Networks (Layer 3)
- Eigenvalues and Eigenvectors (Layer 0A)