

Hyperbolic Embeddings for Graphs

Negatively curved space embeds tree-like graphs with exponentially less distortion than Euclidean space at the same dimension. The Poincaré ball model, Möbius gyrovector arithmetic, and Sala et al.'s sharp distortion bound make this concrete; Ganea et al. lift the construction to neural-network layers.


Why This Matters

A perfect binary tree of depth $r$ has $2^{r+1}-1$ nodes, so the number of nodes within graph distance $r$ of the root grows like $2^r$. A Euclidean ball of radius $r$ in $\mathbb{R}^d$ has volume $\Theta(r^d)$, polynomial in $r$. To fit $2^r$ nodes into a polynomial volume while preserving distances, an embedding must compress: tree leaves get shoved together and pairwise distances are crushed. This is not a tuning issue. It is a volume mismatch between the data and the space.

Hyperbolic space $\mathbb{H}^n$ has volume that grows like $e^{(n-1)r}$ (see non-Euclidean and hyperbolic geometry). The exponential matches the tree. Sala, De Sa, Gu, and Ré (2018) made this quantitative: any tree on $N$ nodes embeds into the Poincaré disk (the two-dimensional model of $\mathbb{H}^2$) with distortion $1+\varepsilon$ using $O((\log N)/\varepsilon)$ bits of precision, while Euclidean embeddings of some trees incur distortion that grows without bound in $N$, no matter the dimension.

This was not folklore. Nickel and Kiela (2017) reported 5-dimensional Poincaré embeddings of WordNet hypernymy beating 200-dimensional Euclidean embeddings on link prediction. Ganea, Bécigneul, and Hofmann (2018) then showed how to build neural-network layers (linear maps, biases, activations, attention) that operate inside $\mathbb{B}^n$ rather than around it.

Setup: the Poincaré ball

Definition

Poincaré ball model

The Poincaré ball is the open unit ball

$$\mathbb{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\},$$

equipped with the Riemannian metric

$$g_x = \lambda_x^2 \, g_{\mathrm{eucl}}, \qquad \lambda_x = \frac{2}{1-\|x\|^2}.$$

The induced distance between $x, y \in \mathbb{B}^n$ is

$$d_{\mathbb{B}}(x,y) = \cosh^{-1}\!\left(1 + \frac{2\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right).$$

The boundary $\{x : \|x\|=1\}$ is at infinite distance from any interior point. The model has constant sectional curvature $-1$.

Two facts make $\mathbb{B}^n$ usable as an embedding space:

  1. Geodesics through the origin are straight lines. The geodesic from $0$ to a point $x \neq 0$ is the Euclidean segment, with hyperbolic arclength $d_{\mathbb{B}}(0,x) = 2\,\mathrm{arctanh}(\|x\|)$.
  2. Other geodesics are circular arcs orthogonal to the boundary $\partial\mathbb{B}^n$. Picture: a "straight line" curves toward the center because pushing along the boundary is metrically expensive (the conformal factor $\lambda_x \to \infty$ as $\|x\| \to 1$).
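Both the distance formula and its radial special case are easy to check numerically. A minimal numpy sketch (the helper name `poincare_dist` is ours, not from any library):

```python
import numpy as np

def poincare_dist(x, y):
    """Geodesic distance in the Poincare ball model (curvature -1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sq_gap = np.sum((x - y) ** 2)
    denom = (1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))
    return np.arccosh(1.0 + 2.0 * sq_gap / denom)

# Fact 1: along a ray through the origin, d_B(0, x) = 2 * arctanh(||x||).
x = np.array([0.6, 0.0])
print(poincare_dist(np.zeros(2), x))   # matches 2 * arctanh(0.6) = ln 4
```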

The distortion bound for tree embeddings

The central result. Distortion of an embedding $f : (V, d_T) \to (M, d_M)$ is the multiplicative gap between graph and embedded distances:

$$\mathrm{dist}(f) = \sup_{u\neq v} \frac{d_M(f(u), f(v))}{d_T(u,v)} \cdot \sup_{u\neq v} \frac{d_T(u,v)}{d_M(f(u), f(v))}.$$

A perfect (isometric) embedding has distortion $1$.
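The definition drops straight into code (a sketch; the helper name is ours). Note that the product form makes distortion invariant to uniform rescaling of the embedding, since expansion and contraction change by reciprocal factors:

```python
from itertools import combinations

def distortion(points, d_graph, d_emb):
    """Worst-case expansion times worst-case contraction over all pairs."""
    pairs = list(combinations(points, 2))
    expansion = max(d_emb(u, v) / d_graph(u, v) for u, v in pairs)
    contraction = max(d_graph(u, v) / d_emb(u, v) for u, v in pairs)
    return expansion * contraction

# A path embedded on the real line, uniformly stretched by 3: still distortion 1.
pts = [0, 1, 2, 3]
d_T = lambda u, v: abs(u - v)
d_M = lambda u, v: 3.0 * abs(u - v)
print(distortion(pts, d_T, d_M))   # 1.0 up to rounding
```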

Theorem

Sala–De Sa–Gu–Ré distortion bound for tree embeddings

For every tree $T$ on $N$ nodes and every $\varepsilon > 0$, there is an embedding $f : T \to \mathbb{B}^2$ with distortion at most $1+\varepsilon$, computable by an explicit combinatorial construction and representable with $O((\log N)/\varepsilon)$ bits of precision per coordinate.

Intuition

Embed the root at the origin. For each node already placed, reserve an angular cone for each child and place the child at hyperbolic distance equal to the edge weight along the bisector of its cone. Because the circumference of a hyperbolic circle of radius $r$ scales like $\sinh r \sim e^r$, the cone available to each child is exponentially larger than the cone of the parent, so even a high-branching tree fits without children colliding. Scaling the whole embedding by a large $\tau$ pushes nodes outward where there is exponential room, controlling angular interference.

Proof Sketch

Embedding construction (Sala et al., Algorithm 1). Place the root $v_0$ at $0 \in \mathbb{B}^2$. Process nodes in BFS order. For a node $u$ at hyperbolic distance $r_u$ from the origin with $k$ children, partition the angular cone at $u$ ($2\pi$ minus the angle back to its parent) into $k$ sectors. Place child $c_j$ at hyperbolic distance $r_u + w(u, c_j)$ along the bisector of its sector, where $w$ is the edge weight.

Why the construction has low distortion. Two nodes $u, v$ in $T$ have tree distance $d_T(u,v) = r_u + r_v - 2 r_{\mathrm{lca}}$, where $\mathrm{lca}$ is their lowest common ancestor. After scaling by $\tau \geq 1$, the embedded points satisfy $d_{\mathbb{B}}(f(u), f(v)) \approx \tau\, d_T(u,v) - O(\log k_{\max})$, where $k_{\max}$ is the max degree. The negative correction term is the angular wiggle from approximating "go up then down" by a hyperbolic geodesic. It is bounded uniformly in $N$, so for $\tau$ large enough relative to $\log k_{\max}/\varepsilon$, every pairwise ratio lies in $[1, 1+\varepsilon]$.

Why precision $O(\log N/\varepsilon)$ bits is needed. Hyperbolic distances near the boundary saturate floating-point precision. A point at hyperbolic distance $R$ has Euclidean norm $\tanh(R/2) = 1 - 2e^{-R} + O(e^{-2R})$, so distinguishing two points at hyperbolic distance $R$ apart requires $\Omega(R)$ bits to keep $1 - \|x\|^2$ from rounding to zero. Scaling pushes typical embedded distances to $\tau \cdot \mathrm{diam}(T) = \Omega(\log N/\varepsilon)$, hence the bound. Sala et al. give an exact ulp-level analysis.
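The float64 version of this saturation is easy to exhibit: past hyperbolic radius $R \approx 38$, $\tanh(R/2)$ is no longer distinguishable from $1$ in double precision, and $1 - \|x\|^2$ underflows to exactly zero. A small demonstration:

```python
import numpy as np

# A point at hyperbolic distance R from the origin has Euclidean norm tanh(R/2).
R = 40.0
norm = np.tanh(R / 2.0)
print(norm == 1.0)       # True: in float64 the point has landed on the boundary
print(1.0 - norm ** 2)   # 0.0: the conformal factor and distance formula blow up
```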

Euclidean lower bound. Bourgain (1985) showed any $N$-point metric embeds into $\ell_2$ with distortion $O(\log N)$, and Linial, London, and Rabinovich (1995) showed this is tight for expanders. Trees are easier but still obstructed: Bourgain (1986) proved the complete binary tree requires distortion $\Omega(\sqrt{\log\log N})$ in $\ell_2$ of any dimension, and Matoušek (1999) showed that bound is sharp. So no Euclidean space, however high-dimensional, embeds all trees with distortion $1+\varepsilon$ for fixed $\varepsilon$; in any fixed dimension $\mathbb{R}^d$, a volume-packing argument forces distortion polynomial in $N$.

Why It Matters

This is the only result that cleanly justifies "use hyperbolic space for hierarchies." The constructive direction tells you the minimum dimension and precision needed. The Euclidean lower bound tells you that no Euclidean parametrization, however clever, can match it asymptotically. Together they explain Nickel and Kiela's 5D-vs-200D result without appealing to mystery: the Euclidean baseline was paying unavoidable distortion that grows with $N$, which a 200-dimensional bag-of-features partially absorbed but did not remove.

Failure Mode

The theorem is about trees and tree-like graphs (low Gromov hyperbolicity $\delta$). Grids, lattices, and dense graphs have $\delta = \Theta(\mathrm{diam})$ and gain nothing. The same construction, applied to a spanning tree of a 2D grid, embeds the grid with distortion $\Omega(\sqrt{N})$ (some grid edge stretches to tree distance $\Omega(\sqrt{N})$), much worse than its trivial Euclidean embedding. Always estimate $\delta$ on a sample of 4-tuples before committing to a hyperbolic embedding (Adcock, Sullivan, and Mahoney 2013 give a sampling estimator).

Möbius gyrovector arithmetic

To build neural networks in $\mathbb{B}^n$ you need an analogue of vector addition that respects the manifold. Ungar's (2008) gyrovector formalism provides one.

Definition

Möbius addition on the Poincaré ball

For $x, y \in \mathbb{B}^n$, the Möbius sum is

$$x \oplus y = \frac{(1 + 2\langle x,y\rangle + \|y\|^2)\, x + (1 - \|x\|^2)\, y}{1 + 2\langle x,y\rangle + \|x\|^2 \|y\|^2}.$$

It is neither associative nor commutative, but it is a smooth left-loop operation: $x \oplus (-x) = 0$, $0 \oplus x = x$, and the map $y \mapsto x \oplus y$ is a smooth bijection $\mathbb{B}^n \to \mathbb{B}^n$. The inverse is $\ominus y = -y$.
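A minimal numpy implementation of $\oplus$ (a sketch; the function name is ours), checking the left-loop identities on random points:

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition on the Poincare ball."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

rng = np.random.default_rng(0)
x, y, z = 0.5 * rng.uniform(-1, 1, size=(3, 2))   # points safely inside B^2

assert np.allclose(mobius_add(np.zeros(2), y), y)          # left identity
assert np.allclose(mobius_add(-x, mobius_add(x, y)), y)    # left cancellation
assert not np.allclose(mobius_add(mobius_add(x, y), z),    # not associative
                       mobius_add(x, mobius_add(y, z)))
```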

Definition

Exponential and logarithmic maps at the origin

The exponential map $\exp_0 : T_0\mathbb{B}^n \cong \mathbb{R}^n \to \mathbb{B}^n$ at the origin is

$$\exp_0(v) = \tanh(\|v\|)\, \frac{v}{\|v\|},$$

and the logarithm is its inverse:

$$\log_0(y) = \mathrm{arctanh}(\|y\|)\, \frac{y}{\|y\|}.$$

$\exp_0$ takes a tangent vector $v$ at $0$ and returns the point of $\mathbb{B}^n$ reached by following the geodesic from $0$ in the direction of $v$ for arclength equal to $v$'s Riemannian norm $2\|v\|$ (the conformal factor at the origin is $\lambda_0 = 2$); indeed $d_{\mathbb{B}}(0, \exp_0(v)) = 2\,\mathrm{arctanh}(\tanh\|v\|) = 2\|v\|$.
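A sketch of the two maps (names ours), checking that they invert each other and that $\exp_0$ moves the origin by hyperbolic distance $2\|v\|$:

```python
import numpy as np

def exp0(v):
    """Exponential map at the origin of the Poincare ball."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else np.zeros_like(v)

def log0(y):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(y)
    return np.arctanh(n) * y / n if n > 0 else np.zeros_like(y)

v = np.array([0.5, -0.3])
y = exp0(v)
assert np.allclose(log0(y), v)   # round trip through the manifold
# d_B(0, y) = 2*arctanh(||y||) equals the Riemannian norm 2*||v|| of v at 0:
assert np.isclose(2 * np.arctanh(np.linalg.norm(y)), 2 * np.linalg.norm(v))
```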

A Ganea-style hyperbolic linear layer is then

$$y = \exp_0\!\left( W \log_0(x) \right) \oplus b,$$

where $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{B}^m$. The construction reads as: lift $x$ to the tangent space at the origin, apply a Euclidean linear map, project back to the manifold, then translate by Möbius addition. Hyperbolic activations, attention scores, and MLR classifiers follow the same lift-act-project pattern.
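Putting the pieces together, a Ganea-style layer is a few lines of numpy. This is a sketch built from the formulas above, not the reference implementation; `hyperbolic_linear` and the helper names are ours:

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def exp0(v):
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else np.zeros_like(v)

def log0(y):
    n = np.linalg.norm(y)
    return np.arctanh(n) * y / n if n > 0 else np.zeros_like(y)

def hyperbolic_linear(W, b, x):
    """Lift x to T_0, apply W, project back, then Mobius-translate by b."""
    return mobius_add(exp0(W @ log0(x)), b)

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2))
b = 0.3 * rng.uniform(-1, 1, size=2)   # bias is a point of B^2
x = 0.4 * rng.uniform(-1, 1, size=2)   # input is a point of B^2
y = hyperbolic_linear(W, b, x)
assert np.linalg.norm(y) < 1.0         # the output never leaves the ball
```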

Watch Out

Hyperbolic distance is not a rescaling of Euclidean distance

The Poincaré distance formula is not just $\|x-y\|$ scaled by a function of position. Two points with small Euclidean separation near the boundary can be very far apart in $d_{\mathbb{B}}$ because $\lambda_x \to \infty$ as $\|x\| \to 1$. This is why hyperbolic embeddings have capacity: the boundary is at infinity, so there is always more room to push deeper subtrees outward.
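To see the blow-up concretely: two collinear points less than $10^{-3}$ apart in Euclidean norm near the boundary are more than $2.3$ hyperbolic units apart (a sketch, restating the distance formula from the definition so the snippet is self-contained):

```python
import numpy as np

def poincare_dist(x, y):
    sq_gap = np.sum((x - y) ** 2)
    denom = (1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))
    return np.arccosh(1.0 + 2.0 * sq_gap / denom)

x = np.array([0.999, 0.0])
y = np.array([0.9999, 0.0])
print(np.linalg.norm(x - y))    # 0.0009 in Euclidean norm
print(poincare_dist(x, y))      # ~2.30 hyperbolic units
```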

Watch Out

Hyperbolic neural networks are not Euclidean networks with a tanh squash

Squashing a Euclidean MLP's output to norm below $1$ (say with a $\tanh$ applied to the norm) produces points that lie in $\mathbb{B}^n$ as a set. That is not a hyperbolic neural network. The intermediate operations (matrix multiplication, addition, normalization) treat the points as Euclidean and ignore the metric. A hyperbolic network uses Möbius operations and Riemannian gradients computed via $\exp_0$ and $\log_0$, so each parameter step respects the geometry. Treating points as Euclidean, then re-projecting after each step, loses exactly the structure the model was supposed to exploit.

Watch Out

Curvature is a hyperparameter, not a fixed truth

Most papers set curvature $K = -1$ for convenience. Curvature scales distances and gradients in ways that interact with optimization and numerical precision. Chami et al. (2019) make $K$ learnable per layer; the learned values often drift toward zero on graphs with low Gromov $\delta$ but high local clustering, indicating that the model wants Euclidean geometry in places. If your model insists on $K \approx 0$ everywhere, that is evidence the dataset is not actually tree-like.

When hyperbolic embeddings help

The right diagnostic is Gromov hyperbolicity $\delta$, the smallest constant such that every geodesic triangle is $\delta$-thin (each side lies within $\delta$ of the union of the other two). Trees have $\delta = 0$; trees with extra cross-edges have small $\delta$; expanders and 2D grids have $\delta = \Theta(\mathrm{diam})$.

Empirically:

| Graph family | Typical $\delta / \mathrm{diam}$ | Hyperbolic gain over Euclidean at fixed dim |
|---|---|---|
| WordNet hypernymy, Gene Ontology, citation chains | $< 0.05$ | Large; Nickel–Kiela report 30%+ MAP improvements at 5–50 dim |
| Social networks (Facebook, citation networks) | $0.05$ to $0.2$ | Moderate; Chami et al. show small but consistent gains |
| Web graphs, biological PPI | $0.2$ to $0.4$ | Marginal; depends on subgraph and task |
| Road networks, 2D meshes, regular grids | $> 0.4$ | None or negative; Euclidean is the right choice |
| Random Erdős–Rényi $G(n,p)$ | $\Theta(\log n / \log np)$ | None for typical $p$ |

The takeaway: estimate $\delta$ before committing. It is a one-pass sampling computation.
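A sketch of that sampling computation (function name ours; it follows the four-point condition, in the spirit of the Adcock–Sullivan–Mahoney estimator): for each sampled quadruple, sort the three perfect-matching distance sums and take half the gap between the two largest.

```python
import numpy as np

def sampled_gromov_delta(D, n_samples=5000, seed=0):
    """Estimate Gromov hyperbolicity from a dense graph-distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    delta = 0.0
    for _ in range(n_samples):
        i, j, k, l = rng.choice(n, size=4, replace=False)
        sums = sorted((D[i, j] + D[k, l],
                       D[i, k] + D[j, l],
                       D[i, l] + D[j, k]), reverse=True)
        delta = max(delta, (sums[0] - sums[1]) / 2.0)
    return delta

# Path graph (a tree): delta = 0.  4-cycle: delta = 1.
idx = np.arange(8)
path = np.abs(idx[:, None] - idx[None, :])
print(sampled_gromov_delta(path))    # 0.0

idx = np.arange(4)
gap = np.abs(idx[:, None] - idx[None, :])
cycle = np.minimum(gap, 4 - gap)
print(sampled_gromov_delta(cycle))   # 1.0
```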

Worked Example: a 7-node binary tree in $\mathbb{B}^2$

Take the perfect binary tree $T$ with root $r$, children $u_1, u_2$, grandchildren $v_{ij}$ for $i, j \in \{1,2\}$. All edges have weight $1$.

Place $f(r) = 0$. Place the two children on opposite rays at hyperbolic distance $1$:

$$f(u_1) = (\tanh(1/2),\, 0), \qquad f(u_2) = (-\tanh(1/2),\, 0).$$

For $u_1$, the angular cone facing away from $r$ is the right half-plane. Split it into two $90^\circ$ sectors and place the grandchildren on the bisectors at hyperbolic distance $1$ from $u_1$. This requires Möbius translation: $f(v_{11}) = f(u_1) \oplus \exp_0\big(\tfrac{1}{2}(\cos 45^\circ, \sin 45^\circ)\big)$, and similarly for $v_{12}$ with angle $-45^\circ$. The tangent vector has Euclidean norm $\tfrac{1}{2}$, hence Riemannian norm $1$, and left translation by $f(u_1)$ is an isometry, so $d_{\mathbb{B}}(f(u_1), f(v_{11})) = 1$ exactly.

The hyperbolic distance $d_{\mathbb{B}}(f(v_{11}), f(v_{12}))$ comes out to roughly $1.51$ against tree distance $2$ (by the hyperbolic law of cosines, it is exactly $\cosh^{-1}(\cosh^2 1)$); the shortfall is the bounded angular correction from the proof sketch. Rescaling every edge weight by $\tau$ before embedding shrinks the relative error: at $\tau = 2$ the embedded distance is $\approx 3.34$ versus scaled tree distance $4$ (ratio $\approx 1.20$), and at $\tau = 4$ it is $\approx 7.31$ versus $8$ (ratio $\approx 1.09$). The additive deficit stays bounded as $\tau$ grows (it tends to $\ln 2$ for this $90^\circ$ pair), so the ratio tends to $1$, illustrating the construction in the proof sketch.
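The worked example can be checked numerically (a sketch; helper names ours, formulas as in the definitions on this page): embed $u_1$ and its two children at edge weight $\tau$ and compare the grandchild gap against the scaled tree distance $2\tau$.

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def exp0(v):
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n

def poincare_dist(x, y):
    sq_gap = np.sum((x - y) ** 2)
    denom = (1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))
    return np.arccosh(1.0 + 2.0 * sq_gap / denom)

def grandchild_gap(tau):
    """Embed u1 and its two children with edge weight tau; return the
    embedded distance between the grandchildren (tree distance 2*tau)."""
    u1 = np.array([np.tanh(tau / 2.0), 0.0])   # d_B(0, u1) = tau
    ang = np.pi / 4
    up = exp0(tau / 2.0 * np.array([np.cos(ang),  np.sin(ang)]))
    dn = exp0(tau / 2.0 * np.array([np.cos(ang), -np.sin(ang)]))
    return poincare_dist(mobius_add(u1, up), mobius_add(u1, dn))

for tau in (1, 2, 4):
    print(tau, grandchild_gap(tau) / (2 * tau))   # ratio climbs toward 1
```

By the isometry of Möbius translation, `grandchild_gap(tau)` agrees with the closed form $\cosh^{-1}(\cosh^2 \tau)$, which is a useful cross-check.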

Exercise (Core)

Problem

Verify directly that the Poincaré ball model has constant curvature $-1$ at the origin. Specifically: compute the geodesic distance from $0$ to a point $x \in \mathbb{B}^n$ along the Euclidean segment, and show it equals $2\,\mathrm{arctanh}(\|x\|)$. Then compute the circumference of the hyperbolic circle of radius $R$ centered at $0$ and show it equals $2\pi \sinh R$, the signature of curvature $-1$.

Exercise (Advanced)

Problem

Prove the Euclidean lower bound in its easy and hard halves: show that the unweighted path $P_N$ on $N$ nodes embeds into $\mathbb{R}^1$ with distortion $1$, while any embedding of the complete binary tree $B_h$ on $N = 2^{h+1}-1$ nodes into Euclidean space of any dimension has distortion $\Omega(\sqrt{\log \log N})$ (Bourgain 1986).

Then explain why "complete binary tree" is the right witness: what specific feature of $B_h$ defeats Euclidean embedding that paths and stars do not have?

Exercise (Advanced)

Problem

Implement Möbius addition $x \oplus y$ in numpy and verify three properties on random points in $\mathbb{B}^2$:

  1. Left identity: $0 \oplus y = y$.
  2. Left cancellation: $(\ominus x) \oplus (x \oplus y) = y$.
  3. Non-associativity: there exist $x, y, z$ such that $(x \oplus y) \oplus z \neq x \oplus (y \oplus z)$.

Then describe what the gyration $\mathrm{gyr}[x,y]$ corrects for and where it appears in the Ganea hyperbolic-MLP code (in Möbius matrix multiplication versus pointwise activation).


Last reviewed: April 18, 2026
