
Foundations

Common Inequalities

The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Holder, Minkowski, Young, Markov, and Chebyshev.

Core · Tier 1 · Stable · ~50 min

Why This Matters

Inequalities are the grammar of mathematical proofs. In ML theory, you cannot write a single generalization bound, convergence proof, or complexity argument without reaching for at least one inequality from this page. Cauchy-Schwarz bounds inner products. Jensen connects convex functions to expectations. Markov and Chebyshev give tail bounds.

These are not abstract exercises. They are tools you will use on every page of this site. Know them so well that applying them is automatic.

Mental Model

The inequalities on this page fall into three groups:

  1. Inner product / norm inequalities: Cauchy-Schwarz, Holder, Minkowski. These relate different norms and inner products. Used to bound terms that involve products or sums.

  2. Convexity inequalities: Jensen, AM-GM, Young. These exploit the curvature of convex (or concave) functions. Jensen is the most important single inequality in ML theory.

  3. Probabilistic inequalities: Markov, Chebyshev. These bound tail probabilities using moment information. They are the foundation for all concentration inequalities.

Inner Product and Norm Inequalities

Theorem

Cauchy-Schwarz Inequality

Statement

For any vectors $u, v$ in an inner product space:

$$|\langle u, v \rangle| \leq \|u\| \cdot \|v\|$$

Equality holds if and only if $u$ and $v$ are linearly dependent.

For random variables: $|\mathbb{E}[XY]| \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}$.

For sums: $\left(\sum_i a_i b_i\right)^2 \leq \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$.

Intuition

The inner product of two vectors cannot exceed the product of their lengths. This generalizes the fact that $|\cos\theta| \leq 1$. In probability, it says the correlation between two random variables is at most 1 in absolute value.

Proof Sketch

For any $t \in \mathbb{R}$, $\|u + tv\|^2 \geq 0$. Expanding:

$$\|u\|^2 + 2t\langle u, v\rangle + t^2\|v\|^2 \geq 0$$

This is a non-negative quadratic in $t$, so its discriminant is non-positive:

$$4\langle u, v\rangle^2 - 4\|u\|^2\|v\|^2 \leq 0$$

which gives $|\langle u, v\rangle| \leq \|u\| \cdot \|v\|$.

Why It Matters

Cauchy-Schwarz is the most frequently used inequality in all of mathematics. In ML theory, it appears when: bounding inner products in kernel methods, proving convergence of gradient descent, analyzing Rademacher complexity, and controlling cross-terms in bias-variance decompositions.

Failure Mode

Cauchy-Schwarz gives the tightest possible bound for inner products in terms of norms alone, but it can still be loose if $u$ and $v$ are nearly orthogonal (the bound gives $\|u\|\|v\|$ while the true value is near zero). When you have structural information about the angle between $u$ and $v$, tighter bounds are possible.
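To see the looseness concretely, here is a small sketch with exactly orthogonal vectors, where the true inner product is zero but the bound is not:

```python
import math

# Orthogonal unit vectors: the inner product is 0, but Cauchy-Schwarz
# can only report the product of the norms
u = [1.0, 0.0]
v = [0.0, 1.0]

inner = abs(sum(x * y for x, y in zip(u, v)))
bound = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))

assert inner == 0.0 and bound == 1.0  # the bound holds, but is maximally loose here
```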

Definition

Holder Inequality

For $p, q \geq 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ (conjugate exponents):

$$\sum_i |a_i b_i| \leq \left(\sum_i |a_i|^p\right)^{1/p} \left(\sum_i |b_i|^q\right)^{1/q}$$

Or in integral form: $\int |fg| \leq \|f\|_p \|g\|_q$.

Special case: $p = q = 2$ gives Cauchy-Schwarz.

When it arises: bounding mixed-norm expressions, dual norm computations ($\|x\|_p^* = \|x\|_q$), and proving interpolation inequalities. The duality between $\ell_p$ and $\ell_q$ norms underpins convex analysis.
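A quick numerical check of Holder for a non-trivial conjugate pair (the choice $p = 3$, $q = 3/2$ is arbitrary):

```python
import random

random.seed(1)
p = 3.0
q = p / (p - 1)  # conjugate exponent, so 1/p + 1/q = 1 (here q = 1.5)

for _ in range(1000):
    a = [random.uniform(-2, 2) for _ in range(8)]
    b = [random.uniform(-2, 2) for _ in range(8)]
    lhs = sum(abs(x * y) for x, y in zip(a, b))
    rhs = (sum(abs(x) ** p for x in a) ** (1 / p)
           * sum(abs(y) ** q for y in b) ** (1 / q))
    assert lhs <= rhs + 1e-9  # Holder: sum |a_i b_i| <= ||a||_p ||b||_q
```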

Definition

Minkowski Inequality (Triangle Inequality for L^p)

For $p \geq 1$:

$$\left(\sum_i |a_i + b_i|^p\right)^{1/p} \leq \left(\sum_i |a_i|^p\right)^{1/p} + \left(\sum_i |b_i|^p\right)^{1/p}$$

In short: $\|a + b\|_p \leq \|a\|_p + \|b\|_p$.

This is the triangle inequality for $\ell_p$ norms, proven using Holder. It confirms that $\|\cdot\|_p$ is a valid norm for $p \geq 1$. (For $p < 1$, the triangle inequality fails and $\|\cdot\|_p$ is not a norm.)
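The parenthetical about $p < 1$ can be demonstrated directly; a sketch using the standard basis vectors:

```python
def lp(v, p):
    """The l_p 'norm' (a true norm only for p >= 1)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

a = [1.0, 0.0]
b = [0.0, 1.0]
s = [x + y for x, y in zip(a, b)]

# Minkowski holds for p >= 1
for p in (1.0, 1.5, 2.0, 3.0):
    assert lp(s, p) <= lp(a, p) + lp(b, p) + 1e-12

# For p = 0.5: ||a + b||_p = 2^(1/p) = 4 > 2 = ||a||_p + ||b||_p
assert lp(s, 0.5) > lp(a, 0.5) + lp(b, 0.5)
```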

Convexity Inequalities

Figure: Jensen's inequality, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$ for convex $f$. The chord over $[x_1, x_2]$ lies above the curve, and the gap at $\mathbb{E}[X]$ is $\mathbb{E}[f(X)] - f(\mathbb{E}[X])$.
Theorem

Jensen Inequality

Statement

If $f$ is a convex function and $X$ is a random variable:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

If $f$ is concave, the inequality reverses: $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$.

Finite form: for weights $w_i \geq 0$ with $\sum_i w_i = 1$:

$$f\!\left(\sum_i w_i x_i\right) \leq \sum_i w_i f(x_i)$$

Intuition

A convex function "bows upward." The value of $f$ at the average point is at most the average of $f$ at the individual points. Geometrically: the chord connecting two points on a convex curve lies above the curve.

Proof Sketch

By convexity, $f$ has a supporting hyperplane at $\mathbb{E}[X]$: there exists $a$ such that $f(x) \geq f(\mathbb{E}[X]) + a(x - \mathbb{E}[X])$ for all $x$. Taking expectations:

$$\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) + a \cdot \mathbb{E}[X - \mathbb{E}[X]] = f(\mathbb{E}[X])$$

since $\mathbb{E}[X - \mathbb{E}[X]] = 0$.
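The statement is easy to check by Monte Carlo; a sketch with the convex function $f(x) = x^2$ (the sample size and the Gaussian distribution are arbitrary illustrative choices):

```python
import random

random.seed(2)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def f(x):
    return x * x  # convex

mean_x = sum(xs) / len(xs)
f_of_mean = f(mean_x)                        # f(E[X]), roughly 0
mean_of_f = sum(f(x) for x in xs) / len(xs)  # E[f(X)], roughly Var(X) = 1

assert f_of_mean <= mean_of_f  # Jensen: f(E[X]) <= E[f(X)]
```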

Why It Matters

Jensen is arguably the single most important inequality in ML theory. Applications include:

  • EM algorithm: The ELBO is a lower bound on the log-likelihood because $\log$ is concave: $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$.
  • KL divergence is non-negative: follows from Jensen applied to the convex function $-\log$.
  • Convex loss functions: $f(\mathbb{E}[\hat{w}]) \leq \mathbb{E}[f(\hat{w})]$ relates the risk of the average model to the average risk.
  • Information theory: entropy bounds, data processing inequality.

Failure Mode

Jensen gives a direction of inequality but not its magnitude. The gap $\mathbb{E}[f(X)] - f(\mathbb{E}[X])$ can be anywhere from zero (when $X$ is constant or $f$ is linear) to very large (when $X$ has high variance and $f$ is strongly convex). Tighter bounds require knowledge of the variance of $X$ and the curvature of $f$.
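For $f(x) = x^2$ the Jensen gap is exactly $\mathrm{Var}(X)$, which makes the variance dependence easy to see numerically (a sketch; the sample size and standard deviations are arbitrary):

```python
import random

random.seed(3)

def jensen_gap(sigma, n=200_000):
    """E[f(X)] - f(E[X]) for f(x) = x^2; this gap equals Var(X)."""
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

g1, g2 = jensen_gap(1.0), jensen_gap(3.0)
assert 0 <= g1 < g2         # the gap is non-negative and grows with Var(X)
assert abs(g2 - 9.0) < 0.5  # close to sigma^2 = 9
```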

Definition

AM-GM Inequality (Arithmetic Mean - Geometric Mean)

For non-negative reals $a_1, \ldots, a_n$:

$$\frac{a_1 + a_2 + \cdots + a_n}{n} \geq (a_1 a_2 \cdots a_n)^{1/n}$$

Equality holds if and only if all $a_i$ are equal.

Two-variable form: $\frac{a + b}{2} \geq \sqrt{ab}$, or equivalently $ab \leq \frac{(a + b)^2}{4}$.

Proof from Jensen: apply Jensen to the concave function $f(x) = \log x$: $\log\!\left(\frac{1}{n}\sum_i a_i\right) \geq \frac{1}{n}\sum_i \log a_i = \log\!\left(\prod_i a_i\right)^{1/n}$.

When it arises: Bounding products by sums (which are easier to handle), optimizing expressions involving products, proving convergence rates.
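A direct numerical check of AM-GM, including the equality case (the random inputs are illustrative):

```python
import math
import random

random.seed(4)

for _ in range(1000):
    a = [random.uniform(0.01, 10.0) for _ in range(5)]
    am = sum(a) / len(a)
    gm = math.prod(a) ** (1.0 / len(a))
    assert gm <= am + 1e-9  # geometric mean <= arithmetic mean

# Equality iff all entries are equal
a = [2.5] * 5
assert abs(sum(a) / 5 - math.prod(a) ** (1.0 / 5)) < 1e-9
```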

Definition

Young Inequality

For $a, b \geq 0$ and conjugate exponents $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$:

$$ab \leq \frac{a^p}{p} + \frac{b^q}{q}$$

Special case ($p = q = 2$): $ab \leq \frac{a^2}{2} + \frac{b^2}{2}$, which is AM-GM for squares.

Parametric form: for any $\epsilon > 0$, $ab \leq \frac{\epsilon a^2}{2} + \frac{b^2}{2\epsilon}$. This is extremely useful in proofs where you need to "split" a product into two terms with different scaling.

When it arises: Splitting cross-terms in energy estimates, proving Holder from Young, absorbing error terms in convergence proofs. The parametric form lets you trade off how much of the bound goes into each term.
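Both forms of Young can be checked on random inputs; a sketch (the exponent and $\epsilon$ ranges are arbitrary choices):

```python
import random

random.seed(5)

for _ in range(1000):
    a, b = random.uniform(0.0, 10.0), random.uniform(0.0, 10.0)

    # Conjugate-exponent form: ab <= a^p/p + b^q/q
    p = random.uniform(1.1, 5.0)
    q = p / (p - 1)
    assert a * b <= a ** p / p + b ** q / q + 1e-9

    # Parametric form: ab <= eps*a^2/2 + b^2/(2*eps),
    # equivalent to (eps*a - b)^2 >= 0
    eps = random.uniform(0.1, 10.0)
    assert a * b <= eps * a * a / 2 + b * b / (2 * eps) + 1e-9
```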

Probabilistic Inequalities

Definition

Markov Inequality

If $X \geq 0$, then for any $t > 0$:

$$\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$$

Proof: $\mathbb{E}[X] \geq \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] \geq t \cdot \mathbb{P}(X \geq t)$.

When it arises: Markov is the foundation of all tail bounds. Chebyshev applies Markov to $(X - \mu)^2$. The Chernoff method applies Markov to $e^{\lambda X}$. Every concentration inequality starts with Markov.

Limitation: the bound decays only as $1/t$, far too slow for most applications. You almost always want Hoeffding or Bernstein instead.
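The looseness of the $1/t$ decay is visible empirically; a sketch with exponential samples (the distribution and threshold are arbitrary illustrative choices):

```python
import random

random.seed(6)
n = 200_000
xs = [random.expovariate(1.0) for _ in range(n)]  # X >= 0 with E[X] = 1

t = 5.0
empirical = sum(x >= t for x in xs) / n  # true tail P(X >= 5) = e^-5, about 0.0067
markov = (sum(xs) / n) / t               # Markov bound E[X]/t, about 0.2

assert empirical <= markov  # the bound holds, but is off by a factor of roughly 30
```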

Definition

Chebyshev Inequality

For any random variable $X$ with mean $\mu$ and finite variance $\sigma^2$:

$$\mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$

Proof: apply Markov to $(X - \mu)^2$: $\mathbb{P}(|X - \mu| \geq t) = \mathbb{P}((X-\mu)^2 \geq t^2) \leq \sigma^2/t^2$.

When it arises: proving weak laws of large numbers, giving initial sample complexity bounds ($n = O(1/(\epsilon^2\delta))$ to estimate a mean within $\epsilon$ at confidence $1 - \delta$). The polynomial tail $1/t^2$ is worse than the exponential tails from Hoeffding, but Chebyshev requires only finite variance, with no boundedness assumption.
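A quick empirical check of Chebyshev on standard normal samples (the Gaussian is an arbitrary choice here; Chebyshev only needs finite variance):

```python
import random

random.seed(7)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # mu = 0, sigma^2 = 1

for t in (1.0, 2.0, 3.0):
    empirical = sum(abs(x) >= t for x in xs) / n
    chebyshev = 1.0 / (t * t)  # sigma^2 / t^2
    assert empirical <= chebyshev  # holds, with a large margin for Gaussian tails
```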

Applications in ML Theory

Here is how each inequality appears in practice:

  • Cauchy-Schwarz: bounding inner products in kernel methods, RKHS norms
  • Holder: dual norm computations, mixed-norm regularization
  • Jensen ($\log$ concave): EM algorithm ELBO, KL divergence non-negativity
  • Jensen (convex loss): relating ensemble risk to average individual risk
  • AM-GM: bounding products in convergence rate proofs
  • Young (parametric): splitting cross-terms with tunable $\epsilon$
  • Markov: starting point for all exponential tail bounds
  • Chebyshev: quick variance-based concentration when boundedness is unavailable

Canonical Examples

Example

Jensen in the EM algorithm

The EM algorithm maximizes a lower bound on the log-likelihood. For any distribution $q(z)$ over latent variables:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}$$

The inequality is Jensen applied to the concave function $\log$: $\log(\mathbb{E}_q[f(Z)]) \geq \mathbb{E}_q[\log f(Z)]$. The right-hand side is the ELBO (Evidence Lower BOund).
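The ELBO inequality, and equality at the exact posterior, can be verified on a toy discrete latent-variable model (the joint probabilities below are made-up illustrative numbers):

```python
import math

# Toy joint p(x, z | theta) at one fixed observation x, with z in {0, 1}
p_xz = {0: 0.1, 1: 0.3}
log_lik = math.log(sum(p_xz.values()))  # log p(x | theta) = log 0.4

def elbo(q):
    """sum_z q(z) log[p(x, z | theta) / q(z)] for a distribution q over z."""
    return sum(q[z] * math.log(p_xz[z] / q[z]) for z in q if q[z] > 0)

# Any valid q gives a lower bound on the log-likelihood (Jensen)
for q in ({0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.25, 1: 0.75}):
    assert elbo(q) <= log_lik + 1e-12

# The bound is tight at the posterior q(z) = p(z | x)
posterior = {z: p / sum(p_xz.values()) for z, p in p_xz.items()}
assert abs(elbo(posterior) - log_lik) < 1e-12
```

Choosing $q$ to close this gap is exactly the E-step of EM.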

Example

Cauchy-Schwarz in bounding gradient norms

In gradient descent analysis, you often need to bound $\langle \nabla f(x), x - x^* \rangle$. By Cauchy-Schwarz:

$$\langle \nabla f(x), x - x^* \rangle \leq \|\nabla f(x)\| \cdot \|x - x^*\|$$

This converts an inner product (hard to control) into a product of norms (each of which can be bounded separately).

Example

Young inequality to split cross-terms

In proving convergence of gradient descent, terms like $2\langle \nabla f(x), e\rangle$ arise (where $e$ is an error). Using the parametric Young inequality:

$$2|\langle \nabla f(x), e\rangle| \leq 2\|\nabla f(x)\|\|e\| \leq \epsilon\|\nabla f(x)\|^2 + \frac{1}{\epsilon}\|e\|^2$$

By choosing ϵ\epsilon appropriately, the gradient term can be absorbed into the main descent lemma while the error term is controlled separately.

Common Confusions

Watch Out

Jensen goes the wrong way for concave functions

For convex $f$: $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. For concave $f$: $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$.

Students frequently apply Jensen to $\log$ (concave) and get the direction wrong. Remember: $\log(\mathbb{E}[X]) \geq \mathbb{E}[\log X]$ (concave Jensen), which gives $\text{KL} \geq 0$ and the EM lower bound.

Watch Out

Cauchy-Schwarz is a special case of Holder, not the other way around

Holder with $p = q = 2$ gives Cauchy-Schwarz. Holder is strictly more general (it works for any conjugate pair $p, q$), but Cauchy-Schwarz is by far the most commonly used special case.

Summary

  • Cauchy-Schwarz: $|\langle u, v\rangle| \leq \|u\|\|v\|$, the universal inner product bound
  • Holder: $\sum_i |a_i b_i| \leq \|a\|_p\|b\|_q$, generalizes Cauchy-Schwarz to $\ell_p$ spaces
  • Jensen: $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$ for convex $f$, the most important inequality in ML theory
  • AM-GM: arithmetic mean $\geq$ geometric mean, bounds products by sums
  • Young: $ab \leq a^p/p + b^q/q$, splits products with tunable balance
  • Markov: $\mathbb{P}(X \geq t) \leq \mathbb{E}[X]/t$, the foundation of all tail bounds
  • Chebyshev: $\mathbb{P}(|X - \mu| \geq t) \leq \sigma^2/t^2$, variance-based tails

Exercises

ExerciseCore

Problem

Use Jensen's inequality to show that the KL divergence is non-negative: $D_{\text{KL}}(p \| q) = \mathbb{E}_p[\log(p(X)/q(X))] \geq 0$.

ExerciseCore

Problem

Use the parametric Young inequality to prove that for any $a, b \geq 0$ and $\epsilon > 0$: $2ab \leq \epsilon a^2 + b^2/\epsilon$.

ExerciseAdvanced

Problem

Prove Holder's inequality for sums: if $\frac{1}{p} + \frac{1}{q} = 1$ with $p, q > 1$, then $\sum_i |a_i b_i| \leq \|a\|_p \|b\|_q$.

References

Canonical:

  • Hardy, Littlewood & Pólya, Inequalities (1934): the classic reference
  • Steele, The Cauchy-Schwarz Master Class (2004): an excellent problem-driven treatment

Current:

  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 1
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2 for Jensen applications

Next Topics

Building on these foundational inequalities:

Last reviewed: April 2026
