
Foundations

Exponential Function Properties

The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.


Why This Matters

The exponential function is the single most frequently appearing function in machine learning theory.

Softmax: $p_i = e^{z_i} / \sum_j e^{z_j}$. This is how neural networks produce probability distributions.

Moment generating functions: $M_X(t) = \mathbb{E}[e^{tX}]$. The entire Chernoff method is: apply Markov's inequality to $e^{tX}$ and optimize over $t$.

Boltzmann distribution: $p(x) \propto e^{-E(x)/T}$. This connects energy-based models to statistical mechanics.

Exponential families: densities of the form $p(x \mid \theta) = h(x) \exp(\theta^T T(x) - A(\theta))$. Gaussians, Bernoullis, Poissons, and most standard distributions are exponential families.

All of these rely on the same algebraic properties of $\exp$.

Core Definitions

Definition

Exponential Function

The exponential function is defined by the power series:

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots$$

This series converges absolutely for all $x \in \mathbb{R}$ (and all $z \in \mathbb{C}$).

Definition

Natural Logarithm

The natural logarithm is the inverse of $\exp$: $\log(e^x) = x$ for all $x \in \mathbb{R}$, and $e^{\log y} = y$ for all $y > 0$. It is the unique continuous function satisfying $\log(ab) = \log a + \log b$ with $\log(e) = 1$.

Key Properties

The algebraic properties that make $\exp$ special:

  1. Homomorphism: $e^{a+b} = e^a \cdot e^b$. This turns addition into multiplication.
  2. Strict positivity: $e^x > 0$ for all $x \in \mathbb{R}$. The exponential never hits zero.
  3. Own derivative: $\frac{d}{dx} e^x = e^x$. No other function (up to scaling) has this property.
  4. Monotonicity: $e^x$ is strictly increasing: if $a < b$, then $e^a < e^b$.
  5. Convexity: $e^x$ is convex: $e^{\lambda a + (1-\lambda)b} \leq \lambda e^a + (1-\lambda) e^b$ for all $\lambda \in [0, 1]$.

Property 1 is why MGFs of independent sums factor. Property 2 is why $e^{tX}$ is always a valid non-negative random variable for Markov's inequality. Property 5 (via Jensen's inequality) gives $e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]$.
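These properties are easy to check numerically. A minimal sketch in Python (the sample points, step size, and tolerances are arbitrary choices for illustration):

```python
import math

a, b = 1.3, -0.7

# Property 1 (homomorphism): e^(a+b) = e^a * e^b
assert math.isclose(math.exp(a + b), math.exp(a) * math.exp(b))

# Property 2 (strict positivity): e^x > 0 even for very negative x
assert math.exp(-700) > 0

# Property 3 (own derivative): a central difference at x recovers e^x
x, h = 0.5, 1e-6
deriv = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
assert math.isclose(deriv, math.exp(x), rel_tol=1e-8)

# Property 4 (monotonicity): larger input, larger output (a > b here)
assert math.exp(a) > math.exp(b)

# Property 5 (convexity): exp at a mixture lies below the mixture of exps
lam = 0.4
mix = lam * a + (1 - lam) * b
assert math.exp(mix) <= lam * math.exp(a) + (1 - lam) * math.exp(b)
```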

Main Theorems

Theorem

Convergence of the Exponential Series

Statement

The series $\sum_{n=0}^{\infty} x^n/n!$ converges absolutely for every $x \in \mathbb{R}$, and the resulting function satisfies:

$$\frac{d}{dx} e^x = e^x, \quad e^0 = 1, \quad e^{a+b} = e^a e^b$$

Intuition

The factorial n!n! in the denominator grows much faster than any power xnx^n in the numerator. For large nn, the terms xn/n!x^n/n! shrink rapidly to zero regardless of how large xx is.

Proof Sketch

For convergence: $|x^n/n!| \leq |x|^n/n!$, and the ratio test applies since the ratio of consecutive terms is $|x|/(n+1) \to 0$. For the derivative property: differentiate the series term by term (justified by uniform convergence on compact sets). For the product rule: multiply the two series as a Cauchy product and use the binomial identity $\sum_{k=0}^n \binom{n}{k} a^k b^{n-k}/n! = (a+b)^n/n!$ after reindexing.
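The ratio-test argument can be watched in action: each term of the series is the previous term times $x/(n+1)$, so the partial sums converge quickly. A small sketch (the term count of 40 and the test points are arbitrary choices):

```python
import math

def exp_series(x, n_terms=40):
    """Partial sum of the exponential series, built term by term."""
    total, term = 0.0, 1.0          # term starts at x^0 / 0! = 1
    for n in range(n_terms):
        total += term
        term *= x / (n + 1)         # x^(n+1)/(n+1)! from x^n/n!
    return total

# matches math.exp even for moderately large |x|
for x in (-3.0, 0.0, 1.0, 5.0):
    assert math.isclose(exp_series(x), math.exp(x), rel_tol=1e-12)
```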

Why It Matters

Absolute convergence for all $x$ means $\exp$ is defined and well-behaved everywhere. This is why you can always write $e^{tX}$ without worrying about convergence, which makes the Chernoff method universally applicable (even though the resulting bound may be trivial for heavy-tailed distributions).

Failure Mode

The series definition is for the scalar exponential. The matrix exponential $e^A = \sum_{n=0}^{\infty} A^n/n!$ also converges for every square matrix, but $e^{A+B} = e^A e^B$ requires $AB = BA$. Non-commutativity breaks the homomorphism property.

The Exponential as the Unique Eigenfunction

The property $\frac{d}{dx} e^x = e^x$ characterizes $\exp$ uniquely (up to scalar multiples). This makes $\exp$ the eigenfunction of the differentiation operator with eigenvalue 1.

Proposition

Uniqueness of the Exponential

Statement

If $f: \mathbb{R} \to \mathbb{R}$ is differentiable, $f'(x) = f(x)$ for all $x$, and $f(0) = 1$, then $f(x) = e^x$.

Intuition

Consider $g(x) = f(x) e^{-x}$. Differentiating: $g'(x) = f'(x)e^{-x} - f(x)e^{-x} = 0$. So $g$ is constant, and $g(0) = f(0) \cdot 1 = 1$. Therefore $f(x) = e^x$ for all $x$.

Proof Sketch

Define $g(x) = f(x)/e^x$. By the quotient rule, $g'(x) = (f'(x)e^x - f(x)e^x)/e^{2x} = 0$. So $g$ is constant. Evaluating at $x = 0$ gives $g(0) = 1$, hence $f(x) = e^x$.

Why It Matters

This uniqueness is why $e^{tX}$ is the natural choice for generating moment bounds. Any continuous function satisfying the multiplicative property $f(a+b) = f(a)f(b)$ with $f(0) = 1$ must be $e^{cx}$ for some constant $c$. The exponential is not a convenience; it is the only option.

Failure Mode

The assumption $f(0) = 1$ is necessary. Without it, $f(x) = Ce^x$ for any constant $C$ also satisfies $f' = f$. The zero function $f \equiv 0$ is also a solution but fails the initial condition.
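The characterization can be illustrated numerically: solving $f' = f$, $f(0) = 1$ with forward Euler multiplies $f$ by $(1 + h)$ at every step, and the result converges to $e^x$ as the step size shrinks. A quick sketch (the step counts below are arbitrary choices):

```python
import math

def euler_exp(x, steps):
    """Forward Euler for f' = f, f(0) = 1: each step multiplies f by (1 + h)."""
    h = x / steps
    f = 1.0
    for _ in range(steps):
        f += h * f
    return f

# the error shrinks roughly like 1/steps
errors = [abs(euler_exp(1.0, s) - math.e) for s in (10, 100, 1000)]
assert errors[0] > errors[1] > errors[2]
assert errors[2] < 2e-3
```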

Taylor Remainder and Approximation Bounds

Truncating the exponential series after finitely many terms gives a polynomial approximation. The error is controlled by the Lagrange remainder.

For $x \geq 0$, the $n$-th order Taylor polynomial $T_n(x) = \sum_{k=0}^{n} x^k/k!$ satisfies:

$$0 \leq e^x - T_n(x) \leq \frac{x^{n+1}}{(n+1)!} \, e^x$$

For $x \leq 0$, the Taylor polynomials alternate above and below $e^x$, with $|e^x - T_n(x)| \leq |x|^{n+1}/(n+1)!$.

The bound $e^x \geq 1 + x$ (the first-order approximation) is the most commonly used inequality in probability. The second-order bound $e^x \leq 1 + x + x^2$ holds for $x \leq 0$ and is used in proofs of Hoeffding's inequality.
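The remainder bound and both workhorse inequalities can be sanity-checked directly (the sample points below are arbitrary choices):

```python
import math

def taylor(x, n):
    """n-th order Taylor polynomial of exp at 0."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

# Lagrange remainder for x >= 0: 0 <= e^x - T_n(x) <= x^(n+1)/(n+1)! * e^x
x, n = 2.0, 5
err = math.exp(x) - taylor(x, n)
assert 0 <= err <= x**(n + 1) / math.factorial(n + 1) * math.exp(x)

# first-order lower bound e^x >= 1 + x, valid for all x
for x in (-5.0, -0.1, 0.0, 0.1, 5.0):
    assert math.exp(x) >= 1 + x

# second-order upper bound e^x <= 1 + x + x^2, valid for x <= 0
for x in (-3.0, -1.0, -0.01, 0.0):
    assert math.exp(x) <= 1 + x + x**2
```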

Connection to Moment Generating Functions

The MGF $M_X(t) = \mathbb{E}[e^{tX}]$ exploits every property of $\exp$ simultaneously:

  1. Positivity: $e^{tX} > 0$, so $M_X(t) > 0$ and Markov's inequality applies.
  2. Homomorphism: if $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) M_Y(t)$. Sums become products.
  3. Own derivative: $\frac{d^n}{dt^n} M_X(t)\big|_{t=0} = \mathbb{E}[X^n]$. Moments are encoded in derivatives.
  4. Convexity: Jensen's inequality gives $e^{\mathbb{E}[tX]} \leq M_X(t)$.

The Chernoff method applies Markov's inequality to $e^{tX}$:

$$P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}} = M_X(t)\, e^{-ta}$$

Optimizing over $t > 0$ gives the tightest bound. This works precisely because $e^{tx}$ is monotonically increasing in $x$ (for $t > 0$) and non-negative. No other function family gives all these properties simultaneously.
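As a concrete instance (my own illustration, not from the text above): for $X \sim N(0,1)$ the MGF is $M_X(t) = e^{t^2/2}$, so the optimized bound is $e^{-a^2/2}$, attained at $t = a$. A grid search over $t$ recovers the same value:

```python
import math

def chernoff_bound(mgf, a, ts):
    """inf over a grid of t > 0 of M_X(t) * e^(-t a)."""
    return min(mgf(t) * math.exp(-t * a) for t in ts)

def gaussian_mgf(t):
    """MGF of the standard Gaussian: e^(t^2 / 2)."""
    return math.exp(t * t / 2)

a = 2.0
ts = [0.01 * k for k in range(1, 1001)]      # grid happens to include the optimum t = a
grid = chernoff_bound(gaussian_mgf, a, ts)
closed_form = math.exp(-a * a / 2)
assert math.isclose(grid, closed_form, rel_tol=1e-6)

# the bound really does dominate the true tail P(X >= 2) ~ 0.0228
true_tail = 0.5 * math.erfc(a / math.sqrt(2))
assert true_tail <= grid
```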

Watch Out

The Chernoff method does not require bounded variables

The Chernoff bound $P(X \geq a) \leq \inf_{t > 0} M_X(t) e^{-ta}$ is valid for any random variable. If the MGF is infinite for all $t > 0$ (as for heavy-tailed distributions like the Cauchy), the bound is trivially $+\infty$ and useless, but it does not fail. The method requires $M_X(t) < \infty$ for some $t > 0$ to give a nontrivial bound; it does not require boundedness.

The Exponential in Softmax and Neural Networks

The softmax function $p_i = e^{z_i} / \sum_j e^{z_j}$ converts raw logits into a probability distribution. Three properties of $\exp$ make this work.

First, strict positivity guarantees all probabilities are positive. Second, the homomorphism $e^{a+b} = e^a e^b$ means adding a constant to all logits does not change the output: $e^{z_i + c} / \sum_j e^{z_j + c} = e^{z_i} / \sum_j e^{z_j}$. This shift-invariance is what makes the log-sum-exp trick valid. Third, monotonicity ensures that larger logits map to larger probabilities, preserving the ranking.

Example

Numerical softmax computation

Consider logits $z = [1000, 1001, 999]$. Direct computation of $e^{1000}$ overflows float64. The log-sum-exp trick subtracts $c = \max(z) = 1001$:

$e^{1000 - 1001} = e^{-1} \approx 0.368$, $\quad e^{1001 - 1001} = 1$, $\quad e^{999 - 1001} = e^{-2} \approx 0.135$.

Sum: $0.368 + 1 + 0.135 = 1.503$. Softmax: $[0.245, 0.665, 0.090]$.

Without the trick, every intermediate computation produces $+\infty$. With the trick, every intermediate value is between 0 and 1. The shift-invariance property guarantees the result is identical.
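A minimal Python sketch of this stabilized computation (the function name and list-based interface are my own; production code would use a vectorized library routine):

```python
import math

def softmax(logits):
    """Numerically stable softmax via the log-sum-exp trick."""
    c = max(logits)                              # shift so the largest exponent is 0
    exps = [math.exp(z - c) for z in logits]     # every value now lies in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 999.0])
assert math.isclose(sum(probs), 1.0)
print([round(p, 3) for p in probs])              # [0.245, 0.665, 0.09]

# shift-invariance: same result as the already-shifted logits
shifted = softmax([-1.0, 0.0, -2.0])
assert all(math.isclose(p, q) for p, q in zip(probs, shifted))
```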

The bound $e^x \geq 1 + x$ is the workhorse inequality for sub-Gaussian concentration. In the proof of Hoeffding's inequality, you bound $\mathbb{E}[e^{tX}]$ for a bounded, zero-mean random variable. The argument proceeds by noting that $e^{tx}$ is convex in $x$, so it lies below the chord connecting the endpoints of the interval $[a, b]$. Taking expectations and using $\mathbb{E}[X] = 0$ yields the sub-Gaussian MGF bound $\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}$. Every step relies on convexity and the algebraic properties listed above.

The inequality $e^x \leq 1 + x + x^2$ (for $x \leq 0$) appears in proofs where you need an upper bound on $e^x$ near zero. For the Chernoff bound on sums of independent Bernoulli random variables, this second-order approximation converts the MGF into a Gaussian-type bound without requiring the full Taylor series.

Log-Convexity and Log-Sum-Exp

The log-sum-exp function $\mathrm{LSE}(z_1, \ldots, z_n) = \log\left(\sum_{i=1}^n e^{z_i}\right)$ is convex. The softmax function is its gradient:

$$\frac{\partial}{\partial z_i} \mathrm{LSE}(z) = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_i$$

The log-sum-exp is a smooth approximation to the max function:

$$\max_i z_i \leq \mathrm{LSE}(z) \leq \max_i z_i + \log n$$

This approximation is tight when one component dominates.
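Both inequalities, and the two extreme cases, are easy to verify with a direct max-shifted implementation (the test vectors are arbitrary choices):

```python
import math

def lse(z):
    """log-sum-exp, computed stably by shifting out the max."""
    c = max(z)
    return c + math.log(sum(math.exp(v - c) for v in z))

z = [2.0, 0.5, -1.0, 1.9]
assert max(z) <= lse(z) <= max(z) + math.log(len(z))

# tight when one component dominates: LSE collapses to the max
z_dom = [100.0, 0.0, 0.0]
assert abs(lse(z_dom) - max(z_dom)) < 1e-12

# loose (off by exactly log n) when all components are equal
z_eq = [3.0] * 4
assert math.isclose(lse(z_eq), 3.0 + math.log(4))
```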

Common Confusions

Watch Out

$e^x$ is convex, $\log x$ is concave

These are dual properties. Convexity of $e^x$ gives Jensen's inequality: $e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]$. Concavity of $\log$ gives the reverse direction: $\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$. Both forms of Jensen are used constantly. The first appears in MGF bounds; the second in information theory (proving that KL divergence is non-negative).

Watch Out

Numerical overflow in exp

Computing $e^{1000}$ overflows floating point. Computing $\mathrm{softmax}(z)$ requires the log-sum-exp trick: subtract $\max_i z_i$ from all components before exponentiating. This does not change the result (by the homomorphism property) but prevents overflow. Every practical softmax implementation uses this trick.

Exercises

ExerciseCore

Problem

Prove that $e^x \geq 1 + x$ for all $x \in \mathbb{R}$, with equality only at $x = 0$.

ExerciseAdvanced

Problem

Prove that the log-sum-exp function $\mathrm{LSE}(z) = \log\left(\sum_{i=1}^n e^{z_i}\right)$ is convex.

ExerciseCore

Problem

Show that $e^x \leq 1/(1-x)$ for $x < 1$. When is this bound tighter than $e^x \geq 1 + x$?

ExerciseAdvanced

Problem

Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $|X| \leq c$ almost surely. Show that $\mathbb{E}[e^{tX}] \leq e^{t^2 c^2 / 2}$ for all $t \in \mathbb{R}$. This is the key lemma for Hoeffding's inequality.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 8 (power series, exponential and logarithmic functions)
  • Apostol, Mathematical Analysis (1974), Chapter 6 (the exponential function and related functions)
  • Bartle & Sherbert, Introduction to Real Analysis (2011), Chapter 8.3 (the exponential and logarithmic functions)

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 3.1.5 (log-sum-exp convexity)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1 (softmax and numerical stability)
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2 (the Cramér-Chernoff method and exponential bounds)
  • Vershynin, High-Dimensional Probability (2018), Chapter 2.2 (sub-Gaussian properties and MGF bounds)

Last reviewed: April 2026
