Foundations
Exponential Function Properties
The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
Why This Matters
The exponential function is the single most frequently appearing function in machine learning theory.
Softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. This is how neural networks produce probability distributions.
Moment generating functions: $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$. The entire Chernoff method is: apply Markov's inequality to $e^{\lambda X}$ and optimize over $\lambda$.
Boltzmann distribution: $p(x) \propto e^{-E(x)/T}$. This connects energy-based models to statistical mechanics.
Exponential families: densities of the form $p(x \mid \theta) = h(x)\exp\big(\theta^\top T(x) - A(\theta)\big)$. Gaussians, Bernoullis, Poissons, and most standard distributions are exponential families.
All of these rely on the same algebraic properties of $e^x$.
Core Definitions
Exponential Function
The exponential function is defined by the power series:
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
This series converges absolutely for all $x \in \mathbb{R}$ (and all $z \in \mathbb{C}$).
Natural Logarithm
The natural logarithm $\ln$ is the inverse of $e^x$: $\ln(e^x) = x$ for all $x \in \mathbb{R}$, and $e^{\ln y} = y$ for all $y > 0$. It is the unique function satisfying $\frac{d}{dy}\ln y = \frac{1}{y}$ with $\ln 1 = 0$.
Key Properties
The algebraic properties that make special:
- Homomorphism: $e^{x+y} = e^x e^y$. This turns addition into multiplication.
- Strict positivity: $e^x > 0$ for all $x \in \mathbb{R}$. The exponential never hits zero.
- Own derivative: $\frac{d}{dx} e^x = e^x$. No other function (up to scaling) has this property.
- Monotonicity: $e^x$ is strictly increasing. If $x < y$ then $e^x < e^y$.
- Convexity: $e^x$ is convex: $e^{\lambda x + (1-\lambda)y} \le \lambda e^x + (1-\lambda) e^y$ for $\lambda \in [0, 1]$.
Property 1 is why MGFs of independent sums factor. Property 2 is why $e^{\lambda X}$ is always a valid non-negative random variable for Markov's inequality. Property 5 (via Jensen's inequality) gives $e^{\mathbb{E}[X]} \le \mathbb{E}[e^X]$.
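The five properties above can be spot-checked numerically. This is an illustrative sketch in pure Python (values chosen arbitrarily), not a proof:

```python
import math

x, y = 1.3, -0.7

# 1. Homomorphism: e^(x+y) = e^x * e^y
assert math.isclose(math.exp(x + y), math.exp(x) * math.exp(y))

# 2. Strict positivity: e^x > 0 even for very negative x
assert math.exp(-700) > 0

# 3. Own derivative: central difference of exp at x matches exp(x)
h = 1e-6
deriv = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
assert math.isclose(deriv, math.exp(x), rel_tol=1e-8)

# 4. Monotonicity: x < y implies e^x < e^y
assert math.exp(min(x, y)) < math.exp(max(x, y))

# 5. Convexity: e^(lam*x + (1-lam)*y) <= lam*e^x + (1-lam)*e^y
lam = 0.4
assert math.exp(lam * x + (1 - lam) * y) <= lam * math.exp(x) + (1 - lam) * math.exp(y)
```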
Main Theorems
Convergence of the Exponential Series
Statement
The series $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ converges absolutely for every $x \in \mathbb{R}$, and the resulting function satisfies:
$$\frac{d}{dx} e^x = e^x, \qquad e^{x+y} = e^x e^y$$
Intuition
The factorial in the denominator grows much faster than any power in the numerator. For large $n$, the terms $\frac{x^n}{n!}$ shrink rapidly to zero regardless of how large $x$ is.
Proof Sketch
For convergence, apply the ratio test: the ratio of consecutive terms is $\left|\frac{x^{n+1}/(n+1)!}{x^n/n!}\right| = \frac{|x|}{n+1} \to 0$, so the series converges absolutely. For the derivative property: differentiate the series term by term (justified by uniform convergence on compact sets). For the product rule $e^{x+y} = e^x e^y$: multiply the two series and use the binomial identity after reindexing.
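The ratio-test structure suggests a direct implementation: build each term from the previous one by multiplying by $x/(n+1)$. A minimal sketch comparing truncated partial sums against `math.exp`:

```python
import math

def exp_series(x: float, terms: int) -> float:
    """Partial sum of the series sum_{n} x^n / n!, built incrementally."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)   # ratio of consecutive terms: x/(n+1) -> 0
    return total

# Even for moderately large |x|, the factorial wins quickly.
for x in (1.0, -3.0, 10.0):
    assert math.isclose(exp_series(x, 60), math.exp(x), rel_tol=1e-12)
```

For $x = 10$ the terms grow until $n \approx 10$ and only then start shrinking, which is why a fixed small truncation fails for large $|x|$ but any sufficiently long one succeeds.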
Why It Matters
Absolute convergence for all $x$ means $e^x$ is defined and well-behaved everywhere. This is why you can always write $e^{\lambda X}$ without worrying about convergence, which makes the Chernoff method universally applicable (even though the resulting bound may be trivial for heavy-tailed distributions).
Failure Mode
The series definition is for the scalar exponential. The matrix exponential $e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!}$ also converges for all square matrices, but $e^{A+B} = e^A e^B$ requires $AB = BA$. Non-commutativity breaks the homomorphism property.
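The failure is easy to exhibit concretely. A minimal pure-Python sketch on hand-picked $2 \times 2$ matrices (the nilpotent shift and its transpose, which do not commute), with the matrix exponential computed from the Taylor series:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def expm(A, terms=30):
    """Matrix exponential via the Taylor series sum A^n / n!."""
    result = [[1.0, 0.0], [0.0, 1.0]]   # identity = the n = 0 term
    power = [[1.0, 0.0], [0.0, 1.0]]
    fact = 1.0
    for n in range(1, terms):
        power = mat_mul(power, A)
        fact *= n
        result = mat_add(result, [[power[i][j] / fact for j in range(2)]
                                  for i in range(2)])
    return result

A = [[0.0, 1.0], [0.0, 0.0]]   # nilpotent shift
B = [[0.0, 0.0], [1.0, 0.0]]   # its transpose; AB != BA

lhs = expm(mat_add(A, B))          # exp(A + B)
rhs = mat_mul(expm(A), expm(B))    # exp(A) exp(B)
assert abs(lhs[0][0] - rhs[0][0]) > 0.1   # the homomorphism fails
```

Here $e^{A+B}$ has top-left entry $\cosh 1 \approx 1.543$, while $e^A e^B$ has top-left entry $2$.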
The Exponential as the Unique Eigenfunction
The property $f'(x) = f(x)$ characterizes $e^x$ uniquely (up to scalar multiples). This makes $e^x$ the eigenfunction of the differentiation operator $\frac{d}{dx}$ with eigenvalue 1.
Uniqueness of the Exponential
Statement
If $f : \mathbb{R} \to \mathbb{R}$ is differentiable, $f'(x) = f(x)$ for all $x$, and $f(0) = 1$, then $f(x) = e^x$.
Intuition
Consider $g(x) = f(x) e^{-x}$. Differentiating: $g'(x) = f'(x) e^{-x} - f(x) e^{-x} = 0$. So $g$ is constant, and $g(0) = f(0) = 1$. Therefore $f(x) = e^x$ for all $x$.
Proof Sketch
Define $g(x) = \frac{f(x)}{e^x}$. By the quotient rule, $g'(x) = \frac{f'(x) e^x - f(x) e^x}{e^{2x}} = 0$. So $g$ is constant. Evaluating at $x = 0$ gives $g \equiv 1$, hence $f(x) = e^x$.
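The ODE characterization $f' = f$, $f(0) = 1$ also admits a numerical sanity check: forward Euler on this ODE computes exactly $(1 + x/n)^n$, which recovers the classical limit $(1 + x/n)^n \to e^x$. A minimal sketch:

```python
import math

def euler_exp(x: float, steps: int) -> float:
    """Forward Euler for f' = f, f(0) = 1; equals (1 + x/steps)**steps."""
    f, h = 1.0, x / steps
    for _ in range(steps):
        f += h * f          # f(t + h) ~ f(t) + h * f'(t) = f(t) * (1 + h)
    return f

# Refining the step size drives the approximation toward e^x.
for steps in (10, 100, 10_000):
    print(steps, euler_exp(1.0, steps))
assert abs(euler_exp(1.0, 10_000) - math.e) < 1e-3
```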
Why It Matters
This uniqueness is why $e^x$ is the natural choice for generating moment bounds. Any other function satisfying the multiplicative property $\phi(x+y) = \phi(x)\phi(y)$ with $\phi(0) = 1$ and continuous must be $\phi(x) = e^{cx}$ for some constant $c$. The exponential is not a convenience; it is the only option.
Failure Mode
The assumption $f(0) = 1$ is necessary. Without it, $f(x) = C e^x$ for any constant $C$ also satisfies $f' = f$. The zero function is also a solution but fails the initial condition.
Taylor Remainder and Approximation Bounds
Truncating the exponential series after the degree-$k$ term gives the Taylor polynomial $P_k(x) = \sum_{n=0}^{k} \frac{x^n}{n!}$. The error is controlled by the Lagrange remainder.
For $x \ge 0$, the $k$-th order Taylor polynomial satisfies:
$$0 \le e^x - P_k(x) \le \frac{x^{k+1}}{(k+1)!}\, e^x$$
For $x < 0$, the Taylor polynomial alternates above and below $e^x$, with $|e^x - P_k(x)| \le \frac{|x|^{k+1}}{(k+1)!}$.
The bound $1 + x \le e^x$ (the first-order approximation) is the most commonly used inequality in probability. The second-order bound $e^x \le 1 + x + x^2$ holds for $|x| \le 1$ and is used in proofs of Hoeffding's inequality.
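Both workhorse bounds can be spot-checked on a grid. An illustrative sketch (the second-order check deliberately stays inside $|x| \le 1$, where that bound is stated to hold):

```python
import math

for i in range(-10, 11):
    x = i / 10.0                                 # grid over [-1, 1]
    assert 1 + x <= math.exp(x)                  # first-order: all real x
    assert math.exp(x) <= 1 + x + x * x          # second-order: |x| <= 1

# The first-order bound also holds far outside [-1, 1] ...
assert 1 + 50.0 <= math.exp(50.0)
# ... with equality only at x = 0.
assert math.exp(0.0) == 1.0
```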
Connection to Moment Generating Functions
The MGF $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$ exploits every property of $e^x$ simultaneously:
- Positivity: $e^{\lambda X} > 0$, so $M_X(\lambda) > 0$ and Markov's inequality applies.
- Homomorphism: if $X$ and $Y$ are independent, $M_{X+Y}(\lambda) = M_X(\lambda)\, M_Y(\lambda)$. Sums become products.
- Own derivative: $M_X^{(k)}(0) = \mathbb{E}[X^k]$. Moments are encoded in derivatives.
- Convexity: Jensen's inequality gives $M_X(\lambda) \ge e^{\lambda \mathbb{E}[X]}$.
The Chernoff method applies Markov's inequality to $e^{\lambda X}$:
$$\Pr[X \ge t] = \Pr\big[e^{\lambda X} \ge e^{\lambda t}\big] \le e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big] \quad \text{for all } \lambda > 0$$
Optimizing over $\lambda > 0$ gives the tightest bound. This works precisely because $e^{\lambda x}$ is monotonically increasing in $x$ (for $\lambda > 0$) and non-negative. No other function family gives all these properties simultaneously.
The Chernoff method does not require bounded variables
The Chernoff bound is valid for any random variable whose MGF exists. If the MGF is infinite for all $\lambda > 0$ (as for heavy-tailed distributions like the Cauchy), the bound is trivially $+\infty$ and useless, but it does not fail. The method requires $\mathbb{E}[e^{\lambda X}] < \infty$ for some $\lambda > 0$, not boundedness.
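A concrete illustration of the method: for a sum $S$ of $n$ fair coin flips, optimizing the Chernoff bound yields the Hoeffding form $\Pr[S \ge n/2 + s] \le e^{-2s^2/n}$. The sketch below (arbitrary seed and sample sizes) compares that bound against a Monte Carlo estimate of the tail:

```python
import math
import random

random.seed(0)
n, s, trials = 100, 10, 20_000

# Monte Carlo estimate of P(S >= n/2 + s) for S = sum of n fair coins
hits = sum(
    sum(random.randint(0, 1) for _ in range(n)) >= n / 2 + s
    for _ in range(trials)
)
empirical = hits / trials

# Optimized Chernoff (Hoeffding) bound: exp(-2 s^2 / n)
chernoff = math.exp(-2 * s * s / n)

print(f"empirical tail ~ {empirical:.4f}, Chernoff bound = {chernoff:.4f}")
assert empirical <= chernoff   # the bound holds; it is valid but not tight
```

The true tail here is roughly $0.028$ while the bound gives $e^{-2} \approx 0.135$: valid for every $n$ and $s$, loose by a constant factor, which is typical of Chernoff-type bounds.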
The Exponential in Softmax and Neural Networks
The softmax function $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ converts raw logits into a probability distribution. Three properties of $e^x$ make this work.
First, strict positivity guarantees all probabilities are positive. Second, the homomorphism means adding a constant to all logits does not change the output: $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$, since $e^{z_i + c} = e^c e^{z_i}$ and the factor $e^c$ cancels in the ratio. This shift-invariance is what makes the log-sum-exp trick valid. Third, monotonicity ensures that larger logits map to larger probabilities, preserving the ranking.
Numerical softmax computation
Consider logits $z = (1000, 1001, 1002)$. Direct computation of $e^{1000}$ overflows float64 (which overflows near $e^{710}$). The log-sum-exp trick subtracts $\max_i z_i = 1002$:
$e^{1000 - 1002} = e^{-2} \approx 0.135$, $e^{1001 - 1002} = e^{-1} \approx 0.368$, $e^{1002 - 1002} = e^{0} = 1$.
Sum: $\approx 1.503$. Softmax: $\approx (0.090, 0.245, 0.665)$.
Without the trick, every intermediate computation produces $\infty$. With the trick, every intermediate value is between 0 and 1. The shift-invariance property guarantees the result is identical.
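A minimal stable softmax in pure Python, exercised on illustrative logits $(1000, 1001, 1002)$ that sit far past the float64 overflow threshold for `exp`:

```python
import math

def softmax(z):
    """Numerically stable softmax via max-subtraction (shift-invariance)."""
    m = max(z)                               # subtract the largest logit
    exps = [math.exp(v - m) for v in z]      # every value now in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])
assert math.isclose(sum(probs), 1.0)         # a valid distribution
assert probs[2] > probs[1] > probs[0]        # ranking of logits preserved
# math.exp(1000) by itself raises OverflowError; the shifted version cannot.
```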
The bound $e^x \ge 1 + x$ is the workhorse inequality for sub-Gaussian concentration. In the proof of Hoeffding's inequality, you bound $\mathbb{E}[e^{\lambda X}]$ for a bounded, zero-mean random variable $X \in [a, b]$. The argument proceeds by noting that $e^{\lambda x}$ is convex in $x$, so it lies below the chord connecting the endpoints of the interval $[a, b]$. Taking expectations and using $\mathbb{E}[X] = 0$ yields the sub-Gaussian MGF bound $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}$. Every step relies on convexity and the algebraic properties listed above.
The inequality $e^x \le 1 + x + x^2$ (for $|x| \le 1$) appears in proofs where you need an upper bound on $e^x$ near zero. For the Chernoff bound on sums of independent Bernoulli random variables, this second-order approximation converts the MGF into a Gaussian-type bound without requiring the full Taylor series.
Log-Convexity and Log-Sum-Exp
The log-sum-exp function $\mathrm{LSE}(z) = \log \sum_{i=1}^{n} e^{z_i}$ is convex. The softmax function is its gradient:
$$\frac{\partial}{\partial z_i}\, \mathrm{LSE}(z) = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_i$$
The log-sum-exp is a smooth approximation to the max function:
$$\max_i z_i \le \mathrm{LSE}(z) \le \max_i z_i + \log n$$
This approximation is tight when one component dominates.
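The sandwich bounds, and the tightening when one component dominates, can be demonstrated directly. A sketch with an arbitrary logit vector; scaling the logits by a temperature $t$ makes the largest one dominate:

```python
import math

def logsumexp(z):
    """Stable LSE(z) = log sum_i exp(z_i), via max-subtraction."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

z = [2.0, 1.0, 0.5]

# Sandwich: max(z) <= LSE(z) <= max(z) + log(n)
assert max(z) <= logsumexp(z) <= max(z) + math.log(len(z))

# Scaling by t sharpens the max approximation: (1/t) LSE(t*z) -> max(z)
for t in (1.0, 10.0, 100.0):
    print(t, logsumexp([t * v for v in z]) / t)
assert abs(logsumexp([100.0 * v for v in z]) / 100.0 - max(z)) < 0.01
```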
Common Confusions
e^x is convex, log x is concave
These are dual properties. Convexity of $e^x$ gives Jensen's inequality: $\mathbb{E}[e^X] \ge e^{\mathbb{E}[X]}$. Concavity of $\log$ gives the reverse direction: $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$. Both forms of Jensen are used constantly. The first appears in MGF bounds; the second in information theory (proving KL divergence is non-negative).
Numerical overflow in exp
Computing $e^x$ overflows float64 for $x$ beyond roughly 709. Computing $\log \sum_i e^{z_i}$ for large logits therefore requires the log-sum-exp trick: subtract $\max_i z_i$ from all components before exponentiating. This does not change the result (by the homomorphism property) but prevents overflow. Every practical softmax implementation uses this trick.
Exercises
Problem
Prove that $e^x \ge 1 + x$ for all $x \in \mathbb{R}$, with equality only at $x = 0$.
Problem
Prove that the log-sum-exp function $\mathrm{LSE}(z) = \log \sum_{i=1}^{n} e^{z_i}$ is convex.
Problem
Show that $e^x \le \frac{1}{1-x}$ for $x < 1$. When is this bound tighter than $e^x \le 1 + x + x^2$?
Problem
Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \le X \le b$ almost surely. Show that $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}$ for all $\lambda \in \mathbb{R}$. This is the key lemma for Hoeffding's inequality.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapter 8 (power series, exponential and logarithmic functions)
- Apostol, Mathematical Analysis (1974), Chapter 6 (the exponential function and related functions)
- Bartle & Sherbert, Introduction to Real Analysis (2011), Chapter 8.3 (the exponential and logarithmic functions)
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 3.1.5 (log-sum-exp convexity)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1 (softmax and numerical stability)
- Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2 (the Cramér-Chernoff method and exponential bounds)
- Vershynin, High-Dimensional Probability (2018), Chapter 2.2 (sub-Gaussian properties and MGF bounds)
Last reviewed: April 2026