Foundations
Exponential Function Properties
The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
Why This Matters
The exponential function is the single most frequently appearing function in machine learning theory.
Softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. This is how neural networks produce probability distributions.
Moment generating functions: $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$. The entire Chernoff method is: apply Markov's inequality to $e^{\lambda X}$ and optimize over $\lambda$.
Boltzmann distribution: $p(x) \propto e^{-E(x)/T}$. This connects energy-based models to statistical mechanics.
Exponential families: densities of the form $p(x \mid \theta) = h(x)\exp\big(\theta^\top T(x) - A(\theta)\big)$. Gaussians, Bernoullis, Poissons, and most standard distributions are exponential families.
All of these rely on the same algebraic properties of $e^x$.
Core Definitions
Exponential Function
The exponential function is defined by the power series:
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
This series converges absolutely for all $x \in \mathbb{R}$ (and all $z \in \mathbb{C}$).
Natural Logarithm
The natural logarithm $\ln$ is the inverse of $e^x$: $\ln(e^x) = x$ for all $x \in \mathbb{R}$, and $e^{\ln y} = y$ for all $y > 0$. It is the unique function satisfying $\frac{d}{dy}\ln y = \frac{1}{y}$ with $\ln 1 = 0$.
Key Properties
The algebraic properties that make special:
- Homomorphism: $e^{x+y} = e^x e^y$. This turns addition into multiplication.
- Strict positivity: $e^x > 0$ for all $x \in \mathbb{R}$. The exponential never hits zero.
- Own derivative: $\frac{d}{dx} e^x = e^x$. No other function (up to scaling) has this property.
- Monotonicity: $e^x$ is strictly increasing. If $x < y$ then $e^x < e^y$.
- Convexity: $e^x$ is convex: $e^{\lambda x + (1-\lambda)y} \le \lambda e^x + (1-\lambda) e^y$ for $\lambda \in [0, 1]$.
Property 1 is why MGFs of independent sums factor. Property 2 is why $e^{\lambda X}$ is always a valid non-negative random variable for Markov's inequality. Property 5 (via Jensen's inequality) gives $e^{\mathbb{E}[X]} \le \mathbb{E}[e^X]$.
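The five properties above can be spot-checked numerically. This is an illustrative sketch in pure Python (values chosen arbitrarily), not a proof:

```python
import math

x, y = 1.3, -0.7

# 1. Homomorphism: e^(x+y) = e^x * e^y
assert math.isclose(math.exp(x + y), math.exp(x) * math.exp(y))

# 2. Strict positivity: e^x > 0 even for very negative x
assert math.exp(-700) > 0

# 3. Own derivative: central difference of exp at x matches exp(x)
h = 1e-6
deriv = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
assert math.isclose(deriv, math.exp(x), rel_tol=1e-8)

# 4. Monotonicity: x < y implies e^x < e^y
assert math.exp(min(x, y)) < math.exp(max(x, y))

# 5. Convexity: e^(lam*x + (1-lam)*y) <= lam*e^x + (1-lam)*e^y
lam = 0.4
assert math.exp(lam * x + (1 - lam) * y) <= lam * math.exp(x) + (1 - lam) * math.exp(y)
```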
Main Theorems
Convergence of the Exponential Series
Statement
The series $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ converges absolutely for every $x \in \mathbb{R}$, and the resulting function satisfies:
$$\frac{d}{dx} e^x = e^x, \qquad e^{x+y} = e^x e^y$$
Intuition
The factorial in the denominator grows much faster than any power in the numerator. For large $n$, the terms $\frac{x^n}{n!}$ shrink rapidly to zero regardless of how large $x$ is.
Proof Sketch
For convergence, apply the ratio test: the ratio of consecutive terms is $\left|\frac{x^{n+1}/(n+1)!}{x^n/n!}\right| = \frac{|x|}{n+1} \to 0$, so the series converges absolutely. For the derivative property: differentiate the series term by term (justified by uniform convergence on compact sets). For the product rule $e^{x+y} = e^x e^y$: multiply the two series and use the binomial identity after reindexing.
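The ratio-test structure suggests a direct implementation: build each term from the previous one by multiplying by $x/(n+1)$. A minimal sketch comparing truncated partial sums against `math.exp`:

```python
import math

def exp_series(x: float, terms: int) -> float:
    """Partial sum of the series sum_{n} x^n / n!, built incrementally."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)   # ratio of consecutive terms: x/(n+1) -> 0
    return total

# Even for moderately large |x|, the factorial wins quickly.
for x in (1.0, -3.0, 10.0):
    assert math.isclose(exp_series(x, 60), math.exp(x), rel_tol=1e-12)
```

For $x = 10$ the terms grow until $n \approx 10$ and only then start shrinking, which is why a fixed small truncation fails for large $|x|$ but any sufficiently long one succeeds.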
Why It Matters
Absolute convergence for all $x$ means $e^x$ is defined and well-behaved everywhere. This is why you can always write $e^{\lambda X}$ without worrying about convergence, which makes the Chernoff method universally applicable (even though the resulting bound may be trivial for heavy-tailed distributions).
Failure Mode
The series definition is for the scalar exponential. The matrix exponential $e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!}$ also converges for all square matrices, but $e^{A+B} = e^A e^B$ requires $AB = BA$. Non-commutativity breaks the homomorphism property.
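The failure is easy to exhibit concretely. A minimal pure-Python sketch on hand-picked $2 \times 2$ matrices (the nilpotent shift and its transpose, which do not commute), with the matrix exponential computed from the Taylor series:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def expm(A, terms=30):
    """Matrix exponential via the Taylor series sum A^n / n!."""
    result = [[1.0, 0.0], [0.0, 1.0]]   # identity = the n = 0 term
    power = [[1.0, 0.0], [0.0, 1.0]]
    fact = 1.0
    for n in range(1, terms):
        power = mat_mul(power, A)
        fact *= n
        result = mat_add(result, [[power[i][j] / fact for j in range(2)]
                                  for i in range(2)])
    return result

A = [[0.0, 1.0], [0.0, 0.0]]   # nilpotent shift
B = [[0.0, 0.0], [1.0, 0.0]]   # its transpose; AB != BA

lhs = expm(mat_add(A, B))          # exp(A + B)
rhs = mat_mul(expm(A), expm(B))    # exp(A) exp(B)
assert abs(lhs[0][0] - rhs[0][0]) > 0.1   # the homomorphism fails
```

Here $e^{A+B}$ has top-left entry $\cosh 1 \approx 1.543$, while $e^A e^B$ has top-left entry $2$.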
The Exponential as the Unique Eigenfunction
The property $f'(x) = f(x)$ characterizes $e^x$ uniquely (up to scalar multiples). This makes $e^x$ the eigenfunction of the differentiation operator $\frac{d}{dx}$ with eigenvalue 1.
Uniqueness of the Exponential
Statement
If $f : \mathbb{R} \to \mathbb{R}$ is differentiable, $f'(x) = f(x)$ for all $x$, and $f(0) = 1$, then $f(x) = e^x$.
Intuition
Consider $g(x) = f(x) e^{-x}$. Differentiating: $g'(x) = f'(x) e^{-x} - f(x) e^{-x} = 0$. So $g$ is constant, and $g(0) = f(0) = 1$. Therefore $f(x) = e^x$ for all $x$.
Proof Sketch
Define $g(x) = \frac{f(x)}{e^x}$. By the quotient rule, $g'(x) = \frac{f'(x) e^x - f(x) e^x}{e^{2x}} = 0$. So $g$ is constant. Evaluating at $x = 0$ gives $g \equiv 1$, hence $f(x) = e^x$.
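The ODE characterization $f' = f$, $f(0) = 1$ also admits a numerical sanity check: forward Euler on this ODE computes exactly $(1 + x/n)^n$, which recovers the classical limit $(1 + x/n)^n \to e^x$. A minimal sketch:

```python
import math

def euler_exp(x: float, steps: int) -> float:
    """Forward Euler for f' = f, f(0) = 1; equals (1 + x/steps)**steps."""
    f, h = 1.0, x / steps
    for _ in range(steps):
        f += h * f          # f(t + h) ~ f(t) + h * f'(t) = f(t) * (1 + h)
    return f

# Refining the step size drives the approximation toward e^x.
for steps in (10, 100, 10_000):
    print(steps, euler_exp(1.0, steps))
assert abs(euler_exp(1.0, 10_000) - math.e) < 1e-3
```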
Why It Matters
This uniqueness is why $e^x$ is the natural choice for generating moment bounds. Any other function satisfying the multiplicative property $\phi(x+y) = \phi(x)\phi(y)$ with $\phi(0) = 1$ and continuous must be $\phi(x) = e^{cx}$ for some constant $c$. The exponential is not a convenience; it is the only option.
Failure Mode
The assumption $f(0) = 1$ is necessary. Without it, $f(x) = C e^x$ for any constant $C$ also satisfies $f' = f$. The zero function is also a solution but fails the initial condition.
Taylor Remainder and Approximation Bounds
Truncating the exponential series after the degree-$k$ term gives the Taylor polynomial $P_k(x) = \sum_{n=0}^{k} \frac{x^n}{n!}$. The error is controlled by the Lagrange remainder.
For $x \ge 0$, the $k$-th order Taylor polynomial satisfies:
$$0 \le e^x - P_k(x) \le \frac{x^{k+1}}{(k+1)!}\, e^x$$
For $x < 0$, the Taylor polynomial alternates above and below $e^x$, with $|e^x - P_k(x)| \le \frac{|x|^{k+1}}{(k+1)!}$.
The bound $1 + x \le e^x$ (the first-order approximation) is the most commonly used inequality in probability. The second-order bound $e^x \le 1 + x + x^2$ holds for $|x| \le 1$ and is used in proofs of Hoeffding's inequality.
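Both workhorse bounds can be spot-checked on a grid. An illustrative sketch (the second-order check deliberately stays inside $|x| \le 1$, where that bound is stated to hold):

```python
import math

for i in range(-10, 11):
    x = i / 10.0                                 # grid over [-1, 1]
    assert 1 + x <= math.exp(x)                  # first-order: all real x
    assert math.exp(x) <= 1 + x + x * x          # second-order: |x| <= 1

# The first-order bound also holds far outside [-1, 1] ...
assert 1 + 50.0 <= math.exp(50.0)
# ... with equality only at x = 0.
assert math.exp(0.0) == 1.0
```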
Connection to Moment Generating Functions
The MGF $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$ exploits every property of $e^x$ simultaneously:
- Positivity: $e^{\lambda X} > 0$, so $M_X(\lambda) > 0$ and Markov's inequality applies.
- Homomorphism: if $X$ and $Y$ are independent, $M_{X+Y}(\lambda) = M_X(\lambda)\, M_Y(\lambda)$. Sums become products.
- Own derivative: $M_X^{(k)}(0) = \mathbb{E}[X^k]$. Moments are encoded in derivatives.
- Convexity: Jensen's inequality gives $M_X(\lambda) \ge e^{\lambda \mathbb{E}[X]}$.
The Chernoff method applies Markov's inequality to $e^{\lambda X}$:
$$\Pr[X \ge t] = \Pr\big[e^{\lambda X} \ge e^{\lambda t}\big] \le e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big] \quad \text{for all } \lambda > 0$$
Optimizing over $\lambda > 0$ gives the tightest bound. This works precisely because $e^{\lambda x}$ is monotonically increasing in $x$ (for $\lambda > 0$) and non-negative. No other function family gives all these properties simultaneously.
The Chernoff method does not require bounded variables
The Chernoff bound is valid for any random variable whose MGF exists. If the MGF is infinite for all $\lambda > 0$ (as for heavy-tailed distributions like the Cauchy), the bound is trivially $+\infty$ and useless, but it does not fail. The method requires $\mathbb{E}[e^{\lambda X}] < \infty$ for some $\lambda > 0$, not boundedness.
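A concrete illustration of the method: for a sum $S$ of $n$ fair coin flips, optimizing the Chernoff bound yields the Hoeffding form $\Pr[S \ge n/2 + s] \le e^{-2s^2/n}$. The sketch below (arbitrary seed and sample sizes) compares that bound against a Monte Carlo estimate of the tail:

```python
import math
import random

random.seed(0)
n, s, trials = 100, 10, 20_000

# Monte Carlo estimate of P(S >= n/2 + s) for S = sum of n fair coins
hits = sum(
    sum(random.randint(0, 1) for _ in range(n)) >= n / 2 + s
    for _ in range(trials)
)
empirical = hits / trials

# Optimized Chernoff (Hoeffding) bound: exp(-2 s^2 / n)
chernoff = math.exp(-2 * s * s / n)

print(f"empirical tail ~ {empirical:.4f}, Chernoff bound = {chernoff:.4f}")
assert empirical <= chernoff   # the bound holds; it is valid but not tight
```

The true tail here is roughly $0.028$ while the bound gives $e^{-2} \approx 0.135$: valid for every $n$ and $s$, loose by a constant factor, which is typical of Chernoff-type bounds.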
The Exponential in Softmax and Neural Networks
The softmax function $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ converts raw logits into a probability distribution. Three properties of $e^x$ make this work.
First, strict positivity guarantees all probabilities are positive. Second, the homomorphism means adding a constant to all logits does not change the output: $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$, since $e^{z_i + c} = e^c e^{z_i}$ and the factor $e^c$ cancels in the ratio. This shift-invariance is what makes the log-sum-exp trick valid. Third, monotonicity ensures that larger logits map to larger probabilities, preserving the ranking.
Numerical softmax computation
Consider logits $z = (1000, 1001, 1002)$. Direct computation of $e^{1000}$ overflows float64 (which overflows near $e^{710}$). The log-sum-exp trick subtracts $\max_i z_i = 1002$:
$e^{1000 - 1002} = e^{-2} \approx 0.135$, $e^{1001 - 1002} = e^{-1} \approx 0.368$, $e^{1002 - 1002} = e^{0} = 1$.
Sum: $\approx 1.503$. Softmax: $\approx (0.090, 0.245, 0.665)$.
Without the trick, every intermediate computation produces $\infty$. With the trick, every intermediate value is between 0 and 1. The shift-invariance property guarantees the result is identical.
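A minimal stable softmax in pure Python, exercised on illustrative logits $(1000, 1001, 1002)$ that sit far past the float64 overflow threshold for `exp`:

```python
import math

def softmax(z):
    """Numerically stable softmax via max-subtraction (shift-invariance)."""
    m = max(z)                               # subtract the largest logit
    exps = [math.exp(v - m) for v in z]      # every value now in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])
assert math.isclose(sum(probs), 1.0)         # a valid distribution
assert probs[2] > probs[1] > probs[0]        # ranking of logits preserved
# math.exp(1000) by itself raises OverflowError; the shifted version cannot.
```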
The bound $e^x \ge 1 + x$ is the workhorse inequality for sub-Gaussian concentration. In the proof of Hoeffding's inequality, you bound $\mathbb{E}[e^{\lambda X}]$ for a bounded, zero-mean random variable $X \in [a, b]$. The argument proceeds by noting that $e^{\lambda x}$ is convex in $x$, so it lies below the chord connecting the endpoints of the interval $[a, b]$. Taking expectations and using $\mathbb{E}[X] = 0$ yields the sub-Gaussian MGF bound $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}$. Every step relies on convexity and the algebraic properties listed above.
The inequality $e^x \le 1 + x + x^2$ (for $|x| \le 1$) appears in proofs where you need an upper bound on $e^x$ near zero. For the Chernoff bound on sums of independent Bernoulli random variables, this second-order approximation converts the MGF into a Gaussian-type bound without requiring the full Taylor series.
Log-Convexity and Log-Sum-Exp
The log-sum-exp function $\mathrm{LSE}(z) = \log \sum_{i=1}^{n} e^{z_i}$ is convex. The softmax function is its gradient:
$$\frac{\partial}{\partial z_i}\, \mathrm{LSE}(z) = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_i$$
The log-sum-exp is a smooth approximation to the max function:
$$\max_i z_i \le \mathrm{LSE}(z) \le \max_i z_i + \log n$$
This approximation is tight when one component dominates.
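The sandwich bounds, and the tightening when one component dominates, can be demonstrated directly. A sketch with an arbitrary logit vector; scaling the logits by a temperature $t$ makes the largest one dominate:

```python
import math

def logsumexp(z):
    """Stable LSE(z) = log sum_i exp(z_i), via max-subtraction."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

z = [2.0, 1.0, 0.5]

# Sandwich: max(z) <= LSE(z) <= max(z) + log(n)
assert max(z) <= logsumexp(z) <= max(z) + math.log(len(z))

# Scaling by t sharpens the max approximation: (1/t) LSE(t*z) -> max(z)
for t in (1.0, 10.0, 100.0):
    print(t, logsumexp([t * v for v in z]) / t)
assert abs(logsumexp([100.0 * v for v in z]) / 100.0 - max(z)) < 0.01
```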
Common Confusions
e^x is convex, log x is concave
These are dual properties. Convexity of $e^x$ gives Jensen's inequality: $\mathbb{E}[e^X] \ge e^{\mathbb{E}[X]}$. Concavity of $\log$ gives the reverse direction: $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$. Both forms of Jensen are used constantly. The first appears in MGF bounds; the second in information theory (proving KL divergence is non-negative).
Numerical overflow in exp
Computing $e^x$ overflows float64 for $x$ beyond roughly 709. Computing $\log \sum_i e^{z_i}$ for large logits therefore requires the log-sum-exp trick: subtract $\max_i z_i$ from all components before exponentiating. This does not change the result (by the homomorphism property) but prevents overflow. Every practical softmax implementation uses this trick.
Exercises
Problem
Prove that $e^x \ge 1 + x$ for all $x \in \mathbb{R}$, with equality only at $x = 0$.
Problem
Prove that the log-sum-exp function $\mathrm{LSE}(z) = \log \sum_{i=1}^{n} e^{z_i}$ is convex.
Problem
Show that $e^x \le \frac{1}{1-x}$ for $x < 1$. When is this bound tighter than $e^x \le 1 + x + x^2$?
Problem
Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \le X \le b$ almost surely. Show that $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}$ for all $\lambda \in \mathbb{R}$. This is the key lemma for Hoeffding's inequality.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapter 8 (power series, exponential and logarithmic functions)
- Apostol, Mathematical Analysis (1974), Chapter 6 (the exponential function and related functions)
- Bartle & Sherbert, Introduction to Real Analysis (2011), Chapter 8.3 (the exponential and logarithmic functions)
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 3.1.5 (log-sum-exp convexity)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.1 (softmax and numerical stability)
- Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2 (the Cramér-Chernoff method and exponential bounds)
- Vershynin, High-Dimensional Probability (2018), Chapter 2.2 (sub-Gaussian properties and MGF bounds)
Last reviewed: April 2026