Numerical Optimization
Bayesian Optimization for Hyperparameters
Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.
Why This Matters
Training a neural network takes hours or days. Evaluating a single hyperparameter configuration requires a full training run. You cannot afford thousands of evaluations. Bayesian optimization uses a probabilistic model of the objective function to choose which configurations to evaluate next, achieving good results in far fewer evaluations than grid or random search.
The key insight: after each evaluation, you have information about the objective surface. Use that information to decide where to look next, rather than searching blindly.
Problem Setup
Hyperparameter Optimization as Black-Box Optimization
Let $f: \mathcal{X} \to \mathbb{R}$ be the objective function (e.g., validation loss as a function of hyperparameters). We want to find:
$$x^* = \arg\min_{x \in \mathcal{X}} f(x).$$
We can only evaluate $f$ at chosen points. Each evaluation is expensive (a full training run). We do not have access to gradients $\nabla f(x)$. The function may be noisy (validation loss varies across random seeds).
The Bayesian optimization loop:
- Fit a surrogate model to all observations $\{(x_i, y_i)\}_{i=1}^{n}$
- Use the surrogate to compute an acquisition function $\alpha_n(x)$ and choose $x_{n+1} = \arg\max_x \alpha_n(x)$
- Evaluate $f$ at $x_{n+1}$, observing $y_{n+1}$
- Add $(x_{n+1}, y_{n+1})$ to observations. Repeat.
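The loop above can be sketched end to end with a minimal numpy GP surrogate and the Expected Improvement acquisition (closed form given later in this section). The toy objective, RBF kernel, and fixed length-scale below are illustrative assumptions, not a production setup:

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel matrix between 1-D point sets a and b.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, jitter=1e-6):
    # GP posterior mean and variance at test points Xs given data (X, y).
    # Prior variance is 1 for this kernel; jitter keeps K invertible.
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 1e-12, None)
    return mu, var

def expected_improvement(mu, var, f_best):
    # Closed-form EI for minimization.
    sigma = np.sqrt(var)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def f(x):
    # Toy "validation loss"; true minimum is about -0.535 near x = 0.92.
    return np.sin(5 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 3)            # initial design
y = f(X)
grid = np.linspace(0, 1, 200)       # candidate pool for the acquisition
for _ in range(10):                 # the Bayesian optimization loop
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print(f"best x = {X[np.argmin(y)]:.3f}, best loss = {y.min():.3f}")
```

Thirteen total evaluations typically suffice to land near the optimum on this smooth 1-D toy, illustrating the sample efficiency argument.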
The Surrogate Model: Gaussian Process
A Gaussian process provides a posterior distribution over $f$ given observations. After observing $n$ evaluations $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$, the GP posterior gives a mean prediction and uncertainty at any point $x$:
$$f(x) \mid \mathcal{D}_n \sim \mathcal{N}\big(\mu_n(x),\ \sigma_n^2(x)\big).$$
The mean $\mu_n(x)$ is the best estimate of $f(x)$. The variance $\sigma_n^2(x)$ is high where we have few observations (uncertain regions) and low near observed points.
Acquisition Functions
The acquisition function balances exploitation (evaluate where $\mu_n(x)$ is low, i.e., the current best guess) with exploration (evaluate where $\sigma_n(x)$ is high, i.e., uncertain regions that might contain the optimum).
Expected Improvement Closed Form
Statement
The Expected Improvement acquisition function has a closed-form expression under a GP posterior. For minimization, with $f^* = \min_i y_i$ the best observed value:
$$\mathrm{EI}(x) = (f^* - \mu(x))\,\Phi(z) + \sigma(x)\,\phi(z),$$
where $z = \dfrac{f^* - \mu(x)}{\sigma(x)}$, and $\Phi$, $\phi$ are the standard normal CDF and PDF respectively. When $\sigma(x) = 0$, $\mathrm{EI}(x) = 0$.
Intuition
EI asks: "In expectation, how much better than the current best would an evaluation at $x$ be?" The first term rewards points predicted to be good ($\mu(x) < f^*$). The second term rewards uncertainty ($\sigma(x)$ large), because uncertain points might be much better than predicted. Both exploitation and exploration emerge from a single principled criterion.
Proof Sketch
$\mathrm{EI}(x) = \mathbb{E}[\max(f^* - Y, 0)]$ with $Y \sim \mathcal{N}(\mu(x), \sigma^2(x))$. Substitute $Y = \mu(x) + \sigma(x) Z$ with $Z \sim \mathcal{N}(0, 1)$ and split into two standard Gaussian integrals. The result follows from the identities $\mathbb{E}[\mathbf{1}\{Z \le z\}] = \Phi(z)$ and $\mathbb{E}[Z\,\mathbf{1}\{Z \le z\}] = -\phi(z)$.
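The closed form can be checked numerically against a Monte Carlo estimate of the defining expectation; the posterior values below are arbitrary illustrative numbers:

```python
import numpy as np
from scipy.stats import norm

mu, sigma, f_best = 0.2, 0.5, 0.3   # illustrative GP posterior at one point
z = (f_best - mu) / sigma
ei_closed = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Monte Carlo estimate of the defining expectation E[max(f_best - Y, 0)]
# with Y ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
y_samples = rng.normal(mu, sigma, 1_000_000)
ei_mc = np.maximum(f_best - y_samples, 0.0).mean()
print(f"closed form: {ei_closed:.4f}, Monte Carlo: {ei_mc:.4f}")
```

The two estimates agree to roughly three decimal places at this sample size.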
Why It Matters
EI is the most widely used acquisition function in practice. Its closed form makes it cheap to evaluate and optimize. It naturally balances exploration and exploitation without requiring a tuning parameter (unlike UCB).
Failure Mode
EI can be myopic: it optimizes the expected improvement for the next single evaluation, not for a budget of remaining evaluations. It can also get stuck in local optima of the acquisition function if the GP posterior is multimodal.
Upper Confidence Bound (UCB): choose $x_{n+1} = \arg\min_x \big[\mu_n(x) - \kappa\,\sigma_n(x)\big]$, where $\kappa > 0$ is a parameter controlling exploration. Larger $\kappa$ means more exploration. (For minimization this is technically a lower confidence bound, but the name UCB is standard.)
Probability of Improvement (PI): choose the point most likely to improve on $f^*$: $\mathrm{PI}(x) = \Phi(z)$, with the same $z$ as EI. PI is purely exploitative and tends to over-exploit in practice.
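A small numeric comparison (illustrative posterior values, minimization convention, with $\kappa$ as the UCB exploration parameter) makes PI's over-exploitation concrete: PI prefers a near-certain tiny improvement, while EI prefers an uncertain point with larger expected payoff:

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, f_best, kappa=2.0):
    # EI, lower confidence bound (UCB for minimization), and PI
    # at a point with posterior mean mu and std sigma.
    z = (f_best - mu) / sigma
    ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    lcb = mu - kappa * sigma   # smaller is more attractive
    pi = norm.cdf(z)
    return ei, lcb, pi

f_best = 0.0
# Point A: predicted slightly better than f_best, almost no uncertainty.
ei_a, lcb_a, pi_a = acquisitions(mu=-0.05, sigma=0.01, f_best=f_best)
# Point B: predicted slightly worse, but highly uncertain.
ei_b, lcb_b, pi_b = acquisitions(mu=0.05, sigma=0.5, f_best=f_best)

print(pi_a > pi_b)     # PI picks the safe point A
print(ei_b > ei_a)     # EI picks the uncertain point B
print(lcb_b < lcb_a)   # UCB (kappa = 2) also picks B
```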
Regret Bounds
GP-UCB Cumulative Regret Bound
Statement
Let $f$ be a function in the RKHS of kernel $k$ with RKHS norm at most $B$. Running GP-UCB with an appropriate confidence-width schedule $\beta_t$ for $T$ rounds, the cumulative regret satisfies, with probability at least $1 - \delta$:
$$R_T = \sum_{t=1}^{T} \big(f(x_t) - f(x^*)\big) = \mathcal{O}\big(\sqrt{T\,\beta_T\,\gamma_T}\big),$$
where $\gamma_T$ is the maximum information gain after $T$ observations, which depends on the kernel. For the squared exponential kernel in $d$ dimensions, $\gamma_T = \mathcal{O}\big((\log T)^{d+1}\big)$.
Intuition
The regret grows sublinearly in $T$: the average regret per evaluation goes to zero. The rate depends on the kernel's information gain $\gamma_T$, which measures how quickly the GP learns about $f$. Smoother functions (smaller $\gamma_T$) are easier to optimize.
Proof Sketch
At each round, the optimism of the UCB rule ensures that $f(x^*) \ge \mu_{t-1}(x_t) - \sqrt{\beta_t}\,\sigma_{t-1}(x_t)$ with high probability (minimization convention). The regret at round $t$ is therefore bounded by $r_t \le 2\sqrt{\beta_t}\,\sigma_{t-1}(x_t)$. Summing and applying Cauchy-Schwarz gives $R_T \le \sqrt{T \sum_{t=1}^T r_t^2} = \mathcal{O}\big(\sqrt{T\,\beta_T\,\gamma_T}\big)$. The sum of predictive variances is bounded by the information gain: $\sum_{t=1}^T \sigma_{t-1}^2(x_t) = \mathcal{O}(\gamma_T)$.
Why It Matters
This provides a theoretical guarantee for Bayesian optimization. It shows that GP-UCB finds near-optimal hyperparameters with a number of evaluations that depends primarily on the smoothness of the objective (via $\gamma_T$), rather than exponentially on the dimensionality of the search space, as grid search does.
Failure Mode
The bound requires $f$ to lie in the RKHS of the chosen kernel. If the kernel is misspecified (e.g., using a smooth SE kernel for a non-smooth objective), the bound does not hold. The bound also grows with input dimension $d$ through $\gamma_T$, making Bayesian optimization less efficient for high-dimensional search spaces (typically $d \gtrsim 20$).
Tree-Structured Parzen Estimator (TPE)
TPE (Bergstra et al., 2011) models $p(x \mid y)$ instead of $p(y \mid x)$. It splits observed hyperparameters into "good" ($y < y^*$) and "bad" ($y \ge y^*$) groups, where $y^*$ is a quantile of the observed losses, fits kernel density estimators $\ell(x)$ and $g(x)$ to each, and maximizes the ratio $\ell(x)/g(x)$.
Advantage over GP: TPE handles categorical and conditional hyperparameters naturally (e.g., "use Adam" vs "use SGD", where Adam has parameters such as $\beta_2$ that SGD does not).
Disadvantage: TPE lacks the theoretical guarantees of GP-UCB. The density estimation can be poor with few observations.
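A minimal TPE-style step on synthetic observations, assuming the common choice of splitting at the 25% loss quantile; `scipy.stats.gaussian_kde` stands in for TPE's Parzen estimators:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Pretend evaluations: loss is lowest near log_lr = -3.
log_lr = rng.uniform(-6, 0, 40)
loss = (log_lr + 3) ** 2 + rng.normal(0, 0.3, 40)

# Split at a quantile of the observed losses (25% is a common choice).
thresh = np.quantile(loss, 0.25)
good, bad = log_lr[loss < thresh], log_lr[loss >= thresh]

l_x = gaussian_kde(good)   # density of "good" configurations
g_x = gaussian_kde(bad)    # density of "bad" configurations

# Propose the candidate maximizing the density ratio l(x) / g(x).
cand = np.linspace(-6, 0, 200)
x_next = cand[np.argmax(l_x(cand) / g_x(cand))]
print(f"next log_lr to try: {x_next:.2f}")
```

The proposed point lands near the low-loss region, because good observations concentrate there while bad ones are spread out.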
Why Grid Search is Wasteful
Grid search evaluates a regular grid over the hyperparameter space. For $k$ hyperparameters with $m$ values each, it requires $m^k$ evaluations. With $k = 5$ and $m = 10$, this is $10^5 = 100{,}000$ evaluations.
Bergstra & Bengio (2012) showed that random search outperforms grid search in the same budget because: (1) in most problems, only a few hyperparameters matter, and (2) grid search wastes evaluations exploring unimportant dimensions. Random search projects uniformly onto each dimension, giving better coverage of the important ones.
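The coverage argument can be seen directly: with a comparable budget, grid search tries only $m$ distinct values per hyperparameter, while random search tries a fresh value in every dimension at every evaluation (the toy sizes below are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
k, m = 3, 3                         # 3 hyperparameters, 3 values each
grid = np.array(list(product(np.linspace(0, 1, m), repeat=k)))  # 27 configs
rand = rng.uniform(0, 1, (30, k))   # 30 random configs, similar budget

grid_cov = [len(np.unique(grid[:, j])) for j in range(k)]
rand_cov = [len(np.unique(rand[:, j])) for j in range(k)]
print("distinct values per dimension, grid:  ", grid_cov)   # [3, 3, 3]
print("distinct values per dimension, random:", rand_cov)   # [30, 30, 30]
```

If only one of the three dimensions matters, random search has probed it at 30 points while grid search has probed it at 3.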
Bayesian optimization further improves on random search by using past results to guide future evaluations.
Common Confusions
Bayesian optimization is not Bayesian hyperparameter inference
Bayesian optimization uses a Bayesian model (the GP) as a tool for optimization. It does not compute a posterior over hyperparameters. The GP models the objective function $f$, not the hyperparameter distribution. The output is a point estimate $\hat{x} \approx x^*$, not a posterior.
GP-UCB and EI are not equivalent
Both balance exploration and exploitation, but differently. UCB has an explicit exploration parameter $\kappa$ (or a schedule $\beta_t$) that must be set. EI has no explicit parameter but can be myopic. They lead to different evaluation sequences and different theoretical guarantees.
Canonical Examples
Tuning learning rate and weight decay
Objective: validation loss of a ResNet-50 as a function of learning rate (log scale) and weight decay (log scale). The surrogate is a GP with a Matern-5/2 kernel over the log-transformed space. After 20 evaluations (each a full ImageNet training run), Bayesian optimization typically finds a configuration within 0.5% of the best found by 200 random-search evaluations.
Summary
- Bayesian optimization uses a GP surrogate to model the objective function
- Acquisition functions (EI, UCB, PI) decide where to evaluate next
- EI has a closed form and balances exploration/exploitation without tuning
- GP-UCB has sublinear regret guarantees depending on kernel smoothness
- TPE handles categorical hyperparameters but lacks theoretical guarantees
- Grid search is exponential in dimension; random search is a better baseline
- Bayesian optimization is most valuable when each evaluation is expensive
Exercises
Problem
You have a budget of 30 evaluations of a function with 3 continuous hyperparameters. Compare the number of distinct values each hyperparameter gets under grid search (3 values each, giving $3^3 = 27$ evaluations) versus random search (30 random points). Which gives better coverage per dimension?
Problem
Derive the Expected Improvement in the limit $\sigma(x) \to 0$ and at a point where $\mu(x) = f^*$. Interpret both cases.
References
Canonical:
- Srinivas et al., "Gaussian Process Optimization in the Bandit Setting" (GP-UCB, 2010), Sections 1-4
- Bergstra & Bengio, "Random Search for Hyper-Parameter Optimization" (2012)
Current:
- Frazier, "A Tutorial on Bayesian Optimization" (2018)
- Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (TPE, 2011)
- Shahriari et al., "Taking the Human Out of the Loop" (2016), Sections 2-5
Next Topics
- This topic builds on Gaussian processes for the surrogate model theory
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Gaussian Processes for Machine Learning (Layer 4)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)