
Numerical Optimization

Bayesian Optimization for Hyperparameters

Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.


Why This Matters

Training a neural network takes hours or days. Evaluating a single hyperparameter configuration requires a full training run. You cannot afford thousands of evaluations. Bayesian optimization uses a probabilistic model of the objective function to choose which configurations to evaluate next, achieving good results in far fewer evaluations than grid or random search.

The key insight: after each evaluation, you have information about the objective surface. Use that information to decide where to look next, rather than searching blindly.

Problem Setup

Definition

Hyperparameter Optimization as Black-Box Optimization

Let $f: \mathcal{X} \to \mathbb{R}$ be the objective function (e.g., validation loss as a function of hyperparameters). We want to find:

$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$

We can only evaluate $f$ at chosen points. Each evaluation is expensive (a full training run). We do not have access to $\nabla f$. The function may be noisy (validation loss varies across random seeds).

The Bayesian optimization loop:

  1. Fit a surrogate model to all observations $\{(x_i, y_i)\}_{i=1}^t$
  2. Use the surrogate to compute an acquisition function $\alpha(x)$
  3. Evaluate $f$ at $x_{t+1} = \arg\max_x \alpha(x)$
  4. Add $(x_{t+1}, y_{t+1})$ to the observations. Repeat.
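The loop above can be sketched end to end in plain NumPy on a 1-D toy objective. This is an illustrative sketch, not any library's API: the GP (RBF kernel with a fixed length scale, observations centered to a constant prior mean) and the EI acquisition are hand-rolled, and a real tuner would also fit the kernel hyperparameters and optimize the acquisition more carefully than a fixed candidate grid.

```python
import numpy as np
from scipy.stats import norm

def objective(x):
    # Toy stand-in for "validation loss"; true minimizer is x = 2.
    return (x - 2.0) ** 2

def rbf(a, b, ell=1.0):
    # Squared exponential kernel with unit prior variance.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # GP regression with a constant prior mean equal to mean(y).
    ym = y.mean()
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = ym + Ks.T @ np.linalg.solve(K, y - ym)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=3)     # initial design
y = objective(X)
grid = np.linspace(0.0, 5.0, 200)     # candidate points

for t in range(12):
    mu, sigma = gp_posterior(X, y, grid)            # 1. fit surrogate
    ei = expected_improvement(mu, sigma, y.min())   # 2. acquisition
    x_next = grid[np.argmax(ei)]                    # 3. maximize EI
    X = np.append(X, x_next)                        # 4. evaluate and repeat
    y = np.append(y, objective(x_next))

print(X[np.argmin(y)])  # close to the true minimizer 2.0
```

Note that the expensive call (`objective`) happens once per iteration, while the surrogate and acquisition are evaluated cheaply over all 200 candidates.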

The Surrogate Model: Gaussian Process

A Gaussian process provides a posterior distribution over $f$ given observations. After observing $t$ evaluations, the GP posterior gives a mean prediction $\mu_t(x)$ and uncertainty $\sigma_t(x)$ at any point $x$:

$$f(x) \mid \mathcal{D}_t \sim \mathcal{N}(\mu_t(x), \sigma_t^2(x))$$

The mean $\mu_t(x)$ is the best estimate of $f(x)$. The variance $\sigma_t^2(x)$ is high where we have few observations (uncertain regions) and low near observed points.
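This behavior is easy to check numerically (a sketch in plain NumPy, not tied to any GP library): with a zero-mean GP and an RBF kernel of unit prior variance, the posterior standard deviation collapses at an observed point and returns to the prior far from the data.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(X, y, Xq, noise=1e-8):
    # Standard zero-mean GP regression equations:
    #   mu(x)  = k(x, X) (K + noise I)^{-1} y
    #   var(x) = k(x, x) - k(x, X) (K + noise I)^{-1} k(X, x)
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

X = np.array([0.0, 1.0, 2.0])    # observed inputs
y = np.array([0.5, -0.3, 0.1])   # observed values
mu, sigma = gp_posterior(X, y, np.array([1.0, 10.0]))

print(sigma)  # near 0 at x=1 (observed), near 1 (the prior) at x=10
print(mu)     # reproduces y=-0.3 at x=1, reverts to the prior mean 0 at x=10
```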

Acquisition Functions

The acquisition function balances exploitation (evaluate where $\mu_t(x)$ is low, i.e., the current best guess) with exploration (evaluate where $\sigma_t(x)$ is high, i.e., uncertain regions that might contain the optimum).

Proposition

Expected Improvement Closed Form

Statement

The Expected Improvement acquisition function has a closed-form expression under a GP posterior:

$$\text{EI}(x) = \mathbb{E}[\max(f_{\text{best}} - f(x), 0)] = (f_{\text{best}} - \mu_t(x))\,\Phi(Z) + \sigma_t(x)\,\phi(Z)$$

where $Z = (f_{\text{best}} - \mu_t(x))/\sigma_t(x)$, and $\Phi$, $\phi$ are the standard normal CDF and PDF respectively. When $\sigma_t(x) = 0$, $\text{EI}(x) = 0$.

Intuition

EI asks: "In expectation, how much better than the current best would an evaluation at $x$ be?" The first term rewards points predicted to be good ($\mu_t(x) < f_{\text{best}}$). The second term rewards uncertainty (large $\sigma_t(x)$), because uncertain points might be much better than predicted. Both exploitation and exploration emerge from a single principled criterion.

Proof Sketch

$$\text{EI}(x) = \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - f)\,\mathcal{N}(f; \mu_t(x), \sigma_t^2(x))\, df$$ Substitute $z = (f - \mu_t(x))/\sigma_t(x)$ and split into two standard Gaussian integrals. The result follows from the identities $\int_{-\infty}^a z\phi(z)\,dz = -\phi(a)$ and $\int_{-\infty}^a \phi(z)\,dz = \Phi(a)$.

Why It Matters

EI is the most widely used acquisition function in practice. Its closed form makes it cheap to evaluate and optimize. It naturally balances exploration and exploitation without requiring a tuning parameter (unlike UCB).

Failure Mode

EI can be myopic: it optimizes the expected improvement for the next single evaluation, not for a budget of $B$ remaining evaluations. It can also get stuck in local optima of the acquisition function if the GP posterior is multimodal.
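The closed form can be sanity-checked against a brute-force Monte Carlo estimate of $\mathbb{E}[\max(f_{\text{best}} - f(x), 0)]$ under the Gaussian posterior. A sketch with assumed posterior values ($\mu_t(x) = 0.8$, $\sigma_t(x) = 0.5$, $f_{\text{best}} = 1.0$):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI; defined as 0 when sigma = 0.
    if sigma == 0.0:
        return 0.0
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

mu, sigma, f_best = 0.8, 0.5, 1.0   # assumed GP posterior at a candidate x

ei_closed = expected_improvement(mu, sigma, f_best)

# Monte Carlo estimate of E[max(f_best - f, 0)] with f ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
f_samples = rng.normal(mu, sigma, size=1_000_000)
ei_mc = np.maximum(f_best - f_samples, 0.0).mean()

print(ei_closed, ei_mc)  # the two estimates agree to about three decimals
```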

Upper Confidence Bound (UCB): choose $x_{t+1} = \arg\min_x \left(\mu_t(x) - \beta_t \sigma_t(x)\right)$, where $\beta_t$ is a parameter controlling exploration. Larger $\beta_t$ means more exploration. (For a minimization objective this is really a lower confidence bound, but the name UCB is standard.)

Probability of Improvement (PI): choose the point most likely to improve on $f_{\text{best}}$: $\text{PI}(x) = \Phi(Z)$, with the same $Z$ as in EI. PI rewards any improvement regardless of its magnitude, so it tends to over-exploit in practice.
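A small side-by-side sketch of the three rules on the same (assumed) posterior over three candidate points makes the differences concrete: with a large $\beta_t$ the UCB rule chases uncertainty, PI picks the safe small improvement, and EI weighs both.

```python
import numpy as np
from scipy.stats import norm

mu = np.array([1.0, 0.6, 0.9])       # posterior means at 3 candidates
sigma = np.array([0.01, 0.05, 0.8])  # candidate 2 is nearly unexplored
f_best = 0.7                         # best observed value so far

z = (f_best - mu) / sigma
pi = norm.cdf(z)                                        # PI
ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # EI

lcb_greedy = mu - 0.1 * sigma    # UCB rule, small beta: exploit
lcb_explore = mu - 3.0 * sigma   # UCB rule, large beta: explore

print(np.argmin(lcb_greedy), np.argmin(lcb_explore))  # 1 then 2
print(np.argmax(pi), np.argmax(ei))                   # PI picks 1, EI picks 2
```

Here PI chooses the near-certain tiny improvement (candidate 1), while EI prefers the uncertain candidate 2, whose large $\sigma$ makes a big improvement plausible.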

Regret Bounds

Theorem

GP-UCB Cumulative Regret Bound

Statement

Let $f$ be a function in the RKHS of kernel $k$ with RKHS norm at most $B$. Running GP-UCB with $\beta_t = B + 4\sigma\sqrt{2(\gamma_t + 1 + \ln(1/\delta))}$ for $T$ rounds, the cumulative regret satisfies, with probability at least $1 - \delta$:

$$R_T = \sum_{t=1}^T \left(f(x_t) - f(x^*)\right) \leq O\!\left(\sqrt{T \beta_T \gamma_T}\right)$$

where $\gamma_T$ is the maximum information gain after $T$ observations, which depends on the kernel. For the squared exponential kernel in $d$ dimensions, $\gamma_T = O\!\left((\log T)^{d+1}\right)$.

Intuition

The regret grows sublinearly in $T$: the average regret per evaluation goes to zero. The rate depends on the kernel's information gain $\gamma_T$, which measures how quickly the GP learns about $f$. Smoother functions (smaller $\gamma_T$) are easier to optimize.

Proof Sketch

At each round, the UCB acquisition function ensures that $f(x_t) \leq \mu_{t-1}(x_t) + \beta_t \sigma_{t-1}(x_t)$ with high probability. The regret at round $t$ is bounded by $2\beta_t \sigma_{t-1}(x_t)$. Summing and applying Cauchy-Schwarz gives $R_T \leq 2\beta_T \sqrt{T \sum_t \sigma_{t-1}^2(x_t)}$. The sum of predictive variances is bounded by the information gain $\gamma_T$.

Why It Matters

This provides a theoretical guarantee for Bayesian optimization. It shows that GP-UCB finds near-optimal hyperparameters with a number of evaluations that depends on the smoothness of the objective (via $\gamma_T$) rather than on the dimensionality of the search space (on which grid search depends exponentially).

Failure Mode

The bound requires $f$ to lie in the RKHS of the chosen kernel. If the kernel is misspecified (e.g., using a smooth SE kernel for a non-smooth objective), the bound does not hold. The bound also grows with input dimension $d$ through $\gamma_T$, making Bayesian optimization less efficient for high-dimensional search spaces (typically $d > 20$).

Tree-Structured Parzen Estimator (TPE)

TPE (Bergstra et al., 2011) models $p(x \mid y)$ instead of $p(y \mid x)$. It splits the observed hyperparameters into a "good" group ($y < y^*$) and a "bad" group ($y \geq y^*$), fits kernel density estimators $l(x)$ and $g(x)$ to each, and proposes the candidate maximizing the ratio $l(x)/g(x)$.
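The split-and-ratio rule can be sketched with SciPy's `gaussian_kde` standing in for TPE's adaptive Parzen estimators. Real TPE also handles tree-structured, conditional spaces, so this shows only the core idea on a single continuous hyperparameter:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)        # observed hyperparameter values
y = (x - 3.0) ** 2 + rng.normal(0, 1, 200)  # noisy losses, best near x = 3

y_star = np.quantile(y, 0.25)      # split at the gamma = 0.25 quantile
l = gaussian_kde(x[y < y_star])    # density of "good" configurations
g = gaussian_kde(x[y >= y_star])   # density of "bad" configurations

candidates = np.array([3.0, 9.0])
ratio = l(candidates) / g(candidates)
print(ratio)  # the candidate near the optimum gets the much larger ratio
```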

Advantage over GP: TPE handles categorical and conditional hyperparameters naturally (e.g., "use Adam" vs. "use SGD", where Adam has a parameter $\beta_1$ that SGD does not).

Disadvantage: TPE lacks the theoretical guarantees of GP-UCB. The density estimation can be poor with few observations.

Why Grid Search is Wasteful

Grid search evaluates a regular grid over the hyperparameter space. For $d$ hyperparameters with $m$ values each, it requires $m^d$ evaluations. With $d = 5$ and $m = 10$, this is 100,000 evaluations.

Bergstra & Bengio (2012) showed that random search outperforms grid search in the same budget because: (1) in most problems, only a few hyperparameters matter, and (2) grid search wastes evaluations exploring unimportant dimensions. Random search projects uniformly onto each dimension, giving better coverage of the important ones.
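This coverage argument is easy to verify numerically. A sketch with a budget of $27 = 3^3$ evaluations in three dimensions:

```python
import numpy as np

d, m = 3, 3   # 3 hyperparameters, 3 grid values each -> 27 evaluations

# Grid search: the Cartesian product of m values per dimension.
axes = [np.linspace(0.0, 1.0, m)] * d
grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)

# Random search: the same budget of points drawn uniformly.
rng = np.random.default_rng(0)
rand = rng.uniform(0.0, 1.0, size=(m**d, d))

grid_distinct = len(np.unique(grid[:, 0]))   # values tried in dimension 0
rand_distinct = len(np.unique(rand[:, 0]))
print(grid_distinct, rand_distinct)  # 3 vs 27
```

If only dimension 0 matters, grid search effectively probed the objective at 3 points, random search at 27, for the same cost.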

Bayesian optimization further improves on random search by using past results to guide future evaluations.

Common Confusions

Watch Out

Bayesian optimization is not Bayesian hyperparameter inference

Bayesian optimization uses a Bayesian model (the GP) as a tool for optimization. It does not compute a posterior over hyperparameters. The GP models the objective function $f$, not the hyperparameter distribution. The output is a point estimate $x^*$, not a posterior.

Watch Out

GP-UCB and EI are not equivalent

Both balance exploration and exploitation, but differently. UCB has an explicit exploration parameter $\beta_t$ that must be set. EI has no explicit parameter but can be myopic. They lead to different evaluation sequences and different theoretical guarantees.

Canonical Examples

Example

Tuning learning rate and weight decay

Objective: validation loss of a ResNet-50 as a function of learning rate $\eta \in [10^{-5}, 10^{-1}]$ (log scale) and weight decay $\lambda \in [10^{-6}, 10^{-2}]$ (log scale). A GP with a Matérn-5/2 kernel is fit over the log-transformed space. After 20 evaluations (each a full ImageNet training run), Bayesian optimization typically finds a configuration within 0.5% of the best found by 200 random search evaluations.
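One detail of this example worth making concrete is the log-scale parameterization: sampling (or modeling) the exponent uniformly, rather than the raw value, spreads the search budget evenly across decades. A sketch, not tied to any tuning library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly on [log10(1e-5), log10(1e-1)], then
# exponentiate; each decade of learning rates gets equal probability.
log_lr = rng.uniform(np.log10(1e-5), np.log10(1e-1), size=10_000)
lr = 10.0 ** log_lr

# Sampling the raw value uniformly on [1e-5, 1e-1] instead would put
# about 90% of the budget in the top decade [1e-2, 1e-1] alone.
frac_bottom = np.mean(lr < 1e-4)   # bottom decade [1e-5, 1e-4]
print(frac_bottom)  # roughly 0.25
```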

Summary

  • Bayesian optimization uses a GP surrogate to model the objective function
  • Acquisition functions (EI, UCB, PI) decide where to evaluate next
  • EI has a closed form and balances exploration/exploitation without tuning
  • GP-UCB has sublinear regret guarantees depending on kernel smoothness
  • TPE handles categorical hyperparameters but lacks theoretical guarantees
  • Grid search is exponential in dimension; random search is a better baseline
  • Bayesian optimization is most valuable when each evaluation is expensive

Exercises

Exercise (Core)

Problem

You have budget for 30 evaluations of a function with 3 continuous hyperparameters. Compare the number of distinct values each hyperparameter gets under grid search ($\lfloor 30^{1/3} \rfloor = 3$ values each, giving 27 evaluations) versus random search (30 random points). Which gives better coverage per dimension?

Exercise (Advanced)

Problem

Derive the Expected Improvement when $\sigma_t(x) \to 0$ and when $\mu_t(x) = f_{\text{best}}$. Interpret both cases.

References

Canonical:

  • Srinivas et al., "Gaussian Process Optimization in the Bandit Setting" (GP-UCB, 2010), Sections 1-4
  • Bergstra & Bengio, "Random Search for Hyper-Parameter Optimization" (2012)

Current:

  • Frazier, "A Tutorial on Bayesian Optimization" (2018)
  • Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (TPE, 2011)
  • Shahriari et al., "Taking the Human Out of the Loop" (2016), Sections 2-5

Next Topics

  • This topic builds on Gaussian processes for the surrogate model theory

Last reviewed: April 2026
