
RL Theory

Multi-Armed Bandits Theory

The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.


Why This Matters

The multi-armed bandit problem is the simplest formalization of the exploration-exploitation tradeoff. You have K slot machines. Each pull gives a random reward from an unknown distribution. You want to maximize total reward over T pulls. Pulling the best arm every time is optimal, but you do not know which arm is best. Every pull of a suboptimal arm costs you, but every pull of a not-yet-explored arm teaches you something.

Bandits sit at the foundation of reinforcement learning, clinical trial design, and adaptive A/B testing. Understanding the regret framework here is a prerequisite for understanding regret in full MDPs.

Formal Setup

Definition

K-Armed Bandit

A stochastic K-armed bandit consists of K arms, each with an unknown reward distribution \nu_i with mean \mu_i. At each round t = 1, \ldots, T, the learner selects arm A_t \in \{1, \ldots, K\} and observes reward X_t \sim \nu_{A_t}, independent of past rewards given the arm choice.

Let \mu^* = \max_i \mu_i be the mean of the best arm. Let \Delta_i = \mu^* - \mu_i be the gap for arm i.

Definition

Cumulative Regret

The cumulative regret after T rounds is:

R_T = T \mu^* - \sum_{t=1}^{T} \mu_{A_t} = \sum_{t=1}^{T} \Delta_{A_t}

Equivalently, \mathbb{E}[R_T] = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[N_i(T)], where N_i(T) is the number of times arm i is pulled in the first T rounds.

Regret measures the total cost of not knowing the best arm from the start. An algorithm with sublinear regret (R_T / T \to 0) learns the best arm eventually.
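The two forms of the regret are easy to check numerically. A short sketch (the arm means and pull sequence below are illustrative, not from the text):

```python
# Check the two equivalent regret formulas on a hypothetical 3-arm instance.
mu = [0.5, 0.7, 0.4]           # arm means; arm 1 is best, so mu* = 0.7
mu_star = max(mu)
pulls = [0, 1, 1, 2, 1, 0, 1]  # a fixed sequence of arm choices A_1..A_7

# Per-round form: sum of gaps Delta_{A_t}.
per_round = sum(mu_star - mu[a] for a in pulls)

# Per-arm form: sum over arms of Delta_i * N_i(T).
counts = [pulls.count(i) for i in range(len(mu))]
per_arm = sum((mu_star - mu[i]) * counts[i] for i in range(len(mu)))

assert abs(per_round - per_arm) < 1e-12
print(round(per_round, 10))  # 0.7
```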

The UCB Algorithm

The Upper Confidence Bound (UCB1) algorithm balances exploration and exploitation by maintaining a confidence interval for each arm's mean.

At round t, pull the arm maximizing:

A_t = \arg\max_{i \in \{1,\ldots,K\}} \left( \hat{\mu}_i(t) + \sqrt{\frac{2 \ln t}{N_i(t)}} \right)

where \hat{\mu}_i(t) is the sample mean of arm i and N_i(t) is its pull count so far. The second term is the exploration bonus: arms pulled less frequently have wider confidence intervals and get explored more.
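A minimal UCB1 loop can be sketched as follows; the Bernoulli reward distributions, arm means, and seed are illustrative assumptions, not part of the setup above:

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds; pull(i) returns a stochastic reward for arm i."""
    counts = [0] * K      # N_i(t): number of pulls of each arm so far
    sums = [0.0] * K      # running reward sums, for the sample means
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1   # initialization: pull each arm once
        else:
            # index = sample mean + exploration bonus sqrt(2 ln t / N_i)
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts

# Illustrative 3-armed Bernoulli instance with means 0.3, 0.5, 0.7.
rng = random.Random(0)
means = [0.3, 0.5, 0.7]
counts = ucb1(lambda i: 1.0 if rng.random() < means[i] else 0.0, K=3, T=2000)
# With 2000 rounds, the best arm (index 2) should dominate the pull counts.
```

The deterministic arm choice (given the reward history) is worth noting: unlike Thompson sampling below, UCB1 involves no randomization of its own.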

Main Theorems

Theorem

UCB1 Regret Bound

Statement

The UCB1 algorithm achieves expected cumulative regret bounded by:

\mathbb{E}[R_T] \leq \sum_{i: \Delta_i > 0} \left( \frac{8 \ln T}{\Delta_i} + (1 + \pi^2/3) \Delta_i \right)

This is O\!\left(\sum_i \frac{\ln T}{\Delta_i}\right) for fixed gaps. In the worst case over gap configurations, this gives O(\sqrt{KT \ln T}).

Intuition

The bound says UCB pulls each suboptimal arm roughly \ln T / \Delta_i^2 times. Arms with small gaps are hard to distinguish from the best arm, so they get pulled more. The total cost is the number of suboptimal pulls times their gap. The logarithmic dependence on T means regret grows very slowly.

Proof Sketch

Fix a suboptimal arm i. After s pulls of arm i, the confidence width is \sqrt{2 \ln t / s}. Once s > 8 \ln T / \Delta_i^2, the upper confidence bound for arm i falls below \mu^* with high probability. So arm i is pulled at most O(\ln T / \Delta_i^2) times. The additional \Delta_i terms handle the low-probability tail events.
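The threshold in the sketch can be made explicit. Once s \geq 8 \ln T / \Delta_i^2 (and t \leq T), the confidence width obeys

\sqrt{\frac{2 \ln t}{s}} \leq \sqrt{\frac{2 \ln T \, \Delta_i^2}{8 \ln T}} = \frac{\Delta_i}{2}

so on the event that every confidence interval holds, arm i's index satisfies \hat{\mu}_i(t) + \sqrt{2 \ln t / s} \leq \mu_i + 2 \cdot \frac{\Delta_i}{2} = \mu^*, while the best arm's index stays at or above \mu^*.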

Why It Matters

UCB shows that a simple, deterministic algorithm achieves near-optimal regret without any Bayesian assumptions. The regret is logarithmic in T, matching the fundamental lower bound up to constants.

Failure Mode

The bound depends on 1/\Delta_i for each arm. When many arms have gaps close to zero, the bound becomes very large. In adversarial (non-stochastic) settings, UCB1 can suffer linear regret. Use EXP3 for adversarial bandits instead.

Theorem

Lai-Robbins Lower Bound

Statement

For any policy that achieves o(T^a) regret on every bandit instance, for every a > 0, the expected number of pulls of each suboptimal arm i satisfies:

\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\ln T} \geq \frac{1}{\mathrm{KL}(\nu_i \| \nu^*)}

where \mathrm{KL}(\nu_i \| \nu^*) is the KL divergence from the suboptimal arm's distribution to the best arm's distribution.

Intuition

You must pull each suboptimal arm at least \Omega(\ln T) times, and the constant depends on how hard it is to distinguish that arm from the best arm (measured by KL divergence). Arms that look similar to the best arm require more exploration.
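For Bernoulli arms the divergence is explicit, \mathrm{KL}(\text{Bern}(p) \| \text{Bern}(q)) = p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}, and a short computation (the arm means here are hypothetical) shows why near-optimal arms force the most exploration:

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Best arm mean 0.5; one clearly worse arm (0.1), one nearly optimal arm (0.45).
far = kl_bernoulli(0.1, 0.5)
near = kl_bernoulli(0.45, 0.5)
assert near < far   # the near arm is harder to distinguish from the best

# Lai-Robbins: each suboptimal arm must be pulled at least ~ ln(T) / KL times.
T = 10_000
print(round(math.log(T) / far, 1), round(math.log(T) / near, 1))
# ~25 forced pulls for the far arm vs ~1840 for the near arm.
```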

Proof Sketch

Change-of-measure argument. If arm ii were actually the best arm (swap the means), a consistent policy would need to pull arm ii many times. The KL divergence quantifies how many samples are needed to distinguish the two cases.

Why It Matters

This is the fundamental information-theoretic lower bound for bandits. It shows that UCB-type algorithms (with \ln T regret) are rate-optimal. Multiplying the pull counts by the gaps, no consistent algorithm can achieve regret below \Omega\!\left(\sum_i \Delta_i \ln T / \mathrm{KL}(\nu_i \| \nu^*)\right).

Failure Mode

The bound assumes stochastic, stationary rewards from exponential families. For adversarial or non-stationary environments, different lower bounds apply (minimax regret \Omega(\sqrt{KT}) for adversarial bandits).

Thompson Sampling

Thompson sampling takes a Bayesian approach. Maintain a posterior P(\mu_i | \text{data}) for each arm's mean. At each round, sample \tilde{\mu}_i from the posterior for each arm and pull A_t = \arg\max_i \tilde{\mu}_i.

For Bernoulli rewards with a Beta prior: start with \text{Beta}(1,1) for each arm. A reward of 1 increments the arm's \alpha parameter; a reward of 0 increments its \beta. Sample from each posterior and pull the arm with the highest sample.
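The Beta-Bernoulli update above fits in a few lines; a minimal sketch, where the arm means and seeds are illustrative assumptions:

```python
import random

def thompson(pull, K, T, seed=0):
    """Beta-Bernoulli Thompson sampling; pull(i) returns a 0/1 reward for arm i."""
    rng = random.Random(seed)
    alpha = [1] * K   # Beta alpha: prior 1 + observed successes
    beta = [1] * K    # Beta beta:  prior 1 + observed failures
    counts = [0] * K
    for _ in range(T):
        # Draw one posterior sample per arm; play the argmax.
        arm = max(range(K), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        if pull(arm):
            alpha[arm] += 1
        else:
            beta[arm] += 1
        counts[arm] += 1
    return counts

# Illustrative Bernoulli arms with means 0.3, 0.5, 0.7.
reward_rng = random.Random(1)
means = [0.3, 0.5, 0.7]
counts = thompson(lambda i: reward_rng.random() < means[i], K=3, T=2000)
# The best arm (index 2) should receive the bulk of the pulls.
```

Note that exploration happens automatically: an arm with few observations has a wide posterior, so its samples occasionally come out on top.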

Thompson sampling also achieves the Lai-Robbins lower bound asymptotically. Empirically, it often outperforms UCB because its exploration is naturally calibrated to the posterior uncertainty.

Contextual Bandits

Definition

Contextual Bandit

At each round t, the learner observes a context x_t \in \mathcal{X}, then selects arm a_t \in \{1, \ldots, K\}, and observes reward r_t \sim \nu(x_t, a_t). The optimal policy is \pi^*(x) = \arg\max_a \mathbb{E}[r | x, a].

Contextual bandits generalize the basic setting by conditioning arm rewards on side information. This is the formal model for personalized A/B testing: x_t is the user's features, arms are treatment variants, and the goal is to learn the best variant for each user type.

LinUCB assumes \mathbb{E}[r | x, a] = x^\top \theta_a and maintains confidence ellipsoids for each \theta_a. It achieves regret \tilde{O}(d\sqrt{T}) where d is the context dimension.
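A sketch of disjoint-model LinUCB under these assumptions; the 2-d contexts, true weight vectors, noise level, and \alpha value below are illustrative, not canonical:

```python
import numpy as np

def linucb(contexts, pull, K, alpha=0.5):
    """Disjoint LinUCB: E[r | x, a] = x^T theta_a, one ridge model per arm."""
    d = contexts.shape[1]
    A = [np.eye(d) for _ in range(K)]   # per-arm Gram matrix (ridge regularized)
    b = [np.zeros(d) for _ in range(K)]
    chosen = []
    for x in contexts:
        scores = []
        for a in range(K):
            Ainv = np.linalg.inv(A[a])
            theta = Ainv @ b[a]
            # Point estimate plus confidence-ellipsoid width along x.
            scores.append(x @ theta + alpha * np.sqrt(x @ Ainv @ x))
        a = int(np.argmax(scores))
        r = pull(x, a)
        A[a] += np.outer(x, x)
        b[a] += r * x
        chosen.append(a)
    thetas = [np.linalg.inv(A[a]) @ b[a] for a in range(K)]
    return chosen, thetas

# Illustrative instance: arm 0 rewards feature 0, arm 1 rewards feature 1.
rng = np.random.default_rng(0)
theta_true = np.array([[1.0, 0.0], [0.0, 1.0]])
X = rng.random((600, 2))
chosen, thetas = linucb(
    X, lambda x, a: float(x @ theta_true[a]) + 0.1 * rng.standard_normal(), K=2)
# Fraction of late rounds where the chosen arm matches the truly better arm.
late_acc = sum(a == (0 if x[0] > x[1] else 1)
               for a, x in zip(chosen[-200:], X[-200:])) / 200
```

Recomputing the matrix inverse every round is fine at this scale; a production version would maintain Ainv incrementally (e.g. via Sherman-Morrison updates).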

Common Confusions

Watch Out

Regret is not the same as simple regret

Cumulative regret measures the total cost of exploration across all rounds. Simple regret measures the quality of the arm you recommend after T rounds. Minimizing cumulative regret requires balancing explore and exploit at every step. Minimizing simple regret allows pure exploration followed by a single recommendation. Different objectives, different algorithms.

Watch Out

Bandits are not full reinforcement learning

Bandits have no state transitions. The reward of pulling arm i at round t does not depend on past arm choices. Full RL adds state, where actions change the environment. Bandits are the stateless special case of MDPs.

Canonical Examples

Example

A/B testing as a 2-armed bandit

You test two website layouts. Layout A has conversion rate 5%, layout B has conversion rate 7%. Each visitor is a round. A traditional A/B test uses a 50/50 split for a fixed sample size, then picks the winner. This incurs O(T) regret because half of visitors see the worse layout for the entire test. A bandit algorithm (UCB or Thompson sampling) shifts traffic toward B as evidence accumulates, achieving O(\ln T) regret instead.
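The O(T) claim for the fixed split is simple arithmetic: half of the T visitors see layout A, and each such visit loses the gap \Delta = 0.07 - 0.05 = 0.02 in expected conversions:

```python
# Expected regret of a fixed 50/50 split with the conversion rates above.
mu_a, mu_b = 0.05, 0.07
gap = 0.02                        # mu_b - mu_a: conversions lost per A-visit
for T in (1_000, 10_000, 100_000):
    split_regret = (T / 2) * gap  # linear in T: half the traffic pays the gap
    print(T, round(split_regret, 6))
# 1000 visitors -> 10 lost conversions; 100000 -> 1000.
```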

Key Takeaways

  • Cumulative regret R_T = \sum_t \Delta_{A_t} is the central performance measure
  • UCB1 achieves O(\sqrt{KT \ln T}) worst-case regret and O(\sum_i \ln T / \Delta_i) instance-dependent regret
  • The Lai-Robbins lower bound shows regret of order \sum_i \Delta_i \ln T / \mathrm{KL}(\nu_i \| \nu^*) is unavoidable
  • Thompson sampling matches the lower bound and is often the practical default
  • Contextual bandits extend the framework to side information (personalization)

Exercises

ExerciseCore

Problem

Consider a 2-armed bandit where arm 1 has mean 0.6 and arm 2 has mean 0.4. What is the expected cumulative regret of a policy that pulls each arm with equal probability for all T rounds?

ExerciseAdvanced

Problem

Show that the UCB1 exploration bonus \sqrt{2 \ln t / N_i(t)} ensures that \hat{\mu}_i + \sqrt{2 \ln t / N_i(t)} \geq \mu_i holds for each arm i at each round t with probability at least 1 - 1/t^4 (using the Hoeffding bound).

References

Canonical:

  • Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 7-8 (UCB), Chapter 36 (Thompson Sampling)
  • Lai & Robbins, "Asymptotically Efficient Adaptive Allocation Rules" (1985)

Current:

  • Slivkins, Introduction to Multi-Armed Bandits (2019), Chapters 1-4
  • Russo et al., "A Tutorial on Thompson Sampling" (2018)


Last reviewed: April 2026
