Multi-Armed Bandits Theory
The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.
Why This Matters
The multi-armed bandit problem is the simplest formalization of the exploration-exploitation tradeoff. You have $K$ slot machines. Each pull gives a random reward from an unknown distribution. You want to maximize total reward over $T$ pulls. Pulling the best arm every time is optimal, but you do not know which arm is best. Every pull of a suboptimal arm costs you, but every pull of a not-yet-explored arm teaches you something.
Bandits sit at the foundation of reinforcement learning, clinical trial design, and adaptive A/B testing. Understanding the regret framework here is a prerequisite for understanding regret in full MDPs.
Formal Setup
K-Armed Bandit
A stochastic K-armed bandit consists of $K$ arms, each with an unknown reward distribution $\nu_i$ with mean $\mu_i$. At each round $t = 1, \dots, T$, the learner selects arm $A_t \in \{1, \dots, K\}$ and observes reward $X_t \sim \nu_{A_t}$, independent of past rewards given the arm choice.
Let $\mu^* = \max_i \mu_i$ be the mean of the best arm. Let $\Delta_i = \mu^* - \mu_i$ be the gap for arm $i$.
Cumulative Regret
The cumulative regret after $T$ rounds is:

$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$$

Equivalently, $R_T = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[N_i(T)]$, where $N_i(T)$ is the number of times arm $i$ is pulled.
Regret measures the total cost of not knowing the best arm from the start. An algorithm with sublinear regret ($R_T = o(T)$) learns the best arm eventually.
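The two expressions for $R_T$ above are equal: summing the per-round gap $\mu^* - \mu_{A_t}$ over rounds is the same as grouping by arm and weighting each pull count by its gap. A minimal numeric check, using an arbitrary random policy on made-up arm means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.7])          # arm means; mu* = 0.7
T = 1000
arms = rng.integers(0, 3, size=T)       # a uniformly random policy (for illustration only)

# Definition 1: T * mu* minus expected total reward (expectation taken
# over rewards, conditional on the realized arm sequence)
regret_v1 = T * mu.max() - mu[arms].sum()

# Definition 2: sum over arms of gap times pull count
gaps = mu.max() - mu
pulls = np.bincount(arms, minlength=3)
regret_v2 = (gaps * pulls).sum()

assert np.isclose(regret_v1, regret_v2)
```

The uniform policy here pulls each arm about a third of the time forever, so its regret grows linearly in $T$; sublinear regret requires concentrating pulls on the best arm.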
The UCB Algorithm
The Upper Confidence Bound (UCB1) algorithm balances exploration and exploitation by maintaining a confidence interval for each arm's mean.
At round $t$, pull the arm maximizing:

$$\mathrm{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{2 \ln t}{N_i(t)}}$$

where $\hat{\mu}_i(t)$ is the sample mean of arm $i$ and $N_i(t)$ is its pull count so far. The second term is the exploration bonus: arms pulled less frequently have wider confidence intervals and get explored more.
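The rule above can be sketched in a few lines. This is a minimal implementation for Bernoulli arms (the helper name `ucb1` and the simulation setup are mine, not from a library):

```python
import math
import random

def ucb1(means, T, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return per-arm pull counts."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # N_i(t): pulls of each arm
    sums = [0.0] * K          # running reward sums
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # pull each arm once to initialize
        else:
            # pick the arm with the largest upper confidence bound
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.4, 0.6], T=5000)
# the better arm (mean 0.6) should receive the large majority of pulls
```

Note the algorithm is fully deterministic given the observed rewards: no randomization is needed, only optimism.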
Main Theorems
UCB1 Regret Bound
Statement
The UCB1 algorithm achieves expected cumulative regret bounded by:

$$R_T \le \sum_{i:\Delta_i > 0} \left( \frac{8 \ln T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right) \Delta_i \right)$$

This is $O\big(\sum_{i:\Delta_i>0} \ln T / \Delta_i\big)$ for fixed gaps. In the worst case over gap configurations, this gives $O(\sqrt{KT \ln T})$.
Intuition
The bound says UCB pulls each suboptimal arm roughly $O(\ln T / \Delta_i^2)$ times. Arms with small gaps are hard to distinguish from the best arm, so they get pulled more. The total cost is the number of suboptimal pulls times their gap. The logarithmic dependence on $T$ means regret grows very slowly.
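To get a feel for the numbers, here is a back-of-envelope check of the rough $8 \ln T / \Delta_i^2$ pull count (a sketch of the bound's scaling, not the exact constant for any particular run):

```python
import math

T = 100_000
for gap in (0.5, 0.1, 0.02):
    pulls = 8 * math.log(T) / gap**2     # rough UCB1 pull count for one suboptimal arm
    print(f"gap={gap}: ~{pulls:,.0f} pulls, regret contribution ~{pulls * gap:,.0f}")
```

Notice that for the tiny gap 0.02 the formula exceeds $T$ itself: the instance-dependent bound is vacuous there, which is exactly the regime where the worst-case $O(\sqrt{KT \ln T})$ bound is the binding one.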
Proof Sketch
Fix a suboptimal arm $i$. After $n$ pulls of arm $i$, the confidence width is $\sqrt{2 \ln t / n}$. Once $n \gtrsim 8 \ln T / \Delta_i^2$, the upper confidence bound for arm $i$ falls below $\mu^*$ with high probability. So arm $i$ is pulled at most $O(\ln T / \Delta_i^2)$ times. The additional constant terms handle the low-probability tail events.
Why It Matters
UCB shows that a simple, deterministic algorithm achieves near-optimal regret without any Bayesian assumptions. The regret is logarithmic in $T$, matching the fundamental lower bound up to constants.
Failure Mode
The bound depends on $1/\Delta_i$ for each arm. When many arms have gaps close to zero, the bound becomes very large. In adversarial (non-stochastic) settings, UCB1 can suffer linear regret. Use EXP3 for adversarial bandits instead.
Lai-Robbins Lower Bound
Statement
For any policy that achieves regret $o(T^a)$ for every bandit instance and every $a > 0$, the expected number of pulls of each suboptimal arm $i$ satisfies:

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\ln T} \ge \frac{1}{\mathrm{KL}(\nu_i \,\|\, \nu^*)}$$

where $\mathrm{KL}(\nu_i \,\|\, \nu^*)$ is the KL divergence from the suboptimal arm's distribution to the best arm's distribution.
Intuition
You must pull each suboptimal arm at least $\Omega(\ln T)$ times, and the constant depends on how hard it is to distinguish that arm from the best arm (measured by KL divergence). Arms that look similar to the best arm require more exploration.
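For Bernoulli arms the KL term has a closed form, so the $\ln T / \mathrm{KL}$ pull-count floor is easy to tabulate. A small sketch (best arm fixed at mean 0.6; the helper and the example means are illustrative):

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai-Robbins: a suboptimal arm must be pulled roughly ln(T) / KL times,
# so arms close to the best arm (small KL) force far more exploration.
T = 100_000
for p in (0.3, 0.55, 0.59):
    kl = kl_bernoulli(p, 0.6)
    print(f"mu_i={p}: KL={kl:.5f}, pulls >= ~{math.log(T) / kl:,.0f}")
```

An arm with mean 0.3 is cheap to rule out; an arm with mean 0.59 is nearly indistinguishable from 0.6 and demands tens of thousands of pulls.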
Proof Sketch
Change-of-measure argument. If arm $i$ were actually the best arm (swap the means), a consistent policy would need to pull arm $i$ many times. The KL divergence quantifies how many samples are needed to distinguish the two cases.
Why It Matters
This is the fundamental information-theoretic lower bound for bandits. It shows that UCB-type algorithms (with $O(\ln T)$ regret) are rate-optimal. No algorithm can do better than $\Omega(\ln T)$.
Failure Mode
The bound assumes stochastic, stationary rewards from exponential families. For adversarial or non-stationary environments, different lower bounds apply (the $\Omega(\sqrt{KT})$ minimax regret for adversarial bandits).
Thompson Sampling
Thompson sampling takes a Bayesian approach. Maintain a posterior for each arm's mean $\mu_i$. At each round, sample $\tilde{\mu}_i$ from the posterior for each arm and pull $A_t = \arg\max_i \tilde{\mu}_i$.
For Bernoulli rewards with a Beta prior: start with $\mathrm{Beta}(1, 1)$ for each arm. After observing a reward of 1, update $\alpha \leftarrow \alpha + 1$; after a reward of 0, update $\beta \leftarrow \beta + 1$. Sample from each posterior and pull the arm with the highest sample.
Thompson sampling also achieves the Lai-Robbins lower bound asymptotically. Empirically, it often outperforms UCB because its exploration is naturally calibrated to the posterior uncertainty.
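The Beta-Bernoulli update described above fits in a few lines. A minimal sketch on simulated arms (the helper name and setup are mine):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Thompson sampling with Beta(1,1) priors on simulated Bernoulli arms."""
    rng = random.Random(seed)
    K = len(means)
    alpha = [1] * K           # posterior successes + 1
    beta = [1] * K            # posterior failures + 1
    pulls = [0] * K
    for _ in range(T):
        # sample a mean from each posterior; pull the arm with the largest sample
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(K)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.5, 0.7], T=5000)
```

As the posterior for a clearly worse arm concentrates below the best arm's posterior, that arm's samples almost never win the argmax, so exploration shuts off automatically.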
Contextual Bandits
Contextual Bandit
At each round $t$, the learner observes a context $x_t \in \mathcal{X}$, then selects arm $A_t$, and observes reward $X_t$ with mean $\mu_{A_t}(x_t)$. The optimal policy is $\pi^*(x) = \arg\max_i \mu_i(x)$.
Contextual bandits generalize the basic setting by conditioning arm rewards on side information. This is the formal model for personalized A/B testing: $x$ is the user's feature vector, arms are treatment variants, and the goal is to learn the best variant for each user type.
LinUCB assumes $\mu_i(x) = \langle \theta_i, x \rangle$ and maintains confidence ellipsoids for each $\theta_i$. It achieves $\tilde{O}(d\sqrt{T})$ regret, where $d$ is the context dimension.
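A single LinUCB decision step can be sketched as follows. This is a minimal illustration of the idea (ridge estimate plus an ellipsoid-width bonus), not a full implementation; the function name, `alpha` scaling, and state layout are assumptions:

```python
import numpy as np

def linucb_choose(x, A_list, b_list, alpha=1.0):
    """One LinUCB step: pick the arm with the largest optimistic linear estimate.

    A_list[i] = I + sum of x x^T over past pulls of arm i  (d x d),
    b_list[i] = sum of reward * x over past pulls of arm i (d,).
    """
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                    # ridge-regression estimate of theta_i
        bonus = alpha * np.sqrt(x @ A_inv @ x)   # confidence-ellipsoid width along x
        scores.append(theta_hat @ x + bonus)
    return int(np.argmax(scores))

# fresh state: identity designs, zero reward sums
d, K = 3, 2
A_list = [np.eye(d) for _ in range(K)]
b_list = [np.zeros(d) for _ in range(K)]
arm = linucb_choose(np.array([1.0, 0.0, 0.5]), A_list, b_list)
```

After observing reward `r` for the chosen arm, one would update `A_list[arm] += np.outer(x, x)` and `b_list[arm] += r * x`; the bonus for that arm then shrinks along directions it has seen.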
Common Confusions
Regret is not the same as simple regret
Cumulative regret measures the total cost of exploration across all rounds. Simple regret measures the quality of the arm you recommend after $T$ rounds. Minimizing cumulative regret requires balancing explore and exploit at every step. Minimizing simple regret allows pure exploration followed by a single recommendation. Different objectives, different algorithms.
Bandits are not full reinforcement learning
Bandits have no state transitions. The reward of pulling arm $i$ at round $t$ does not depend on past arm choices. Full RL adds state, where actions change the environment. Bandits are the stateless special case of MDPs.
Canonical Examples
A/B testing as a 2-armed bandit
You test two website layouts. Layout A has conversion rate 5%, layout B has conversion rate 7%. Each visitor is a round. A traditional A/B test uses a 50/50 split for a fixed sample size, then picks the winner. This incurs linear regret over the test because half of visitors see the worse layout for its entire duration. A bandit algorithm (UCB or Thompson sampling) shifts traffic toward B as evidence accumulates, achieving $O(\ln T)$ regret instead.
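This comparison is easy to simulate. A minimal sketch, assuming the 5%/7% conversion rates above and treating the fixed 50/50 split as running for all $T$ visitors (the helper name and seed are mine):

```python
import random

def ab_regret(T, p_a=0.05, p_b=0.07, seed=0):
    """Expected regret of a fixed 50/50 split vs. Thompson sampling."""
    rng = random.Random(seed)
    # Fixed split: half the visitors see the worse layout for all T rounds.
    fixed = (T / 2) * (p_b - p_a)

    # Thompson sampling with Beta(1,1) priors on each layout.
    a = [1, 1]
    b = [1, 1]
    regret = 0.0
    for _ in range(T):
        arm = 0 if rng.betavariate(a[0], b[0]) > rng.betavariate(a[1], b[1]) else 1
        p = (p_a, p_b)[arm]
        r = 1 if rng.random() < p else 0
        a[arm] += r
        b[arm] += 1 - r
        regret += p_b - p            # expected per-round regret of the chosen arm
    return fixed, regret

fixed, ts = ab_regret(20_000)
```

With $T = 20{,}000$ visitors the fixed split loses $10{,}000 \times 0.02 = 200$ expected conversions; the Thompson run's regret typically comes out far lower, since traffic drifts to layout B once the posteriors separate.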
Key Takeaways
- Cumulative regret $R_T = T\mu^* - \mathbb{E}[\sum_t X_t]$ is the central performance measure
- UCB1 achieves $O(\sqrt{KT \ln T})$ worst-case regret and $O(\sum_i \ln T / \Delta_i)$ instance-dependent regret
- Lai-Robbins lower bound shows $\Omega(\ln T)$ regret is unavoidable
- Thompson sampling matches the lower bound and is often the practical default
- Contextual bandits extend the framework to side information (personalization)
Exercises
Problem
Consider a 2-armed bandit where arm 1 has mean 0.6 and arm 2 has mean 0.4. What is the expected cumulative regret after $T$ rounds of a policy that pulls each arm with equal probability at every round?
Problem
Show that the UCB1 exploration bonus ensures that $\mu_i \le \hat{\mu}_i(t) + \sqrt{2 \ln t / N_i(t)}$ holds for all arms and all rounds simultaneously with high probability (using the Hoeffding bound).
References
Canonical:
- Lattimore & Szepesvári, Bandit Algorithms (2020), Chapters 7-8 (UCB), Chapter 36 (Thompson Sampling)
- Lai & Robbins, "Asymptotically Efficient Adaptive Allocation Rules" (1985)
Current:
- Slivkins, Introduction to Multi-Armed Bandits (2019), Chapters 1-4
- Russo et al., "A Tutorial on Thompson Sampling" (2018)
Next Topics
- Markov decision processes: bandits with state transitions
- Policy gradient theorem: gradient-based optimization when actions affect future states
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Exploration vs Exploitation (Layer 2)