
RL Theory

Online Learning and Bandits

Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.


Why This Matters

In supervised learning, you see the entire training set, optimize, and deploy. In online learning, you make decisions sequentially: at each round, you choose an action, observe a reward, and update your strategy. You never see what would have happened if you had chosen differently.

This is the setting of A/B testing (which variant to show each user?), recommender systems (which item to suggest?), clinical trials (which treatment to assign?), and ad placement (which ad to display?). The theory of bandits gives precise answers about how much exploration is necessary and what regret is unavoidable.

Mental Model

You face $K$ slot machines (arms). Each arm gives a random reward when pulled. You do not know the reward distributions. You have $T$ rounds. At each round, you pull one arm and observe its reward. You want to maximize total reward.

The tension: you should exploit the arm that seems best (based on what you have seen), but you should also explore other arms (in case a seemingly bad arm is actually better). This is the explore-exploit dilemma.

Formal Setup

Definition

Stochastic Multi-Armed Bandit

A stochastic $K$-armed bandit consists of $K$ arms, each with an unknown reward distribution $\nu_k$ with mean $\mu_k$. At each round $t = 1, \ldots, T$:

  1. The learner selects arm $A_t \in \{1, \ldots, K\}$
  2. The learner observes reward $r_t \sim \nu_{A_t}$
  3. The learner sees only $r_t$, not the rewards of other arms

The optimal arm is $k^* = \arg\max_k \mu_k$ with mean $\mu^* = \mu_{k^*}$.

Definition

Regret

The cumulative regret after $T$ rounds is:

$$R_T = T \mu^* - \sum_{t=1}^{T} \mu_{A_t} = \sum_{t=1}^{T} (\mu^* - \mu_{A_t})$$

This is the expected total reward of always playing the best arm, minus the expected total reward of the learner's strategy. Equivalently:

$$R_T = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[N_k(T)]$$

where $\Delta_k = \mu^* - \mu_k$ is the suboptimality gap of arm $k$ and $N_k(T)$ is the number of times arm $k$ is pulled in $T$ rounds.
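The two forms of the definition can be checked numerically. A small Python sketch with a hypothetical 3-armed instance (the means and pull counts below are illustrative, not from the text):

```python
# Hypothetical 3-armed instance: true means and pull counts N_k(T)
means = [0.7, 0.5, 0.3]
pulls = [9000, 700, 300]            # sums to T = 10,000
T = sum(pulls)
mu_star = max(means)

# First form: T * mu* minus the expected reward actually collected
regret_v1 = T * mu_star - sum(m * n for m, n in zip(means, pulls))

# Second form: sum over arms of gap times pull count
gaps = [mu_star - m for m in means]
regret_v2 = sum(g * n for g, n in zip(gaps, pulls))

# Both forms agree (up to float rounding)
assert abs(regret_v1 - regret_v2) < 1e-9
print(regret_v1)  # ≈ 260.0, i.e. 0.2 * 700 + 0.4 * 300
```

Only the suboptimal arms contribute: the 9,000 pulls of the best arm add nothing to regret.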

Definition

Bandit Feedback

In bandit feedback, the learner observes only the reward of the chosen action. This contrasts with full information (observing rewards of all actions) and expert advice (observing losses of all experts). Bandit feedback is strictly harder: less information per round means more exploration is needed.

Main Theorems

Theorem

Minimax Regret Lower Bound for Bandits

Statement

For any bandit algorithm, there exists a problem instance with $K$ arms and $T$ rounds such that:

$$R_T \geq c \sqrt{KT}$$

for a universal constant $c > 0$ (specifically, $c \geq 1/20$). No algorithm can achieve regret better than $\Omega(\sqrt{KT})$ in the worst case.

Intuition

With $K$ arms, you need to try each arm enough times to distinguish it from the best. Detecting a gap of $\sqrt{K/T}$ requires about $T/K$ pulls of each suboptimal arm. The total regret from these necessary explorations is $K \cdot (T/K) \cdot \sqrt{K/T} = \sqrt{KT}$.

Proof Sketch

Construct $K$ problem instances. In instance $k$, arm $k$ has mean $1/2 + \epsilon$ and all other arms have mean $1/2$. Any algorithm that performs well on one instance must pull the other arms rarely, but then it cannot distinguish that instance from one in which a rarely pulled arm is optimal. By a change-of-measure argument (KL divergence between the instances), the expected number of pulls of the suboptimal arms is lower-bounded, giving regret $\Omega(\sqrt{KT})$ after optimizing $\epsilon = \sqrt{K/T}$.

Why It Matters

This lower bound sets the bar: $O(\sqrt{KT})$ regret is optimal up to constants. Algorithms achieving this rate are called minimax optimal. UCB and Thompson sampling both match it up to logarithmic factors.

Failure Mode

This is a worst-case bound. For "easy" instances where the gaps $\Delta_k$ are large, much better regret is possible: $O(\sum_k \log T / \Delta_k)$. The gap-dependent and gap-independent bounds address different questions (instance-dependent vs. worst-case).

Theorem

UCB1 Regret Bound

Statement

The UCB1 algorithm selects at each round $t$ the arm with the highest upper confidence bound:

$$A_t = \arg\max_{k} \left( \hat{\mu}_k(t) + \sqrt{\frac{2 \ln t}{N_k(t)}} \right)$$

where $\hat{\mu}_k(t)$ is the empirical mean of arm $k$ and $N_k(t)$ is its pull count. UCB1 achieves:

$$R_T \leq \sum_{k : \Delta_k > 0} \frac{8 \ln T}{\Delta_k} + \left(1 + \frac{\pi^2}{3}\right) \sum_{k=1}^{K} \Delta_k$$

This is $O(K \log T / \Delta_{\min})$ in the gap-dependent form and $O(\sqrt{KT \log T})$ in the gap-independent form.
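The gap-independent form can be recovered from the gap-dependent one by a standard thresholding argument; a sketch (the threshold $\Delta$ is a free parameter introduced here):

```latex
% Arms with gap below a threshold \Delta contribute at most \Delta T in total;
% arms with larger gap are covered by the gap-dependent bound:
R_T \;\le\; \Delta T + \sum_{k : \Delta_k > \Delta} \frac{8 \ln T}{\Delta_k}
    \;\le\; \Delta T + \frac{8 K \ln T}{\Delta}.
% Balancing the two terms with \Delta = \sqrt{8 K \ln T / T} gives
R_T \;\le\; 2\sqrt{8 K T \ln T} \;=\; O(\sqrt{KT \log T}).
```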

Intuition

UCB implements "optimism in the face of uncertainty." It pretends each arm is as good as its best plausible value given the data. Arms that have been pulled many times have tight confidence intervals, so their UCB is close to their true mean. Arms that have been pulled few times have wide intervals, so their UCB is high. This naturally balances exploration (pulling uncertain arms) and exploitation (pulling arms with high empirical means).
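The index rule is a few lines of code. A minimal simulation sketch on Bernoulli arms (the function name, arm means, and seed are illustrative assumptions, not from the text):

```python
import math
import random

def ucb1(means, T, seed=0):
    """Simulate UCB1 on a Bernoulli bandit with the given true means.

    Returns the cumulative pseudo-regret after T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # N_k(t): number of pulls of each arm
    sums = [0.0] * K          # running reward sum for each arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # pull each arm once to initialize
        else:
            # optimism: empirical mean plus width sqrt(2 ln t / N_k)
            arm = max(range(K), key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2 * math.log(t) / counts[k]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

On an easy instance (large gaps), the regret of this sketch grows logarithmically in $T$, consistent with the bound above.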

Proof Sketch

A suboptimal arm $k$ is pulled only when its UCB exceeds $\mu^*$ (or the optimal arm's mean is temporarily underestimated, which Hoeffding's inequality makes unlikely). The event $\hat{\mu}_k + \sqrt{2 \ln t / N_k} \geq \mu^*$ requires either $\hat{\mu}_k$ to overestimate $\mu_k$ (unlikely for large $N_k$) or $N_k$ to be small enough that the confidence width $\sqrt{2 \ln t / N_k}$ exceeds $\Delta_k / 2$. Solving $\sqrt{2 \ln T / N_k} = \Delta_k / 2$ gives $N_k \leq 8 \ln T / \Delta_k^2$, which bounds the number of times arm $k$ is pulled.

Why It Matters

UCB1 is the simplest algorithm achieving near-optimal regret. It requires no hyperparameters (no exploration probability, no temperature). The confidence width automatically adapts: arms that are pulled often get exploited; arms pulled rarely get explored.

Failure Mode

UCB1 has a $\sqrt{\log T}$ excess over the minimax optimal $\sqrt{KT}$. For very large $T$, this matters. UCB1 is also designed for bounded rewards; for heavy-tailed distributions, robust variants (median-of-means UCB) are needed. For non-stationary environments, the confidence intervals are invalid and discounted or sliding-window UCB is required.

Thompson Sampling

Definition

Thompson Sampling

Thompson sampling maintains a posterior distribution $P(\mu_k \mid \text{data})$ for each arm's mean. At each round:

  1. Sample $\tilde{\mu}_k \sim P(\mu_k \mid \text{data})$ for each arm
  2. Play $A_t = \arg\max_k \tilde{\mu}_k$
  3. Update the posterior with the observed reward

For Bernoulli rewards with Beta priors: start with $\text{Beta}(1,1)$ for each arm. After observing $s$ successes and $f$ failures from arm $k$, the posterior is $\text{Beta}(1 + s, 1 + f)$.
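A minimal Beta-Bernoulli implementation of these three steps (the function name and simulation setup are my own, not from the text):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Beta-Bernoulli Thompson sampling; returns cumulative pseudo-regret."""
    rng = random.Random(seed)
    K = len(means)
    # Beta(1,1) prior per arm: alpha = 1 + successes, beta = 1 + failures
    alpha = [1] * K
    beta = [1] * K
    best = max(means)
    regret = 0.0
    for _ in range(T):
        # 1. sample a plausible mean for each arm from its posterior
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        # 2. play the arm whose sample is largest
        arm = max(range(K), key=lambda k: samples[k])
        # 3. observe a Bernoulli reward and update that arm's posterior
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best - means[arm]
    return regret
```

Note that exploration here is implicit: no exploration parameter appears anywhere; the posterior's own spread drives it.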

Theorem

Thompson Sampling Bayesian Regret

Statement

Thompson sampling achieves Bayesian regret:

$$\text{BayesRegret}(T) \leq C\sqrt{KT \ln T}$$

for a constant $C$ independent of the prior. For Bernoulli bandits with Beta priors, it also achieves the gap-dependent bound:

$$R_T \leq O\left(\sum_{k : \Delta_k > 0} \frac{\ln T}{\Delta_k}\right)$$

matching UCB1 up to constants.

Intuition

Arms with uncertain posteriors sometimes produce high samples, causing exploration. Arms with well-estimated posteriors produce samples near their true means, causing exploitation. The stochasticity of sampling automatically calibrates the exploration rate to the posterior uncertainty.

Proof Sketch

The key step is bounding the probability that Thompson sampling plays a suboptimal arm. This probability is at most the posterior probability that the suboptimal arm is optimal. As data accumulates, this posterior probability shrinks, limiting the number of suboptimal plays. The proof by Agrawal and Goyal (2012) uses a careful anti-concentration argument on the posterior.

Why It Matters

Thompson sampling is often the algorithm of choice in practice. It is simple to implement, adapts naturally to the reward structure, and empirically outperforms UCB on many problem instances despite having the same theoretical guarantees. It also generalizes naturally to structured bandits and contextual bandits.

Failure Mode

Thompson sampling requires a prior and a likelihood model. If the model is misspecified (e.g., assuming Gaussian rewards when they are heavy-tailed), the posterior concentrates on the wrong values and performance degrades. The Bayesian regret guarantee is with respect to the assumed prior, not worst-case.

Connection to RL and A/B Testing

Multi-armed bandits are the simplest case of reinforcement learning: one state, $K$ actions, immediate rewards. Full RL adds states and delayed rewards.

In A/B testing, the standard approach assigns users to variants uniformly and waits until statistical significance. Bandit algorithms (UCB, Thompson sampling) assign more users to better variants during the test, reducing total regret. The cost is that the statistical analysis is more complex.

Common Confusions

Watch Out

Bandit regret is not classification error

Regret measures the cumulative cost of exploration over $T$ rounds relative to always playing the best arm. Classification error measures the mistake rate at a single time point. A bandit algorithm with high regret in the first 100 rounds but optimal play afterward has low long-term average regret.

Watch Out

Epsilon-greedy is not minimax optimal

Epsilon-greedy (play the empirically best arm with probability $1 - \epsilon$, a uniformly random arm with probability $\epsilon$) achieves $O(\epsilon T + K/\epsilon)$ regret. Optimizing $\epsilon = \sqrt{K/T}$ gives $O(\sqrt{KT})$, but this requires knowing $T$ in advance. Decaying epsilon-greedy can match this rate, but UCB and Thompson sampling achieve it without tuning.
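For comparison with the UCB and Thompson sampling algorithms above, here is a sketch of the tuned fixed-$\epsilon$ variant, assuming Bernoulli rewards and a known horizon (function name and setup are illustrative):

```python
import math
import random

def epsilon_greedy(means, T, seed=0):
    """Fixed-epsilon greedy with eps = sqrt(K/T); returns pseudo-regret.

    Note: the tuning requires knowing T up front, unlike UCB1.
    """
    rng = random.Random(seed)
    K = len(means)
    eps = min(1.0, math.sqrt(K / T))   # horizon-dependent exploration rate
    counts = [0] * K
    sums = [0.0] * K
    best = max(means)
    regret = 0.0
    for _ in range(T):
        if rng.random() < eps or min(counts) == 0:
            arm = rng.randrange(K)      # explore uniformly at random
        else:
            # exploit the arm with the highest empirical mean
            arm = max(range(K), key=lambda k: sums[k] / counts[k])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

The `min(counts) == 0` clause is a small practical tweak that forces each arm to be tried at least once before exploitation begins.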

Exercises

ExerciseCore

Problem

You run a 3-armed bandit for $T = 10{,}000$ rounds. The arms have true means $\mu_1 = 0.7$, $\mu_2 = 0.5$, $\mu_3 = 0.3$. What is the maximum possible cumulative regret? What is the regret of always pulling arm 2?

ExerciseAdvanced

Problem

Derive the UCB1 confidence width. Starting from Hoeffding's inequality for bounded $[0,1]$ random variables, show that $\sqrt{2 \ln t / N_k}$ is the correct width for a $1/t^4$ failure probability per arm per round.

ExerciseResearch

Problem

Explain why the minimax lower bound is $\Omega(\sqrt{KT})$ but the UCB1 gap-dependent bound can be much better, $O(\sum_k \log T / \Delta_k)$. When is each bound tighter? Construct an instance where the gap-dependent bound is $O(\log T)$ and one where it is $\Omega(\sqrt{KT})$.

References

Canonical:

  • Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 6-8, 35-36
  • Auer, Cesa-Bianchi, Fischer, Finite-time Analysis of the Multiarmed Bandit Problem (2002)

Current:

  • Agrawal & Goyal, Analysis of Thompson Sampling for the Multi-armed Bandit Problem (2012)
  • Russo et al., A Tutorial on Thompson Sampling (2018)


Last reviewed: April 2026
