
RL Theory

Online Learning and Bandits

Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.


Why This Matters

In supervised learning, you see the entire training set, optimize, and deploy. In online learning, you make decisions sequentially: at each round, you choose an action, observe a reward, and update your strategy. You never see what would have happened if you had chosen differently.

This is the setting of A/B testing (which variant to show each user?), recommender systems (which item to suggest?), clinical trials (which treatment to assign?), and ad placement (which ad to display?). The theory of bandits gives precise answers about how much exploration is necessary and what regret is unavoidable.

Mental Model

You face $K$ slot machines (arms). Each arm gives a random reward when pulled. You do not know the reward distributions. You have $T$ rounds. At each round, you pull one arm and observe its reward. You want to maximize total reward.

The tension: you should exploit the arm that seems best (based on what you have seen), but you should also explore other arms (in case a seemingly bad arm is actually better). This is the explore-exploit dilemma.

Formal Setup

Definition

Stochastic Multi-Armed Bandit

A stochastic $K$-armed bandit consists of $K$ arms, each with an unknown reward distribution $\nu_k$ with mean $\mu_k$. At each round $t = 1, \ldots, T$:

  1. The learner selects arm $A_t \in \{1, \ldots, K\}$
  2. The learner observes reward $r_t \sim \nu_{A_t}$
  3. The learner sees only $r_t$, not the rewards of other arms

The optimal arm is $k^* = \arg\max_k \mu_k$ with mean $\mu^* = \mu_{k^*}$.

Definition

Regret

The cumulative regret after $T$ rounds is:

$$R_T = T \mu^* - \sum_{t=1}^{T} \mu_{A_t} = \sum_{t=1}^{T} (\mu^* - \mu_{A_t})$$

This is the expected total reward of always playing the best arm, minus the expected total reward of the learner's strategy. Equivalently:

$$R_T = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[N_k(T)]$$

where $\Delta_k = \mu^* - \mu_k$ is the suboptimality gap of arm $k$ and $N_k(T)$ is the number of times arm $k$ is pulled in $T$ rounds.
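The two forms of the definition can be checked numerically. A small Python sketch with a hypothetical 3-armed instance (the means and pull counts below are illustrative, not from the text):

```python
# Hypothetical 3-armed instance: true means and pull counts N_k(T)
means = [0.7, 0.5, 0.3]
pulls = [9000, 700, 300]            # sums to T = 10,000
T = sum(pulls)
mu_star = max(means)

# First form: T * mu* minus the expected reward actually collected
regret_v1 = T * mu_star - sum(m * n for m, n in zip(means, pulls))

# Second form: sum over arms of gap times pull count
gaps = [mu_star - m for m in means]
regret_v2 = sum(g * n for g, n in zip(gaps, pulls))

# Both forms agree (up to float rounding)
assert abs(regret_v1 - regret_v2) < 1e-9
print(regret_v1)  # ≈ 260.0, i.e. 0.2 * 700 + 0.4 * 300
```

Only the suboptimal arms contribute: the 9,000 pulls of the best arm add nothing to regret.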

Definition

Bandit Feedback

In bandit feedback, the learner observes only the reward of the chosen action. This contrasts with full information (observing rewards of all actions) and expert advice (observing losses of all experts). Bandit feedback is strictly harder: less information per round means more exploration is needed.

Main Theorems

Theorem

Minimax Regret Lower Bound for Bandits

Statement

For any bandit algorithm, there exists a problem instance with $K$ arms and $T$ rounds such that:

$$R_T \geq c \sqrt{KT}$$

for a universal constant $c > 0$ (specifically, $c \geq 1/20$). No algorithm can achieve regret better than $\Omega(\sqrt{KT})$ in the worst case.

Intuition

With $K$ arms, you need to try each arm enough times to distinguish it from the best. Detecting a gap of $\sqrt{K/T}$ requires about $T/K$ pulls of each suboptimal arm. The total regret from these necessary explorations is $K \cdot (T/K) \cdot \sqrt{K/T} = \sqrt{KT}$.

Proof Sketch

Construct $K$ problem instances. In instance $k$, arm $k$ has mean $1/2 + \epsilon$ and all other arms have mean $1/2$. Any algorithm that performs well on one instance must pull the other arms rarely, but then it cannot distinguish that instance from one in which a rarely pulled arm is optimal. By a change-of-measure argument (KL divergence between the instances), the expected number of pulls of the suboptimal arms is lower-bounded, giving regret $\Omega(\sqrt{KT})$ after optimizing $\epsilon = \sqrt{K/T}$.

Why It Matters

This lower bound sets the bar: $O(\sqrt{KT})$ regret is optimal up to constants. Algorithms achieving this rate are called minimax optimal. UCB and Thompson sampling both match it up to logarithmic factors.

Failure Mode

This is a worst-case bound. For "easy" instances where the gaps $\Delta_k$ are large, much better regret is possible: $O(\sum_k \log T / \Delta_k)$. The gap-dependent and gap-independent bounds address different questions (instance-dependent vs. worst-case).

Theorem

UCB1 Regret Bound

Statement

The UCB1 algorithm selects at each round $t$ the arm with the highest upper confidence bound:

$$A_t = \arg\max_{k} \left( \hat{\mu}_k(t) + \sqrt{\frac{2 \ln t}{N_k(t)}} \right)$$

where $\hat{\mu}_k(t)$ is the empirical mean of arm $k$ and $N_k(t)$ is its pull count. UCB1 achieves:

$$R_T \leq \sum_{k : \Delta_k > 0} \frac{8 \ln T}{\Delta_k} + \left(1 + \frac{\pi^2}{3}\right) \sum_{k=1}^{K} \Delta_k$$

This is $O(K \log T / \Delta_{\min})$ in the gap-dependent form and $O(\sqrt{KT \log T})$ in the gap-independent form.
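The gap-independent form can be recovered from the gap-dependent one by a standard thresholding argument; a sketch (the threshold $\Delta$ is a free parameter introduced here):

```latex
% Arms with gap below a threshold \Delta contribute at most \Delta T in total;
% arms with larger gap are covered by the gap-dependent bound:
R_T \;\le\; \Delta T + \sum_{k : \Delta_k > \Delta} \frac{8 \ln T}{\Delta_k}
    \;\le\; \Delta T + \frac{8 K \ln T}{\Delta}.
% Balancing the two terms with \Delta = \sqrt{8 K \ln T / T} gives
R_T \;\le\; 2\sqrt{8 K T \ln T} \;=\; O(\sqrt{KT \log T}).
```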

Intuition

UCB implements "optimism in the face of uncertainty." It pretends each arm is as good as its best plausible value given the data. Arms that have been pulled many times have tight confidence intervals, so their UCB is close to their true mean. Arms that have been pulled few times have wide intervals, so their UCB is high. This naturally balances exploration (pulling uncertain arms) and exploitation (pulling arms with high empirical means).
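The index rule is a few lines of code. A minimal simulation sketch on Bernoulli arms (the function name, arm means, and seed are illustrative assumptions, not from the text):

```python
import math
import random

def ucb1(means, T, seed=0):
    """Simulate UCB1 on a Bernoulli bandit with the given true means.

    Returns the cumulative pseudo-regret after T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # N_k(t): number of pulls of each arm
    sums = [0.0] * K          # running reward sum for each arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # pull each arm once to initialize
        else:
            # optimism: empirical mean plus width sqrt(2 ln t / N_k)
            arm = max(range(K), key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2 * math.log(t) / counts[k]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

On an easy instance (large gaps), the regret of this sketch grows logarithmically in $T$, consistent with the bound above.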

Proof Sketch

A suboptimal arm $k$ is pulled only when its UCB exceeds $\mu^*$ (or the optimal arm's mean is temporarily underestimated, which Hoeffding's inequality makes unlikely). The event $\hat{\mu}_k + \sqrt{2 \ln t / N_k} \geq \mu^*$ requires either $\hat{\mu}_k$ to overestimate $\mu_k$ (unlikely for large $N_k$) or $N_k$ to be small enough that the confidence width $\sqrt{2 \ln t / N_k}$ exceeds $\Delta_k / 2$. Solving $\sqrt{2 \ln T / N_k} = \Delta_k / 2$ gives $N_k \leq 8 \ln T / \Delta_k^2$, which bounds the number of times arm $k$ is pulled.

Why It Matters

UCB1 is the simplest algorithm achieving near-optimal regret. It requires no hyperparameters (no exploration probability, no temperature). The confidence width automatically adapts: arms that are pulled often get exploited; arms pulled rarely get explored.

Failure Mode

UCB1 has a $\sqrt{\log T}$ excess over the minimax optimal $\sqrt{KT}$. For very large $T$, this matters. UCB1 is also designed for bounded rewards; for heavy-tailed distributions, robust variants (median-of-means UCB) are needed. For non-stationary environments, the confidence intervals are invalid and discounted or sliding-window UCB is required.

Thompson Sampling

Definition

Thompson Sampling

Thompson sampling maintains a posterior distribution $P(\mu_k \mid \text{data})$ for each arm's mean. At each round:

  1. Sample $\tilde{\mu}_k \sim P(\mu_k \mid \text{data})$ for each arm
  2. Play $A_t = \arg\max_k \tilde{\mu}_k$
  3. Update the posterior with the observed reward

For Bernoulli rewards with Beta priors: start with $\text{Beta}(1,1)$ for each arm. After observing $s$ successes and $f$ failures from arm $k$, the posterior is $\text{Beta}(1 + s, 1 + f)$.
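A minimal Beta-Bernoulli implementation of these three steps (the function name and simulation setup are my own, not from the text):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Beta-Bernoulli Thompson sampling; returns cumulative pseudo-regret."""
    rng = random.Random(seed)
    K = len(means)
    # Beta(1,1) prior per arm: alpha = 1 + successes, beta = 1 + failures
    alpha = [1] * K
    beta = [1] * K
    best = max(means)
    regret = 0.0
    for _ in range(T):
        # 1. sample a plausible mean for each arm from its posterior
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        # 2. play the arm whose sample is largest
        arm = max(range(K), key=lambda k: samples[k])
        # 3. observe a Bernoulli reward and update that arm's posterior
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best - means[arm]
    return regret
```

Note that exploration here is implicit: no exploration parameter appears anywhere; the posterior's own spread drives it.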

Theorem

Thompson Sampling Bayesian Regret

Statement

Thompson sampling achieves Bayesian regret:

$$\text{BayesRegret}(T) \leq C\sqrt{KT \ln T}$$

for a constant $C$ independent of the prior. For Bernoulli bandits with Beta priors, it also achieves the gap-dependent bound:

$$R_T \leq O\left(\sum_{k : \Delta_k > 0} \frac{\ln T}{\Delta_k}\right)$$

matching UCB1 up to constants.

Intuition

Arms with uncertain posteriors sometimes produce high samples, causing exploration. Arms with well-estimated posteriors produce samples near their true means, causing exploitation. The stochasticity of sampling automatically calibrates the exploration rate to the posterior uncertainty.

Proof Sketch

The key step is bounding the probability that Thompson sampling plays a suboptimal arm. This probability is at most the posterior probability that the suboptimal arm is optimal. As data accumulates, this posterior probability shrinks, limiting the number of suboptimal plays. The proof by Agrawal and Goyal (2012) uses a careful anti-concentration argument on the posterior.

Why It Matters

Thompson sampling is often the algorithm of choice in practice. It is simple to implement, adapts naturally to the reward structure, and empirically outperforms UCB on many problem instances despite having the same theoretical guarantees. It also generalizes naturally to structured bandits and contextual bandits.

Failure Mode

Thompson sampling requires a prior and a likelihood model. If the model is misspecified (e.g., assuming Gaussian rewards when they are heavy-tailed), the posterior concentrates on the wrong values and performance degrades. The Bayesian regret guarantee is with respect to the assumed prior, not worst-case.

Connection to RL and A/B Testing

Multi-armed bandits are the simplest case of reinforcement learning: one state, $K$ actions, immediate rewards. Full RL adds states and delayed rewards.

In A/B testing, the standard approach assigns users to variants uniformly and waits until statistical significance. Bandit algorithms (UCB, Thompson sampling) assign more users to better variants during the test, reducing total regret. The cost is that the statistical analysis is more complex.

Common Confusions

Watch Out

Bandit regret is not classification error

Regret measures the cumulative cost of exploration over $T$ rounds relative to always playing the best arm. Classification error measures the mistake rate at a single time point. A bandit algorithm with high regret in the first 100 rounds but optimal play afterward has low long-term average regret.

Watch Out

Epsilon-greedy is not minimax optimal

Epsilon-greedy (play the empirically best arm with probability $1 - \epsilon$, a uniformly random arm with probability $\epsilon$) achieves $O(\epsilon T + K/\epsilon)$ regret. Optimizing $\epsilon = \sqrt{K/T}$ gives $O(\sqrt{KT})$, but this requires knowing $T$ in advance. Decaying epsilon-greedy can match this rate, but UCB and Thompson sampling achieve it without tuning.
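For comparison with the UCB and Thompson sampling algorithms above, here is a sketch of the tuned fixed-$\epsilon$ variant, assuming Bernoulli rewards and a known horizon (function name and setup are illustrative):

```python
import math
import random

def epsilon_greedy(means, T, seed=0):
    """Fixed-epsilon greedy with eps = sqrt(K/T); returns pseudo-regret.

    Note: the tuning requires knowing T up front, unlike UCB1.
    """
    rng = random.Random(seed)
    K = len(means)
    eps = min(1.0, math.sqrt(K / T))   # horizon-dependent exploration rate
    counts = [0] * K
    sums = [0.0] * K
    best = max(means)
    regret = 0.0
    for _ in range(T):
        if rng.random() < eps or min(counts) == 0:
            arm = rng.randrange(K)      # explore uniformly at random
        else:
            # exploit the arm with the highest empirical mean
            arm = max(range(K), key=lambda k: sums[k] / counts[k])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

The `min(counts) == 0` clause is a small practical tweak that forces each arm to be tried at least once before exploitation begins.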

Exercises

ExerciseCore

Problem

You run a 3-armed bandit for $T = 10{,}000$ rounds. The arms have true means $\mu_1 = 0.7$, $\mu_2 = 0.5$, $\mu_3 = 0.3$. What is the maximum possible cumulative regret? What is the regret of always pulling arm 2?

ExerciseAdvanced

Problem

Derive the UCB1 confidence width. Starting from Hoeffding's inequality for bounded $[0,1]$ random variables, show that $\sqrt{2 \ln t / N_k}$ is the correct width for a $1/t^4$ failure probability per arm per round.

ExerciseResearch

Problem

Explain why the minimax lower bound is $\Omega(\sqrt{KT})$ but the UCB1 gap-dependent bound can be much better, $O(\sum_k \log T / \Delta_k)$. When is each bound tighter? Construct an instance where the gap-dependent bound is $O(\log T)$ and one where it is $\Omega(\sqrt{KT})$.

References

Canonical:

  • Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 6-8, 35-36
  • Auer, Cesa-Bianchi, Fischer, Finite-time Analysis of the Multiarmed Bandit Problem (2002)

Current:

  • Agrawal & Goyal, Analysis of Thompson Sampling for the Multi-armed Bandit Problem (2012)
  • Russo et al., A Tutorial on Thompson Sampling (2018)


Last reviewed: April 2026
