RL Theory
Online Learning and Bandits
Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.
Why This Matters
In supervised learning, you see the entire training set, optimize, and deploy. In online learning, you make decisions sequentially: at each round, you choose an action, observe a reward, and update your strategy. You never see what would have happened if you had chosen differently.
This is the setting of A/B testing (which variant to show each user?), recommender systems (which item to suggest?), clinical trials (which treatment to assign?), and ad placement (which ad to display?). The theory of bandits gives precise answers about how much exploration is necessary and what regret is unavoidable.
Mental Model
You face $K$ slot machines (arms). Each arm gives a random reward when pulled. You do not know the reward distributions. You have $T$ rounds. At each round, you pull one arm and observe its reward. You want to maximize total reward.
The tension: you should exploit the arm that seems best (based on what you have seen), but you should also explore other arms (in case a seemingly bad arm is actually better). This is the explore-exploit dilemma.
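The loop above can be sketched as a simulation; the arm means below are hypothetical and, in the real problem, unknown to the learner:

```python
import random

# Hypothetical 3-armed Bernoulli bandit; the learner never sees these means.
means = [0.3, 0.5, 0.7]
T = 1000
random.seed(0)

total_reward = 0.0
for t in range(T):
    arm = random.randrange(len(means))        # pure exploration: pick an arm uniformly
    reward = 1.0 if random.random() < means[arm] else 0.0
    total_reward += reward                    # only the pulled arm's reward is observed

# A uniform policy earns about T * average(means) = 500 in expectation,
# while always pulling the best arm would earn about T * 0.7 = 700.
print(total_reward)
```

Any strategy better than uniform must shift pulls toward the best arm while still sampling the others often enough to identify it.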
Formal Setup
Stochastic Multi-Armed Bandit
A stochastic $K$-armed bandit consists of $K$ arms, each with an unknown reward distribution $\nu_i$ with mean $\mu_i$. At each round $t = 1, \dots, T$:
- The learner selects arm $A_t \in \{1, \dots, K\}$
- The learner observes reward $X_t \sim \nu_{A_t}$
- The learner sees only $X_t$, not the rewards of other arms
The optimal arm is $i^* = \arg\max_i \mu_i$ with mean $\mu^* = \mu_{i^*}$.
Regret
The cumulative regret after $T$ rounds is:
$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$$
This is the expected total reward of always playing the best arm, minus the expected total reward of the learner's strategy. Equivalently:
$$R_T = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[N_i(T)]$$
where $\Delta_i = \mu^* - \mu_i$ is the suboptimality gap of arm $i$ and $N_i(T)$ is the number of times arm $i$ is pulled in $T$ rounds.
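A small check that the two formulas agree on a fixed pull sequence (the means and pull counts here are made up for illustration):

```python
# Hypothetical instance: arm 2 is optimal.
mu = [0.5, 0.6, 0.7]
mu_star = max(mu)

# A fixed pull sequence: N_0 = 300, N_1 = 300, N_2 = 400 over T = 1000 rounds.
pulls = [0] * 300 + [1] * 300 + [2] * 400

# Form 1: T * mu_star minus the expected reward of the sequence.
regret_1 = len(pulls) * mu_star - sum(mu[a] for a in pulls)

# Form 2: sum over arms of gap_i * N_i(T).
counts = [pulls.count(i) for i in range(len(mu))]
regret_2 = sum((mu_star - mu[i]) * counts[i] for i in range(len(mu)))

print(regret_1, regret_2)   # both come to 0.2*300 + 0.1*300, i.e. about 90
```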
Bandit Feedback
In bandit feedback, the learner observes only the reward of the chosen action. This contrasts with full information (observing rewards of all actions) and expert advice (observing losses of all experts). Bandit feedback is strictly harder: less information per round means more exploration is needed.
Main Theorems
Minimax Regret Lower Bound for Bandits
Statement
For any bandit algorithm, there exists a problem instance with $K$ arms and $T$ rounds such that:
$$R_T \geq c\sqrt{KT}$$
for a universal constant $c > 0$. No algorithm can achieve regret better than $\Omega(\sqrt{KT})$ in the worst case.
Intuition
With $K$ arms, you need to try each arm enough times to distinguish it from the best. Each suboptimal arm needs about $1/\Delta^2$ pulls to detect a gap of $\Delta$, and each of those pulls costs up to $\Delta$ in regret. The total regret from these necessary explorations, at the hardest gap $\Delta \approx \sqrt{K/T}$, is $\Omega(\sqrt{KT})$.
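Written out, the balancing argument is:

```latex
% Each suboptimal arm needs ~ 1/\Delta^2 pulls to detect a gap \Delta,
% and each such pull costs \Delta, so forced exploration costs about
\[
  (K-1) \cdot \frac{1}{\Delta^2} \cdot \Delta \;=\; \frac{K-1}{\Delta},
\]
% while committing to a wrong arm for the whole horizon costs up to \Delta T.
% The adversary picks the gap that balances the two:
\[
  \frac{K}{\Delta} \asymp \Delta T
  \quad\Longrightarrow\quad
  \Delta \asymp \sqrt{K/T}
  \quad\Longrightarrow\quad
  R_T = \Omega\!\left(\sqrt{KT}\right).
\]
```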
Proof Sketch
Construct $K$ problem instances. In instance $i$, arm $i$ has mean $\tfrac{1}{2} + \Delta$ and all other arms have mean $\tfrac{1}{2}$. Any algorithm that performs well on one instance must pull the other arms $\Omega(1/\Delta^2)$ times each to rule them out, incurring regret on that instance. By a change-of-measure argument (KL divergence between the instances), the expected number of pulls of the suboptimal arms is lower-bounded, giving regret $\Omega(\sqrt{KT})$ after optimizing $\Delta \approx \sqrt{K/T}$.
Why It Matters
This lower bound sets the bar: $\Theta(\sqrt{KT})$ regret is optimal up to constants. Algorithms achieving this rate are called minimax optimal. UCB and Thompson sampling both achieve this rate up to logarithmic factors.
Failure Mode
This is a worst-case bound. For "easy" instances where the gaps are large, much better regret is possible: $O\left(\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i}\right)$. The gap-dependent and gap-independent bounds address different questions (worst-case vs. instance-dependent).
UCB1 Regret Bound
Statement
The UCB1 algorithm selects at each round the arm with the highest upper confidence bound:
$$A_t = \arg\max_i \left( \hat{\mu}_i + \sqrt{\frac{2 \ln t}{N_i(t)}} \right)$$
where $\hat{\mu}_i$ is the empirical mean of arm $i$ and $N_i(t)$ is its pull count. UCB1 achieves:
$$R_T \leq \sum_{i: \Delta_i > 0} \left( \frac{8 \ln T}{\Delta_i} + O(1) \right)$$
This is $O\left(\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i}\right)$ in the gap-dependent form and $O(\sqrt{KT \log T})$ in the gap-independent form.
Intuition
UCB implements "optimism in the face of uncertainty." It pretends each arm is as good as its best plausible value given the data. Arms that have been pulled many times have tight confidence intervals, so their UCB is close to their true mean. Arms that have been pulled few times have wide intervals, so their UCB is high. This naturally balances exploration (pulling uncertain arms) and exploitation (pulling arms with high empirical means).
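A minimal sketch of UCB1 for Bernoulli arms, assuming rewards in $[0, 1]$ and a $\sqrt{2 \ln t / N_i(t)}$ confidence width (the instance below is hypothetical):

```python
import math
import random

def ucb1(means, T, seed=0):
    """Sketch of UCB1 on a Bernoulli bandit; `means` is the hidden instance."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # N_i(t): times arm i has been pulled
    sums = [0.0] * K        # total reward observed from arm i
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1     # initialization: pull each arm once
        else:
            # Optimism: empirical mean plus confidence width sqrt(2 ln t / N_i).
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.5, 0.7], T=5000)
print(counts)   # the optimal arm (index 2) should dominate the pull counts
```

Note there is no exploration parameter anywhere: the width shrinks as $N_i$ grows, so exploration tapers off automatically.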
Proof Sketch
A suboptimal arm $i$ is pulled only when its UCB exceeds $\mu^*$. Using Hoeffding's inequality, this requires either $\hat{\mu}_i$ to be an overestimate of $\mu_i$ (unlikely once $N_i$ is large) or $N_i$ to be small enough that the confidence width exceeds $\Delta_i / 2$. Solving gives $N_i(T) \leq \frac{8 \ln T}{\Delta_i^2} + O(1)$, which bounds the number of times arm $i$ is pulled.
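The last step of the sketch, made explicit (assuming rewards in $[0, 1]$):

```latex
% Hoeffding: P(|\hat\mu_i - \mu_i| \ge \epsilon) \le 2 e^{-2 N_i \epsilon^2};
% the width \epsilon = \sqrt{2 \ln t / N_i} makes this at most 2 t^{-4}.
% On the event that all confidence intervals hold, arm i can be chosen only while
\[
  \mu_i + 2\sqrt{\frac{2 \ln t}{N_i}} \;\ge\; \mu^*
  \quad\Longleftrightarrow\quad
  N_i \;\le\; \frac{8 \ln t}{\Delta_i^2},
\]
% contributing at most \Delta_i \cdot \frac{8 \ln T}{\Delta_i^2} = \frac{8 \ln T}{\Delta_i}
% regret per suboptimal arm.
```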
Why It Matters
UCB1 is the simplest algorithm achieving near-optimal regret. It requires no hyperparameters (no exploration probability, no temperature). The confidence width automatically adapts: arms that are pulled often get exploited; arms pulled rarely get explored.
Failure Mode
UCB1 has a $\sqrt{\log T}$ excess over the minimax optimal $\sqrt{KT}$. For very large $T$, this matters. Also, UCB1 is designed for rewards bounded in $[0, 1]$. For heavy-tailed distributions, robust variants (median-of-means UCB) are needed. For non-stationary environments, the confidence intervals are invalid and discounted or sliding-window UCB is required.
Thompson Sampling
Thompson Sampling
Thompson sampling maintains a posterior distribution over each arm's mean. At each round $t$:
- Sample $\tilde{\mu}_i$ from the posterior of each arm $i$
- Play $A_t = \arg\max_i \tilde{\mu}_i$
- Update the posterior of arm $A_t$ with the observed reward
For Bernoulli rewards with Beta priors: start with $\mathrm{Beta}(1, 1)$ for each arm. After observing $s_i$ successes and $f_i$ failures from arm $i$, the posterior is $\mathrm{Beta}(1 + s_i, 1 + f_i)$.
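The Beta-Bernoulli recipe above, as a minimal sketch (the arm means are hypothetical):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Sketch of Thompson sampling with Beta(1,1) priors on a Bernoulli bandit."""
    rng = random.Random(seed)
    K = len(means)
    successes = [0] * K
    failures = [0] * K
    for t in range(T):
        # Sample a plausible mean for each arm from its Beta posterior.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(K)]
        arm = samples.index(max(samples))     # play the arm with the highest sample
        if rng.random() < means[arm]:
            successes[arm] += 1               # posterior update: one more success
        else:
            failures[arm] += 1                # posterior update: one more failure
    return [successes[i] + failures[i] for i in range(K)]   # pull counts

counts = thompson_bernoulli([0.3, 0.5, 0.7], T=5000)
print(counts)
```

Uncertain arms occasionally draw high samples and get pulled; concentrated posteriors keep sampling near their true means, so pulls drift toward the best arm.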
Thompson Sampling Bayesian Regret
Statement
Thompson sampling achieves Bayesian regret:
$$\mathbb{E}[R_T] \leq C \sqrt{KT \log T}$$
for a constant $C$ independent of the prior. For Bernoulli bandits with Beta priors, it also achieves the gap-dependent bound:
$$R_T = O\left( \sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i} \right)$$
matching UCB1 up to constants.
Intuition
Arms with uncertain posteriors sometimes produce high samples, causing exploration. Arms with well-estimated posteriors produce samples near their true means, causing exploitation. The stochasticity of sampling automatically calibrates the exploration rate to the posterior uncertainty.
Proof Sketch
The key step is bounding the probability that Thompson sampling plays a suboptimal arm. This probability is at most the posterior probability that the suboptimal arm is optimal. As data accumulates, this posterior probability shrinks, limiting the number of suboptimal plays. The proof by Agrawal and Goyal (2012) uses a careful anti-concentration argument on the posterior.
Why It Matters
Thompson sampling is often the algorithm of choice in practice. It is simple to implement, adapts naturally to the reward structure, and empirically outperforms UCB on many problem instances despite having the same theoretical guarantees. It also generalizes naturally to structured bandits and contextual bandits.
Failure Mode
Thompson sampling requires a prior and a likelihood model. If the model is misspecified (e.g., assuming Gaussian rewards when they are heavy-tailed), the posterior concentrates on the wrong values and performance degrades. The Bayesian regret guarantee is with respect to the assumed prior, not worst-case.
Connection to RL and A/B Testing
Multi-armed bandits are the simplest case of reinforcement learning: one state, $K$ actions, immediate rewards. Full RL adds states and delayed rewards.
In A/B testing, the standard approach assigns users to variants uniformly and waits until statistical significance. Bandit algorithms (UCB, Thompson sampling) assign more users to better variants during the test, reducing total regret. The cost is that the statistical analysis is more complex.
Common Confusions
Bandit regret is not classification error
Regret measures the cumulative cost of exploration over rounds relative to always playing the best arm. Classification error measures the mistake rate at a single time point. A bandit algorithm with high regret in the first 100 rounds but optimal play afterward has low long-term average regret.
Epsilon-greedy is not minimax optimal
Epsilon-greedy (play the empirically best arm with probability $1 - \epsilon$, a uniformly random arm with probability $\epsilon$) with fixed $\epsilon$ pays $\Theta(\epsilon T)$ for exploration alone, so its regret grows linearly in $T$. Optimizing $\epsilon$ for the horizon gives roughly $O(T^{2/3})$ regret, but this requires knowing $T$ in advance. A decaying schedule $\epsilon_t \propto t^{-1/3}$ can match that rate anytime, but UCB and Thompson sampling achieve the minimax $\tilde{O}(\sqrt{KT})$ rate without any tuning.
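A quick illustration of the linear exploration cost, on a hypothetical instance with a fixed $\epsilon$ (a sketch, not a tuned baseline):

```python
import random

def epsilon_greedy(means, T, epsilon, seed=0):
    """Fixed-epsilon greedy sketch; returns pseudo-regret sum_t (mu* - mu_{A_t})."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    best = max(means)
    pseudo_regret = 0.0
    for t in range(T):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(K)                  # explore uniformly at random
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i])   # exploit
        pseudo_regret += best - means[arm]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return pseudo_regret

# With fixed epsilon, the uniform-exploration cost scales linearly with T:
r_small_T = epsilon_greedy([0.3, 0.5, 0.7], T=2000, epsilon=0.1)
r_large_T = epsilon_greedy([0.3, 0.5, 0.7], T=20000, epsilon=0.1)
print(r_small_T, r_large_T)   # regret grows roughly in proportion to T
```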
Exercises
Problem
You run a 3-armed bandit for $T$ rounds. The arms have true means $\mu_1 < \mu_2 < \mu_3$. What is the maximum possible cumulative regret? What is the regret of always pulling arm 2?
Problem
Derive the UCB1 confidence width. Starting from Hoeffding's inequality for bounded random variables, show that $\sqrt{\frac{2 \ln t}{N_i(t)}}$ is the correct width for a failure probability of $t^{-4}$ per arm per round.
Problem
Explain why the minimax lower bound is $\Omega(\sqrt{KT})$ but the UCB1 gap-dependent bound, $O\left(\sum_i \frac{\log T}{\Delta_i}\right)$, can be much better. When is each bound tighter? Construct an instance where the gap-dependent bound is $O(\log T)$ and one where it degrades to order $\sqrt{KT \log T}$.
References
Canonical:
- Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 6-8, 35-36
- Auer, Cesa-Bianchi, Fischer, Finite-time Analysis of the Multiarmed Bandit Problem (2002)
Current:
- Agrawal & Goyal, Analysis of Thompson Sampling for the Multi-armed Bandit Problem (2012)
- Russo et al., A Tutorial on Thompson Sampling (2018)
Next Topics
- Markov decision processes: extending bandits to sequential states and delayed rewards
- Policy gradient theorem: gradient-based optimization for sequential decision problems
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- No-Regret Learning (Layer 3)