Multi-Armed Bandits Theory
The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.
Why This Matters
The multi-armed bandit problem is the simplest formalization of the exploration-exploitation tradeoff. You have $K$ slot machines. Each pull gives a random reward from an unknown distribution. You want to maximize total reward over $T$ pulls. Pulling the best arm every time is optimal, but you do not know which arm is best. Every pull of a suboptimal arm costs you, but every pull of a not-yet-explored arm teaches you something.
Bandits sit at the foundation of reinforcement learning, clinical trial design, and adaptive A/B testing. Understanding the regret framework here is a prerequisite for understanding regret in full MDPs.
Formal Setup
K-Armed Bandit
A stochastic K-armed bandit consists of $K$ arms, each with an unknown reward distribution $\nu_i$ with mean $\mu_i$. At each round $t = 1, \dots, T$, the learner selects arm $A_t \in \{1, \dots, K\}$ and observes reward $X_t \sim \nu_{A_t}$, independent of past rewards given the arm choice.
Let $\mu^* = \max_i \mu_i$ be the mean of the best arm. Let $\Delta_i = \mu^* - \mu_i$ be the gap for arm $i$.
Cumulative Regret
The cumulative regret after $T$ rounds is:

$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$$

Equivalently, $R_T = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[N_i(T)]$, where $N_i(T)$ is the number of times arm $i$ is pulled.
Regret measures the total cost of not knowing the best arm from the start. An algorithm with sublinear regret ($R_T = o(T)$) learns the best arm eventually.
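The two expressions for $R_T$ above are equal: summing the per-round gap $\mu^* - \mu_{A_t}$ over rounds is the same as grouping by arm and weighting each pull count by its gap. A minimal numeric check, using an arbitrary random policy on made-up arm means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.7])          # arm means; mu* = 0.7
T = 1000
arms = rng.integers(0, 3, size=T)       # a uniformly random policy (for illustration only)

# Definition 1: T * mu* minus expected total reward (expectation taken
# over rewards, conditional on the realized arm sequence)
regret_v1 = T * mu.max() - mu[arms].sum()

# Definition 2: sum over arms of gap times pull count
gaps = mu.max() - mu
pulls = np.bincount(arms, minlength=3)
regret_v2 = (gaps * pulls).sum()

assert np.isclose(regret_v1, regret_v2)
```

The uniform policy here pulls each arm about a third of the time forever, so its regret grows linearly in $T$; sublinear regret requires concentrating pulls on the best arm.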
The UCB Algorithm
The Upper Confidence Bound (UCB1) algorithm balances exploration and exploitation by maintaining a confidence interval for each arm's mean.
At round $t$, pull the arm maximizing:

$$\mathrm{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{2 \ln t}{N_i(t)}}$$

where $\hat{\mu}_i(t)$ is the sample mean of arm $i$ and $N_i(t)$ is its pull count so far. The second term is the exploration bonus: arms pulled less frequently have wider confidence intervals and get explored more.
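The rule above can be sketched in a few lines. This is a minimal implementation for Bernoulli arms (the helper name `ucb1` and the simulation setup are mine, not from a library):

```python
import math
import random

def ucb1(means, T, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return per-arm pull counts."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # N_i(t): pulls of each arm
    sums = [0.0] * K          # running reward sums
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # pull each arm once to initialize
        else:
            # pick the arm with the largest upper confidence bound
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.4, 0.6], T=5000)
# the better arm (mean 0.6) should receive the large majority of pulls
```

Note the algorithm is fully deterministic given the observed rewards: no randomization is needed, only optimism.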
Main Theorems
UCB1 Regret Bound
Statement
The UCB1 algorithm achieves expected cumulative regret bounded by:

$$R_T \le \sum_{i:\Delta_i > 0} \left( \frac{8 \ln T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right) \Delta_i \right)$$

This is $O\big(\sum_{i:\Delta_i>0} \ln T / \Delta_i\big)$ for fixed gaps. In the worst case over gap configurations, this gives $O(\sqrt{KT \ln T})$.
Intuition
The bound says UCB pulls each suboptimal arm roughly $O(\ln T / \Delta_i^2)$ times. Arms with small gaps are hard to distinguish from the best arm, so they get pulled more. The total cost is the number of suboptimal pulls times their gap. The logarithmic dependence on $T$ means regret grows very slowly.
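To get a feel for the numbers, here is a back-of-envelope check of the rough $8 \ln T / \Delta_i^2$ pull count (a sketch of the bound's scaling, not the exact constant for any particular run):

```python
import math

T = 100_000
for gap in (0.5, 0.1, 0.02):
    pulls = 8 * math.log(T) / gap**2     # rough UCB1 pull count for one suboptimal arm
    print(f"gap={gap}: ~{pulls:,.0f} pulls, regret contribution ~{pulls * gap:,.0f}")
```

Notice that for the tiny gap 0.02 the formula exceeds $T$ itself: the instance-dependent bound is vacuous there, which is exactly the regime where the worst-case $O(\sqrt{KT \ln T})$ bound is the binding one.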
Proof Sketch
Fix a suboptimal arm $i$. After $n$ pulls of arm $i$, the confidence width is $\sqrt{2 \ln t / n}$. Once $n \gtrsim 8 \ln T / \Delta_i^2$, the upper confidence bound for arm $i$ falls below $\mu^*$ with high probability. So arm $i$ is pulled at most $O(\ln T / \Delta_i^2)$ times. The additional constant terms handle the low-probability tail events.
Why It Matters
UCB shows that a simple, deterministic algorithm achieves near-optimal regret without any Bayesian assumptions. The regret is logarithmic in $T$, matching the fundamental lower bound up to constants.
Failure Mode
The bound depends on $1/\Delta_i$ for each arm. When many arms have gaps close to zero, the bound becomes very large. In adversarial (non-stochastic) settings, UCB1 can suffer linear regret. Use EXP3 for adversarial bandits instead.
Lai-Robbins Lower Bound
Statement
For any policy that achieves regret $o(T^a)$ for every bandit instance and every $a > 0$, the expected number of pulls of each suboptimal arm $i$ satisfies:

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\ln T} \ge \frac{1}{\mathrm{KL}(\nu_i \,\|\, \nu^*)}$$

where $\mathrm{KL}(\nu_i \,\|\, \nu^*)$ is the KL divergence from the suboptimal arm's distribution to the best arm's distribution.
Intuition
You must pull each suboptimal arm at least $\Omega(\ln T)$ times, and the constant depends on how hard it is to distinguish that arm from the best arm (measured by KL divergence). Arms that look similar to the best arm require more exploration.
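For Bernoulli arms the KL term has a closed form, so the $\ln T / \mathrm{KL}$ pull-count floor is easy to tabulate. A small sketch (best arm fixed at mean 0.6; the helper and the example means are illustrative):

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai-Robbins: a suboptimal arm must be pulled roughly ln(T) / KL times,
# so arms close to the best arm (small KL) force far more exploration.
T = 100_000
for p in (0.3, 0.55, 0.59):
    kl = kl_bernoulli(p, 0.6)
    print(f"mu_i={p}: KL={kl:.5f}, pulls >= ~{math.log(T) / kl:,.0f}")
```

An arm with mean 0.3 is cheap to rule out; an arm with mean 0.59 is nearly indistinguishable from 0.6 and demands tens of thousands of pulls.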
Proof Sketch
Change-of-measure argument. If arm $i$ were actually the best arm (swap the means), a consistent policy would need to pull arm $i$ many times. The KL divergence quantifies how many samples are needed to distinguish the two cases.
Why It Matters
This is the fundamental information-theoretic lower bound for bandits. It shows that UCB-type algorithms (with $O(\ln T)$ regret) are rate-optimal. No algorithm can do better than $\Omega(\ln T)$.
Failure Mode
The bound assumes stochastic, stationary rewards from exponential families. For adversarial or non-stationary environments, different lower bounds apply (the $\Omega(\sqrt{KT})$ minimax regret for adversarial bandits).
Thompson Sampling
Thompson sampling takes a Bayesian approach. Maintain a posterior for each arm's mean $\mu_i$. At each round, sample $\tilde{\mu}_i$ from the posterior for each arm and pull $A_t = \arg\max_i \tilde{\mu}_i$.
For Bernoulli rewards with a Beta prior: start with $\mathrm{Beta}(1, 1)$ for each arm. After observing a reward of 1, update $\alpha \leftarrow \alpha + 1$; after a reward of 0, update $\beta \leftarrow \beta + 1$. Sample from each posterior and pull the arm with the highest sample.
Thompson sampling also achieves the Lai-Robbins lower bound asymptotically. Empirically, it often outperforms UCB because its exploration is naturally calibrated to the posterior uncertainty.
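The Beta-Bernoulli update described above fits in a few lines. A minimal sketch on simulated arms (the helper name and setup are mine):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Thompson sampling with Beta(1,1) priors on simulated Bernoulli arms."""
    rng = random.Random(seed)
    K = len(means)
    alpha = [1] * K           # posterior successes + 1
    beta = [1] * K            # posterior failures + 1
    pulls = [0] * K
    for _ in range(T):
        # sample a mean from each posterior; pull the arm with the largest sample
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(K)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.5, 0.7], T=5000)
```

As the posterior for a clearly worse arm concentrates below the best arm's posterior, that arm's samples almost never win the argmax, so exploration shuts off automatically.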
Contextual Bandits
Contextual Bandit
At each round $t$, the learner observes a context $x_t \in \mathcal{X}$, then selects arm $A_t$, and observes reward $X_t$ with mean $\mu_{A_t}(x_t)$. The optimal policy is $\pi^*(x) = \arg\max_i \mu_i(x)$.
Contextual bandits generalize the basic setting by conditioning arm rewards on side information. This is the formal model for personalized A/B testing: $x$ is the user's feature vector, arms are treatment variants, and the goal is to learn the best variant for each user type.
LinUCB assumes $\mu_i(x) = \langle \theta_i, x \rangle$ and maintains confidence ellipsoids for each $\theta_i$. It achieves $\tilde{O}(d\sqrt{T})$ regret, where $d$ is the context dimension.
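A single LinUCB decision step can be sketched as follows. This is a minimal illustration of the idea (ridge estimate plus an ellipsoid-width bonus), not a full implementation; the function name, `alpha` scaling, and state layout are assumptions:

```python
import numpy as np

def linucb_choose(x, A_list, b_list, alpha=1.0):
    """One LinUCB step: pick the arm with the largest optimistic linear estimate.

    A_list[i] = I + sum of x x^T over past pulls of arm i  (d x d),
    b_list[i] = sum of reward * x over past pulls of arm i (d,).
    """
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                    # ridge-regression estimate of theta_i
        bonus = alpha * np.sqrt(x @ A_inv @ x)   # confidence-ellipsoid width along x
        scores.append(theta_hat @ x + bonus)
    return int(np.argmax(scores))

# fresh state: identity designs, zero reward sums
d, K = 3, 2
A_list = [np.eye(d) for _ in range(K)]
b_list = [np.zeros(d) for _ in range(K)]
arm = linucb_choose(np.array([1.0, 0.0, 0.5]), A_list, b_list)
```

After observing reward `r` for the chosen arm, one would update `A_list[arm] += np.outer(x, x)` and `b_list[arm] += r * x`; the bonus for that arm then shrinks along directions it has seen.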
Common Confusions
Regret is not the same as simple regret
Cumulative regret measures the total cost of exploration across all rounds. Simple regret measures the quality of the arm you recommend after $T$ rounds. Minimizing cumulative regret requires balancing explore and exploit at every step. Minimizing simple regret allows pure exploration followed by a single recommendation. Different objectives, different algorithms.
Bandits are not full reinforcement learning
Bandits have no state transitions. The reward of pulling arm $i$ at round $t$ does not depend on past arm choices. Full RL adds state, where actions change the environment. Bandits are the stateless special case of MDPs.
Canonical Examples
A/B testing as a 2-armed bandit
You test two website layouts. Layout A has conversion rate 5%, layout B has conversion rate 7%. Each visitor is a round. A traditional A/B test uses a 50/50 split for a fixed sample size, then picks the winner. This incurs linear regret over the test because half of visitors see the worse layout for its entire duration. A bandit algorithm (UCB or Thompson sampling) shifts traffic toward B as evidence accumulates, achieving $O(\ln T)$ regret instead.
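This comparison is easy to simulate. A minimal sketch, assuming the 5%/7% conversion rates above and treating the fixed 50/50 split as running for all $T$ visitors (the helper name and seed are mine):

```python
import random

def ab_regret(T, p_a=0.05, p_b=0.07, seed=0):
    """Expected regret of a fixed 50/50 split vs. Thompson sampling."""
    rng = random.Random(seed)
    # Fixed split: half the visitors see the worse layout for all T rounds.
    fixed = (T / 2) * (p_b - p_a)

    # Thompson sampling with Beta(1,1) priors on each layout.
    a = [1, 1]
    b = [1, 1]
    regret = 0.0
    for _ in range(T):
        arm = 0 if rng.betavariate(a[0], b[0]) > rng.betavariate(a[1], b[1]) else 1
        p = (p_a, p_b)[arm]
        r = 1 if rng.random() < p else 0
        a[arm] += r
        b[arm] += 1 - r
        regret += p_b - p            # expected per-round regret of the chosen arm
    return fixed, regret

fixed, ts = ab_regret(20_000)
```

With $T = 20{,}000$ visitors the fixed split loses $10{,}000 \times 0.02 = 200$ expected conversions; the Thompson run's regret typically comes out far lower, since traffic drifts to layout B once the posteriors separate.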
Key Takeaways
- Cumulative regret $R_T = T\mu^* - \mathbb{E}[\sum_t X_t]$ is the central performance measure
- UCB1 achieves $O(\sqrt{KT \ln T})$ worst-case regret and $O(\sum_i \ln T / \Delta_i)$ instance-dependent regret
- Lai-Robbins lower bound shows $\Omega(\ln T)$ regret is unavoidable
- Thompson sampling matches the lower bound and is often the practical default
- Contextual bandits extend the framework to side information (personalization)
Exercises
Problem
Consider a 2-armed bandit where arm 1 has mean 0.6 and arm 2 has mean 0.4. What is the expected cumulative regret after $T$ rounds of a policy that pulls each arm with equal probability at every round?
Problem
Show that the UCB1 exploration bonus ensures that $\mu_i \le \hat{\mu}_i(t) + \sqrt{2 \ln t / N_i(t)}$ holds for all arms and all rounds simultaneously with high probability (using the Hoeffding bound).
References
Canonical:
- Lattimore & Szepesvári, Bandit Algorithms (2020), Chapters 7-8 (UCB), Chapter 36 (Thompson Sampling)
- Lai & Robbins, "Asymptotically Efficient Adaptive Allocation Rules" (1985)
Current:
- Slivkins, Introduction to Multi-Armed Bandits (2019), Chapters 1-4
- Russo et al., "A Tutorial on Thompson Sampling" (2018)
Next Topics
- Markov decision processes: bandits with state transitions
- Policy gradient theorem: gradient-based optimization when actions affect future states
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Exploration vs Exploitation (Layer 2)