
RL Theory

Exploration vs Exploitation

The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.

Core · Tier 2 · Stable · ~45 min

Why This Matters

An agent that only exploits will get stuck with suboptimal actions. An agent that only explores will never accumulate reward. Every RL algorithm must balance these two imperatives, and the quality of this balance determines performance.

In bandits, the tradeoff is clean: pull the arm you think is best (exploit) or pull an arm you are uncertain about (explore). In MDPs, it is harder: an action's value depends on the entire future trajectory, and visiting a new state may reveal information about many other states.

The exploration problem is unsolved in general. For tabular bandits and small MDPs, we have near-optimal algorithms (UCB, Thompson sampling). For deep RL with large state spaces, exploration remains the primary bottleneck.

The Tradeoff Formalized

In a $K$-armed bandit, at each round $t$ you choose an arm $A_t$ and observe reward $R_t$. The regret after $T$ rounds is:

$$\text{Regret}(T) = T \mu^* - \sum_{t=1}^T \mu_{A_t}$$

where $\mu^* = \max_a \mu_a$ is the mean reward of the best arm. Regret measures the cumulative cost of not always playing the optimal arm.

A purely greedy algorithm (always play the arm with highest estimated mean) can have linear regret: it may lock onto a suboptimal arm and never discover the best one. Any algorithm achieving sublinear regret must explore.
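The lock-in failure can be made concrete with a toy instance (an illustrative sketch, not from the source; rewards are made deterministic so the failure is guaranteed rather than merely possible):

```python
def greedy_bandit(T=1000):
    """2-armed bandit: arm 0 always pays 0.5, arm 1 always pays 0.9.
    Pure greedy with zero-initialized estimates ties toward arm 0,
    sees 0.5 > 0, and never tries arm 1: regret grows linearly in T."""
    q = [0.0, 0.0]          # estimated means
    n = [0, 0]              # pull counts
    regret = 0.0
    for _ in range(T):
        a = 0 if q[0] >= q[1] else 1      # greedy (ties -> arm 0)
        r = 0.5 if a == 0 else 0.9        # deterministic rewards
        n[a] += 1
        q[a] += (r - q[a]) / n[a]         # incremental mean update
        regret += 0.9 - r
    return regret

print(greedy_bandit())  # ~0.4 regret per step, so linear in T
```

Any amount of forced exploration would eventually sample arm 1 and escape; pure greedy never does.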

Exploration Strategies

Definition

Epsilon-Greedy

With probability $1 - \epsilon$, choose the action with highest estimated value (exploit). With probability $\epsilon$, choose a uniformly random action (explore).

$$A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(\{1, \ldots, K\}) & \text{with probability } \epsilon \end{cases}$$

Typically $\epsilon$ is decayed over time: $\epsilon_t = c / t$ ensures sufficient exploration early and near-greedy behavior later.

Epsilon-greedy is simple and widely used. Its weakness: it explores uniformly over all actions, wasting time on actions already known to be bad. It does not direct exploration toward uncertainty.
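A minimal selection rule with the $c/t$ decay might look like this (a sketch; the names and the exploration floor are illustrative choices, not from the source):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, c=1.0, floor=0.01):
    """epsilon_t = c / t, clipped at a small floor so exploration never
    stops entirely (the floor is a common practical tweak, not required)."""
    return max(floor, c / max(t, 1))
```

With `epsilon=0` this reduces to the purely greedy rule; the decay schedule front-loads exploration when estimates are least reliable.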

Definition

Boltzmann (Softmax) Exploration

Choose action $a$ with probability proportional to $\exp(Q_t(a) / \tau)$:

$$\pi(a) = \frac{\exp(Q_t(a) / \tau)}{\sum_{a'} \exp(Q_t(a') / \tau)}$$

The temperature $\tau > 0$ controls the explore/exploit balance. High $\tau$: near-uniform (exploration). Low $\tau$: near-greedy (exploitation). As $\tau \to 0$, this converges to the greedy policy.

Boltzmann exploration is smoother than epsilon-greedy: it directs more exploration toward high-value actions rather than exploring uniformly. But it does not account for uncertainty in the estimates. The softmax temperature $\tau$ plays the same role as in the cross-entropy loss and attention mechanisms: it controls the sharpness of the distribution over actions.
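The softmax policy is a few lines; splitting out the probabilities makes the temperature's effect easy to inspect (a sketch; names are illustrative):

```python
import math
import random

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values; subtract the max for numerical stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_action(q_values, tau, rng=random):
    """Sample an action from the softmax distribution over Q-values."""
    return rng.choices(range(len(q_values)),
                       weights=boltzmann_probs(q_values, tau))[0]
```

At `tau=0.01` the policy is effectively greedy; at `tau=100` it is nearly uniform.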

Definition

Upper Confidence Bound (UCB)

UCB1 selects the action with the highest upper confidence bound:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right]$$

where $N_t(a)$ is the number of times arm $a$ has been pulled and $c > 0$ is a constant (typically $c = \sqrt{2}$). The first term is exploitation; the second is an exploration bonus that is larger for actions played less often.

UCB is "optimism in the face of uncertainty." It treats each arm as if it might be as good as its upper confidence bound. Arms with few observations have wide confidence intervals and get explored. Arms with many observations and low means get ignored.
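UCB1 itself is a direct transcription of the rule (a sketch; an initialization pass pulls each arm once, since the bonus is undefined at $N_t(a) = 0$):

```python
import math

def ucb1_action(q, n, t, c=math.sqrt(2)):
    """Return the arm maximizing Q(a) + c * sqrt(log t / N(a)).
    Unpulled arms have an effectively infinite bonus, so pull them first."""
    for a, count in enumerate(n):
        if count == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
```

A rarely pulled arm can win on its bonus even with a lower mean estimate, which is exactly the optimism principle at work.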

Main Theorems

Theorem

UCB1 Regret Bound

Statement

The UCB1 algorithm with $c = \sqrt{2}$ achieves expected regret at most:

$$\mathbb{E}[\text{Regret}(T)] \leq \sum_{a: \mu_a < \mu^*} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a=1}^K \Delta_a$$

where $\Delta_a = \mu^* - \mu_a$ is the gap of arm $a$. The dominant term is $O\left(\sum_a \log T / \Delta_a\right)$.

Intuition

Each suboptimal arm $a$ is pulled approximately $\log T / \Delta_a^2$ times: enough to determine that its mean is at least $\Delta_a$ below the best arm. Arms with small gaps are harder to distinguish and require more pulls. The total regret from arm $a$ is then $\Delta_a \cdot O(\log T / \Delta_a^2) = O(\log T / \Delta_a)$.
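Plugging concrete numbers into the bound shows how the small-gap arm dominates (a sketch using the theorem's constants; the bandit instance is illustrative):

```python
import math

def ucb1_regret_bound(means, T):
    """Evaluate the UCB1 regret upper bound: sum over suboptimal arms of
    8 log T / gap, plus the (1 + pi^2/3) * sum-of-gaps constant term."""
    best = max(means)
    gaps = [best - m for m in means]
    dominant = sum(8 * math.log(T) / g for g in gaps if g > 0)
    constant = (1 + math.pi ** 2 / 3) * sum(gaps)
    return dominant + constant

# Gaps of 0.1 and 0.4: the near-optimal arm contributes most of the bound.
print(ucb1_regret_bound([0.5, 0.6, 0.2], T=10_000))
```

Doubling $T$ only adds a constant to the bound (logarithmic growth), while halving the smallest gap roughly doubles it.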

Proof Sketch

Decompose the number of times arm $a$ is pulled as $N_T(a) = \sum_{t=1}^T \mathbf{1}(A_t = a)$. Arm $a$ is pulled at time $t$ only if its UCB exceeds the best arm's UCB. Using Hoeffding's inequality, the probability that the best arm's mean is underestimated shrinks as $1/t^4$, and the probability that arm $a$'s mean is overestimated by more than $\Delta_a$ shrinks similarly after $O(\log T / \Delta_a^2)$ pulls. Sum the contributions.

Why It Matters

The $O(\log T)$ regret scaling is optimal up to constants: Lai and Robbins (1985) proved that no consistent algorithm can achieve $o(\log T)$ regret on every instance. UCB achieves the optimal rate without knowing the gaps $\Delta_a$ in advance. This is the gold standard for bandit algorithms.

Failure Mode

The bound depends on $1/\Delta_a$, which blows up when arms have nearly equal means. In the worst case over gaps, summing gives $O(\sqrt{KT \log T})$ (the minimax rate up to log factors). UCB also assumes stationarity; for non-stationary bandits, the confidence intervals must be modified (e.g., sliding-window UCB or discounted UCB).

Exploration Strategies Compared

| Strategy | Directs exploration? | Uses uncertainty? | Regret | Strengths | Weaknesses |
| --- | --- | --- | --- | --- | --- |
| Epsilon-greedy | No (uniform random) | No | Can be sublinear with decay | Simple, easy to implement, universal | Wastes exploration on known-bad arms |
| Boltzmann/Softmax | Partially (favors high-value) | No | Depends on temperature schedule | Smoother than epsilon-greedy | Temperature requires tuning; ignores uncertainty |
| UCB1 | Yes (high uncertainty) | Yes (confidence intervals) | $O(\sum_a \log T / \Delta_a)$ | Near-optimal, no tuning needed, deterministic | Assumes bounded rewards, stationary environment |
| Thompson sampling | Yes (samples from posterior) | Yes (Bayesian posterior) | $O(\sum_a \log T / \Delta_a)$ | Often empirically superior, extends to structured problems | Requires maintaining posterior, harder to analyze |
| Count-based (MDPs) | Yes (low visit count) | Indirectly (counts proxy for uncertainty) | Near-optimal in tabular MDPs | Principled, provable guarantees | Counts infeasible in large or continuous state spaces |
| Curiosity/RND | Yes (prediction error) | Indirectly (model error as proxy) | No formal guarantees | Scales to deep RL, large state spaces | "Noisy TV" problem; can get stuck on stochastic transitions |

Exploration in MDPs

Bandits have $K$ arms. MDPs have $|\mathcal{S}| \times |\mathcal{A}|$ state-action pairs, and visiting one state reveals information about transitions to other states. This makes exploration much harder.

Count-based exploration maintains visit counts $N(s, a)$ and adds an exploration bonus to the reward:

$$r^+(s, a) = r(s, a) + \frac{\beta}{\sqrt{N(s,a)}}$$

This encourages visiting state-action pairs that have been seen less often. For tabular MDPs, count-based bonuses achieve near-optimal regret bounds.
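In the tabular case the bonus is just a dictionary of counts (a minimal sketch; `beta` and the class name are illustrative):

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus: r+ = r + beta / sqrt(N(s, a))."""
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)   # visit counts N(s, a)

    def shaped_reward(self, s, a, r):
        """Increment the count for (s, a) and return the bonused reward."""
        self.counts[(s, a)] += 1
        return r + self.beta / math.sqrt(self.counts[(s, a)])
```

The bonus for a pair decays as $1/\sqrt{N}$, so frequently visited pairs contribute almost nothing while novel pairs stay attractive.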

Intrinsic motivation (curiosity) replaces visit counts with prediction error. If the agent cannot predict the outcome of an action well, that action is "interesting" and worth exploring. Formally, the exploration bonus is:

$$r^+(s, a) = \|f(s, a) - \hat{f}(s, a)\|$$

where $\hat{f}$ is a learned forward model. This scales to large state spaces where explicit counts are infeasible.

The problem: in stochastic environments, unpredictable transitions generate high prediction error forever, trapping the agent in "noisy TV" states that are unpredictable but uninformative. Random Network Distillation (RND) partially addresses this by measuring novelty through a fixed random network rather than through forward prediction.
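The RND idea can be caricatured with linear maps: a frozen random "target" embedding and a trained "predictor", where the bonus is their disagreement, which decays as a state is revisited. This is a toy sketch (real RND uses deep networks; all names and dimensions here are illustrative):

```python
import numpy as np

class RND:
    """Toy Random Network Distillation: a frozen random linear 'target'
    embedding and a trainable linear 'predictor'. The bonus is the squared
    prediction error, which shrinks for states visited repeatedly."""
    def __init__(self, state_dim, embed_dim=16, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(state_dim, embed_dim))  # never trained
        self.pred = np.zeros((state_dim, embed_dim))           # trained online
        self.lr = lr

    def bonus(self, s):
        """Exploration bonus for state s, plus one predictor update step."""
        err = s @ self.target - s @ self.pred    # embedding disagreement
        self.pred += self.lr * np.outer(s, err)  # gradient step on squared error
        return float(np.sum(err ** 2))
```

Because the target is fixed, the bonus measures how well the predictor has fit each state, not how predictable the environment is, which is what sidesteps the noisy-TV trap.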

Hard Exploration Problems

Not all environments are equally difficult to explore. The following taxonomy clarifies when exploration is the binding constraint.

Dense reward environments (e.g., CartPole, continuous control with shaped rewards) are easy to explore. Even epsilon-greedy works because the agent receives frequent feedback about action quality. The exploration problem is mild; the challenge is optimization.

Sparse reward environments (e.g., Montezuma's Revenge, goal-conditioned navigation) are hard because the agent must perform a long sequence of correct actions before receiving any reward signal. Random exploration is exponentially unlikely to find the first reward. The agent needs structured exploration: either intrinsic motivation to seek novel states, or demonstration data to bootstrap learning.

Deceptive reward environments are the hardest case. The environment has a local optimum that is easy to find and a global optimum that requires passing through a region of low reward. Greedy exploitation locks onto the local optimum. Even count-based exploration may not help if the path to the global optimum is narrow in state space. Hierarchical RL and option-based exploration address this by operating at multiple temporal scales: high-level options explore macro-strategies while low-level policies exploit within a chosen strategy.

The information-theoretic view of exploration frames the problem as maximizing information gain about the environment's dynamics or reward function. Each action's exploration value is the expected reduction in entropy of the agent's beliefs. This connects exploration to experimental design in statistics and to Bayesian estimation: the agent is running an adaptive experiment where each action is a query and each observation is data.

Connection to Active Learning

Exploration in RL is analogous to active learning in supervised learning. In active learning, the learner chooses which data points to label. The goal is the same: gather information efficiently by querying uncertain regions rather than sampling randomly.

The key difference: in RL, exploration has a cost (suboptimal actions reduce cumulative reward). In pool-based active learning, querying any point has the same cost. This makes RL exploration harder because you must balance information gain against immediate reward loss.

Common Confusions

Watch Out

Epsilon-greedy is not the same as random exploration

Epsilon-greedy is $1 - \epsilon$ greedy and $\epsilon$ random. With $\epsilon = 0.1$ and 10 actions, each non-greedy action gets probability $0.01$, which may be too low to discover a good arm quickly. The randomness is uniform over all actions and does not target uncertain actions.
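The arithmetic is worth spelling out (a quick illustrative computation; note the random $\epsilon$ mass is spread over all actions, including the greedy one):

```python
def epsilon_greedy_probs(epsilon, num_actions, greedy_action=0):
    """Per-action selection probability under epsilon-greedy."""
    uniform = epsilon / num_actions   # each action's share of the random mass
    return [uniform + (1.0 - epsilon if a == greedy_action else 0.0)
            for a in range(num_actions)]

probs = epsilon_greedy_probs(0.1, 10)
print(probs[0], probs[1])  # greedy arm gets 0.91, every other arm 0.01
```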

Watch Out

Thompson sampling vs UCB

Thompson sampling maintains a posterior distribution over each arm's mean and samples from it. UCB constructs a deterministic upper bound. Both achieve $O(\log T)$ regret. Thompson sampling is often empirically better and extends more naturally to structured problems, but its theoretical analysis is more complex. Both are "optimistic" in different senses: UCB uses a worst-case confidence bound, Thompson sampling uses a random sample that is optimistic on average.
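For Bernoulli rewards, Thompson sampling reduces to sampling from Beta posteriors (a minimal sketch assuming uniform Beta(1, 1) priors; names are illustrative):

```python
import random

def thompson_action(successes, failures, rng=random):
    """Beta-Bernoulli Thompson sampling: draw a plausible mean for each arm
    from its Beta(1 + s, 1 + f) posterior and play the largest draw."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```

Arms with little data have wide posteriors, so they occasionally produce the largest sample and get explored; well-estimated bad arms almost never do.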

Watch Out

Exploration is not just about bandits

In MDPs, a single exploratory action can reveal information about many states (by reaching a new part of the state space). This sequential structure makes MDP exploration qualitatively different from bandit exploration. Algorithms that treat each state independently (like per-state epsilon-greedy) can require exponentially more exploration than algorithms that plan to explore (like R-MAX or UCRL).

Exercises

ExerciseCore

Problem

An agent uses epsilon-greedy with $\epsilon = 0.1$ in a 4-armed bandit. The true arm means are $\mu = (0.2, 0.8, 0.5, 0.3)$. If the agent's current estimates correctly identify arm 2 as the best, what is the expected reward per step?

ExerciseAdvanced

Problem

Prove that a purely greedy algorithm (always choose $\arg\max_a Q_t(a)$, breaking ties randomly) can suffer linear regret. Construct a specific 2-armed bandit instance where this happens.

ExerciseCore

Problem

In UCB1, arm $a$ has been pulled $N_t(a) = 10$ times with average reward $Q_t(a) = 0.7$, and the current round is $t = 100$. Arm $b$ has been pulled $N_t(b) = 50$ times with $Q_t(b) = 0.75$. With $c = \sqrt{2}$, which arm does UCB1 select?

References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 2 and 8
  • Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 7-8

Current:

  • Auer, Cesa-Bianchi, Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002) for UCB1
  • Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation" (NeurIPS 2016)
  • Burda et al., "Exploration by Random Network Distillation" (ICLR 2019)

Last reviewed: April 2026
