RL Theory
Exploration vs Exploitation
The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.
Why This Matters
An agent that only exploits will get stuck with suboptimal actions. An agent that only explores will never accumulate reward. Every RL algorithm must balance these two imperatives, and the quality of this balance determines performance.
In bandits, the tradeoff is clean: pull the arm you think is best (exploit) or pull an arm you are uncertain about (explore). In MDPs, it is harder: an action's value depends on the entire future trajectory, and visiting a new state may reveal information about many other states.
The exploration problem is unsolved in general. For tabular bandits and small MDPs, we have near-optimal algorithms (UCB, Thompson sampling). For deep RL with large state spaces, exploration remains the primary bottleneck.
The Tradeoff Formalized
In a $K$-armed bandit, at each round $t$ you choose an arm $a_t$ and observe reward $r_t$. The regret after $T$ rounds is:

$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]$$

where $\mu^* = \max_a \mu_a$ is the mean reward of the best arm. Regret measures the cumulative cost of not always playing the optimal arm.
A purely greedy algorithm (always play the arm with highest estimated mean) can have linear regret: it may lock onto a suboptimal arm and never discover the best one. Any algorithm achieving sublinear regret must explore.
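A minimal simulation of this failure mode (pure Python; the arm means are illustrative). If the better arm's first pull happens to pay zero while the worse arm's pays one, the greedy rule can lock onto the worse arm for the rest of the horizon:

```python
import random

def run_greedy(means, horizon, seed=0):
    """Pure greedy on a Bernoulli bandit: always pull the arm with the
    highest empirical mean (first index wins ties; unpulled arms first)."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    regret = 0.0
    best = max(means)
    for _ in range(horizon):
        # empirical means; unpulled arms default to +inf so each arm is tried once
        est = [s / c if c > 0 else float("inf") for s, c in zip(sums, counts)]
        a = est.index(max(est))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
        regret += best - means[a]
    return regret

# With stochastic rewards, greedy can lock onto the 0.4 arm and accrue
# regret that grows linearly in the horizon (depends on the early draws).
for T in (1000, 2000):
    print(T, run_greedy([0.4, 0.6], T, seed=1))
```

With deterministic rewards (means 0 and 1), greedy pays exactly one step of regret for trying the bad arm once and then exploits correctly forever; the danger is entirely in noisy early estimates.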
Exploration Strategies
Epsilon-Greedy
With probability $1 - \epsilon$, choose the action with the highest estimated value (exploit). With probability $\epsilon$, choose a uniformly random action (explore).
Typically $\epsilon$ is decayed over time: a schedule such as $\epsilon_t \propto 1/t$ ensures sufficient exploration early and near-greedy behavior later.
Epsilon-greedy is simple and widely used. Its weakness: it explores uniformly over all actions, wasting time on actions already known to be bad. It does not direct exploration toward uncertainty.
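The rule above fits in a few lines (a sketch; the Q-values and $\epsilon$ are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore: uniform over all arms
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

rng = random.Random(0)
q = [0.1, 0.5, 0.3]
# with epsilon = 0 the choice is always the greedy arm (index 1)
actions = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]
print(set(actions))  # → {1}
```

Note the exploration branch ignores the Q-values entirely, which is exactly the weakness discussed above.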
Boltzmann (Softmax) Exploration
Choose action $a$ with probability proportional to $\exp(Q(a)/\tau)$:

$$\pi(a) = \frac{\exp(Q(a)/\tau)}{\sum_{a'} \exp(Q(a')/\tau)}$$

The temperature $\tau$ controls the explore/exploit balance. High $\tau$: near-uniform (exploration). Low $\tau$: near-greedy (exploitation). As $\tau \to 0$, this converges to the greedy policy.
Boltzmann exploration is smoother than epsilon-greedy: it directs more exploration toward high-value actions rather than exploring uniformly. But it does not account for uncertainty in the estimates. The softmax temperature plays the same role as in the cross-entropy loss and attention mechanisms: it controls the sharpness of the distribution over actions.
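A short sketch of the softmax distribution and the temperature's effect (the Q-values and temperatures are illustrative):

```python
import math

def softmax_probs(q_values, tau):
    """Boltzmann distribution over actions; tau > 0 is the temperature."""
    m = max(q_values)                     # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 0.5]
print(softmax_probs(q, tau=10.0))  # high tau: near-uniform
print(softmax_probs(q, tau=0.1))   # low tau: mass concentrates on the best action
```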
Upper Confidence Bound (UCB)
UCB1 selects the action with the highest upper confidence bound:

$$a_t = \arg\max_a \left[\hat{\mu}_a + c\sqrt{\frac{\ln t}{N_a(t)}}\right]$$

where $N_a(t)$ is the number of times arm $a$ has been pulled, $\hat{\mu}_a$ is its empirical mean reward, and $c$ is a constant (typically $c = \sqrt{2}$). The first term is exploitation; the second is an exploration bonus that grows for actions played less often.
UCB is "optimism in the face of uncertainty." It treats each arm as if it might be as good as its upper confidence bound. Arms with few observations have wide confidence intervals and get explored. Arms with many observations and low means get ignored.
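A minimal UCB1 sketch on a Bernoulli bandit (the arm means and horizon are illustrative):

```python
import math
import random

def ucb1(means, horizon, c=math.sqrt(2), seed=0):
    """UCB1 on a Bernoulli bandit; returns per-arm pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull each arm once to initialize the estimates
        else:
            # empirical mean plus exploration bonus
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + c * math.sqrt(math.log(t) / counts[i]))
        sums[a] += 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts)  # the best arm (index 2) should receive the vast majority of pulls
```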
Main Theorems
UCB1 Regret Bound
Statement
The UCB1 algorithm with $c = \sqrt{2}$ achieves expected regret at most:

$$R_T \le \sum_{a : \Delta_a > 0} \left(\frac{8 \ln T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right)\Delta_a\right)$$

where $\Delta_a = \mu^* - \mu_a$ is the gap of arm $a$. The dominant term is $O\!\left(\sum_a \frac{\ln T}{\Delta_a}\right)$.
Intuition
Each suboptimal arm $a$ is pulled approximately $O(\ln T / \Delta_a^2)$ times: enough to determine that its mean is at least $\Delta_a$ below the best arm. Arms with small gaps are harder to distinguish and require more pulls. The total regret from arm $a$ is then $\Delta_a \cdot O(\ln T / \Delta_a^2) = O(\ln T / \Delta_a)$.
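Plugging numbers into this intuition (the horizon and gaps below are illustrative, and constants are omitted, so these are order-of-magnitude figures only):

```python
import math

# Approximate pulls needed to rule out a suboptimal arm with gap Delta
# by horizon T, and that arm's resulting regret contribution.
T = 10_000
for delta in (0.5, 0.1, 0.02):
    pulls = math.log(T) / delta**2
    print(f"gap {delta}: ~{pulls:.0f} pulls, ~{delta * pulls:.1f} regret")
```

Smaller gaps need quadratically more pulls, but each wasted pull costs less, so the net regret contribution scales as $1/\Delta_a$, matching the bound.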
Proof Sketch
Decompose the number of times arm $a$ is pulled as $N_a(T) = \sum_{t=1}^{T} \mathbb{1}[a_t = a]$. Arm $a$ is pulled at time $t$ only if its UCB exceeds the best arm's UCB. Using Hoeffding's inequality, the probability that the best arm's mean is underestimated shrinks as $t^{-4}$, and the probability that arm $a$'s mean is overestimated by more than $\Delta_a/2$ shrinks similarly once arm $a$ has been pulled $O(\ln T / \Delta_a^2)$ times. Sum the contributions.
Why It Matters
The $O(\ln T)$ regret scaling is optimal up to constants: Lai and Robbins (1985) proved that no consistent algorithm can achieve $o(\ln T)$ regret. UCB achieves the optimal rate without knowing the gaps in advance. This is the gold standard for bandit algorithms.
Failure Mode
The bound depends on $1/\Delta_a$, which blows up when arms have nearly equal means. In the worst case over gaps, summing gives $O(\sqrt{KT \ln T})$ (the minimax rate up to log factors). UCB also assumes stationarity; for non-stationary bandits, the confidence intervals must be modified (e.g., sliding-window UCB or discounted UCB).
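One ingredient of the sliding-window fix can be sketched in a few lines (a simplification: full sliding-window UCB would also use the within-window pull count in the confidence term):

```python
from collections import deque

class SlidingWindowMean:
    """Empirical mean over the last `window` rewards, for non-stationary arms.
    Old observations fall out of the buffer, so the estimate tracks drift."""
    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def update(self, reward):
        self.buf.append(reward)

    def mean(self):
        return sum(self.buf) / len(self.buf) if self.buf else 0.0

m = SlidingWindowMean(window=3)
for r in (1.0, 1.0, 1.0, 0.0, 0.0, 0.0):
    m.update(r)
print(m.mean())  # → 0.0 (only the last 3 rewards are retained)
```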
Exploration Strategies Compared
| Strategy | Directs exploration? | Uses uncertainty? | Regret | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Epsilon-greedy | No (uniform random) | No | Can be sublinear with decay | Simple, easy to implement, universal | Wastes exploration on known-bad arms |
| Boltzmann/Softmax | Partially (favors high-value) | No | Depends on temperature schedule | Smoother than epsilon-greedy | Temperature requires tuning; ignores uncertainty |
| UCB1 | Yes (high uncertainty) | Yes (confidence intervals) | $O(\ln T)$, near-optimal | No tuning needed, deterministic | Assumes bounded rewards, stationary environment |
| Thompson sampling | Yes (samples from posterior) | Yes (Bayesian posterior) | $O(\ln T)$, near-optimal | Often empirically superior, extends to structured problems | Requires maintaining posterior, harder to analyze |
| Count-based (MDPs) | Yes (low visit count) | Indirectly (counts proxy for uncertainty) | Near-optimal in tabular MDPs | Principled, provable guarantees | Counts are infeasible in large or continuous state spaces |
| Curiosity/RND | Yes (prediction error) | Indirectly (model error as proxy) | No formal guarantees | Scales to deep RL, large state spaces | "Noisy TV" problem; can get stuck on stochastic transitions |
Exploration in MDPs
Bandits have $K$ arms. MDPs have $|\mathcal{S}| \times |\mathcal{A}|$ state-action pairs, and visiting one state reveals information about transitions to other states. This makes exploration much harder.
Count-based exploration maintains visit counts $N(s, a)$ and adds an exploration bonus to the reward:

$$r^+(s, a) = r(s, a) + \frac{\beta}{\sqrt{N(s, a)}}$$

This encourages visiting state-action pairs that have been seen less often. For tabular MDPs, count-based bonuses achieve near-optimal regret bounds.
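A minimal sketch of a count-based bonus for tabular states (the $\beta/\sqrt{N}$ form is one common choice, and $\beta$ is a tunable scale, not a value fixed by the text):

```python
import math
from collections import defaultdict

class CountBonus:
    """Augment rewards with a count-based bonus: r+ = r + beta / sqrt(N(s, a))."""
    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def augmented_reward(self, state, action, reward):
        self.counts[(state, action)] += 1
        return reward + self.beta / math.sqrt(self.counts[(state, action)])

bonus = CountBonus(beta=1.0)
print(bonus.augmented_reward("s0", "a0", 0.0))  # first visit: bonus 1.0
print(bonus.augmented_reward("s0", "a0", 0.0))  # second visit: bonus 1/sqrt(2)
```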
Intrinsic motivation (curiosity) replaces visit counts with prediction error. If the agent cannot predict the outcome of an action well, that action is "interesting" and worth exploring. Formally, the exploration bonus is:

$$b(s, a) = \left\|\hat{f}(s, a) - s'\right\|^2$$

where $\hat{f}$ is a learned forward model predicting the next state $s'$. This scales to large state spaces where explicit counts are infeasible.
The problem: in stochastic environments, unpredictable transitions generate high prediction error forever, trapping the agent in "noisy TV" states that are unpredictable but uninformative. Random Network Distillation (RND) partially addresses this by measuring novelty through a fixed random network rather than through forward prediction.
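A toy sketch of the RND idea. The dimensions, learning rate, and the linear "networks" are illustrative assumptions for brevity, not the actual RND architecture; the point is only that repeated visits to the same state drive the bonus toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random target network and a trainable predictor. The novelty bonus
# is the predictor's error on the target's features: familiar states are
# predicted well (low bonus), novel states are not (high bonus).
D_STATE, D_FEAT = 8, 16
W_target = rng.normal(size=(D_STATE, D_FEAT))   # frozen random features
W_pred = np.zeros((D_STATE, D_FEAT))            # trained online

def rnd_bonus(state, lr=0.05):
    global W_pred
    target = state @ W_target
    pred = state @ W_pred
    err = pred - target
    bonus = float(np.mean(err ** 2))
    W_pred -= lr * np.outer(state, 2 * err / err.size)  # one SGD step on MSE
    return bonus

s = rng.normal(size=D_STATE)
bonuses = [rnd_bonus(s) for _ in range(200)]
print(bonuses[0] > bonuses[-1])  # → True: repeated visits drive the bonus down
```

Because the target is a fixed deterministic function of the state, the predictor can always fit it given enough visits; this is what sidesteps the noisy-TV trap of forward-model prediction error.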
Hard Exploration Problems
Not all environments are equally difficult to explore. The following taxonomy clarifies when exploration is the binding constraint.
Dense reward environments (e.g., CartPole, continuous control with shaped rewards) are easy to explore. Even epsilon-greedy works because the agent receives frequent feedback about action quality. The exploration problem is mild; the challenge is optimization.
Sparse reward environments (e.g., Montezuma's Revenge, goal-conditioned navigation) are hard because the agent must perform a long sequence of correct actions before receiving any reward signal. Random exploration is exponentially unlikely to find the first reward. The agent needs structured exploration: either intrinsic motivation to seek novel states, or demonstration data to bootstrap learning.
Deceptive reward environments are the hardest case. The environment has a local optimum that is easy to find and a global optimum that requires passing through a region of low reward. Greedy exploitation locks onto the local optimum. Even count-based exploration may not help if the path to the global optimum is narrow in state space. Hierarchical RL and option-based exploration address this by operating at multiple temporal scales: high-level options explore macro-strategies while low-level policies exploit within a chosen strategy.
The information-theoretic view of exploration frames the problem as maximizing information gain about the environment's dynamics or reward function. Each action's exploration value is the expected reduction in entropy of the agent's beliefs. This connects exploration to experimental design in statistics and to Bayesian estimation: the agent is running an adaptive experiment where each action is a query and each observation is data.
Connection to Active Learning
Exploration in RL is analogous to active learning in supervised learning. In active learning, the learner chooses which data points to label. The goal is the same: gather information efficiently by querying uncertain regions rather than sampling randomly.
The key difference: in RL, exploration has a cost (suboptimal actions reduce cumulative reward). In pool-based active learning, querying any point has the same cost. This makes RL exploration harder because you must balance information gain against immediate reward loss.
Common Confusions
Epsilon-greedy is not the same as random exploration
Epsilon-greedy is greedy and random. With 10 actions, each non-greedy action gets probability $\epsilon / 10$ (e.g., $0.01$ for $\epsilon = 0.1$), which may be too low to discover a good arm quickly. The randomness is uniform over all actions and does not target uncertain actions.
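As a quick arithmetic check (assuming $\epsilon = 0.1$, a common default rather than a value fixed by the text):

```python
# Per-action selection probabilities under epsilon-greedy with a unique greedy arm:
# the greedy arm gets 1 - eps plus its share of the uniform draw.
eps, n_actions = 0.1, 10
p_greedy = (1 - eps) + eps / n_actions   # ≈ 0.91
p_other = eps / n_actions                # ≈ 0.01 per non-greedy arm
print(p_greedy, p_other)
```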
Thompson sampling vs UCB
Thompson sampling maintains a posterior distribution over each arm's mean and samples from it. UCB constructs a deterministic upper bound. Both achieve $O(\ln T)$ regret. Thompson sampling is often empirically better and extends more naturally to structured problems, but its theoretical analysis is more complex. Both are "optimistic" in different senses: UCB uses a worst-case confidence bound, Thompson sampling uses a random sample that is optimistic on average.
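A minimal Thompson sampling sketch for Bernoulli arms with Beta(1, 1) priors (the arm means and horizon are illustrative):

```python
import random

def thompson_bernoulli(means, horizon, seed=0):
    """Thompson sampling on a Bernoulli bandit; returns per-arm pull counts."""
    rng = random.Random(seed)
    k = len(means)
    alpha = [1] * k  # Beta posterior: successes + 1
    beta = [1] * k   # Beta posterior: failures + 1
    counts = [0] * k
    for _ in range(horizon):
        # sample a plausible mean for each arm, then act greedily on the samples
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = samples.index(max(samples))
        r = rng.random() < means[a]
        alpha[a] += int(r)
        beta[a] += int(not r)
        counts[a] += 1
    return counts

print(thompson_bernoulli([0.3, 0.7], horizon=2000))  # the better arm dominates
```

As an arm's posterior concentrates around a low mean, its samples stop winning the argmax, so exploration fades automatically without any schedule.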
Exploration is not just about bandits
In MDPs, a single exploratory action can reveal information about many states (by reaching a new part of the state space). This sequential structure makes MDP exploration qualitatively different from bandit exploration. Algorithms that treat each state independently (like per-state epsilon-greedy) can require exponentially more exploration than algorithms that plan to explore (like R-MAX or UCRL).
Exercises
Problem
An agent uses epsilon-greedy with exploration rate $\epsilon$ in a 4-armed bandit with true arm means $\mu_1, \mu_2, \mu_3, \mu_4$. If the agent's current estimates correctly identify arm 2 as the best, express the expected reward per step in terms of $\epsilon$ and the $\mu_i$.
Problem
Prove that a purely greedy algorithm (always choose $\arg\max_a \hat{\mu}_a$, breaking ties randomly) can suffer linear regret. Construct a specific 2-armed bandit instance where this happens.
Problem
In UCB1, arm 1 has been pulled $N_1$ times with average reward $\hat{\mu}_1$, and the current round is $t$. Arm 2 has been pulled $N_2$ times with average reward $\hat{\mu}_2$. With exploration constant $c$, write the condition under which UCB1 selects arm 2 rather than arm 1.
References
Canonical:
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 2 and 8
- Lattimore & Szepesvári, Bandit Algorithms (2020), Chapters 7-8
Current:
- Auer, Cesa-Bianchi, Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002) for UCB1
- Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation" (NeurIPS 2016)
- Burda et al., "Exploration by Random Network Distillation" (ICLR 2019)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Multi-Armed Bandits Theory (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)