
RL Theory

Exploration vs Exploitation

The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.

Core · Tier 2 · Stable · ~45 min

Why This Matters

An agent that only exploits will get stuck with suboptimal actions. An agent that only explores will never accumulate reward. Every RL algorithm must balance these two imperatives, and the quality of this balance determines performance.

In bandits, the tradeoff is clean: pull the arm you think is best (exploit) or pull an arm you are uncertain about (explore). In MDPs, it is harder: an action's value depends on the entire future trajectory, and visiting a new state may reveal information about many other states.

The exploration problem is unsolved in general. For tabular bandits and small MDPs, we have near-optimal algorithms (UCB, Thompson sampling). For deep RL with large state spaces, exploration remains the primary bottleneck.

The Tradeoff Formalized

In a $K$-armed bandit, at each round $t$ you choose an arm $A_t$ and observe reward $R_t$. The regret after $T$ rounds is:

$$\text{Regret}(T) = T \mu^* - \sum_{t=1}^T \mu_{A_t}$$

where $\mu^* = \max_a \mu_a$ is the mean reward of the best arm. Regret measures the cumulative cost of not always playing the optimal arm.

A purely greedy algorithm (always play the arm with highest estimated mean) can have linear regret: it may lock onto a suboptimal arm and never discover the best one. Any algorithm achieving sublinear regret must explore.
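The lock-in failure can be made concrete with a toy instance (an illustrative sketch, not from the source; rewards are made deterministic so the failure is guaranteed rather than merely possible):

```python
def greedy_bandit(T=1000):
    """2-armed bandit: arm 0 always pays 0.5, arm 1 always pays 0.9.
    Pure greedy with zero-initialized estimates ties toward arm 0,
    sees 0.5 > 0, and never tries arm 1: regret grows linearly in T."""
    q = [0.0, 0.0]          # estimated means
    n = [0, 0]              # pull counts
    regret = 0.0
    for _ in range(T):
        a = 0 if q[0] >= q[1] else 1      # greedy (ties -> arm 0)
        r = 0.5 if a == 0 else 0.9        # deterministic rewards
        n[a] += 1
        q[a] += (r - q[a]) / n[a]         # incremental mean update
        regret += 0.9 - r
    return regret

print(greedy_bandit())  # ~0.4 regret per step, so linear in T
```

Any amount of forced exploration would eventually sample arm 1 and escape; pure greedy never does.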

Exploration Strategies

Definition

Epsilon-Greedy

With probability $1 - \epsilon$, choose the action with highest estimated value (exploit). With probability $\epsilon$, choose a uniformly random action (explore).

$$A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(\{1, \ldots, K\}) & \text{with probability } \epsilon \end{cases}$$

Typically $\epsilon$ is decayed over time: $\epsilon_t = c / t$ ensures sufficient exploration early and near-greedy behavior later.

Epsilon-greedy is simple and widely used. Its weakness: it explores uniformly over all actions, wasting time on actions already known to be bad. It does not direct exploration toward uncertainty.
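A minimal selection rule with the $c/t$ decay might look like this (a sketch; the names and the exploration floor are illustrative choices, not from the source):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, c=1.0, floor=0.01):
    """epsilon_t = c / t, clipped at a small floor so exploration never
    stops entirely (the floor is a common practical tweak, not required)."""
    return max(floor, c / max(t, 1))
```

With `epsilon=0` this reduces to the purely greedy rule; the decay schedule front-loads exploration when estimates are least reliable.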

Definition

Boltzmann (Softmax) Exploration

Choose action $a$ with probability proportional to $\exp(Q_t(a) / \tau)$:

$$\pi(a) = \frac{\exp(Q_t(a) / \tau)}{\sum_{a'} \exp(Q_t(a') / \tau)}$$

The temperature $\tau > 0$ controls the explore/exploit balance. High $\tau$: near-uniform (exploration). Low $\tau$: near-greedy (exploitation). As $\tau \to 0$, this converges to the greedy policy.

Boltzmann exploration is smoother than epsilon-greedy: it directs more exploration toward high-value actions rather than exploring uniformly. But it does not account for uncertainty in the estimates. The softmax temperature $\tau$ plays the same role as in the cross-entropy loss and attention mechanisms: it controls the sharpness of the distribution over actions.
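The softmax policy is a few lines; splitting out the probabilities makes the temperature's effect easy to inspect (a sketch; names are illustrative):

```python
import math
import random

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values; subtract the max for numerical stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_action(q_values, tau, rng=random):
    """Sample an action from the softmax distribution over Q-values."""
    return rng.choices(range(len(q_values)),
                       weights=boltzmann_probs(q_values, tau))[0]
```

At `tau=0.01` the policy is effectively greedy; at `tau=100` it is nearly uniform.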

Definition

Upper Confidence Bound (UCB)

UCB1 selects the action with the highest upper confidence bound:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right]$$

where $N_t(a)$ is the number of times arm $a$ has been pulled and $c > 0$ is a constant (typically $c = \sqrt{2}$). The first term is exploitation; the second is an exploration bonus that is larger for actions played less often.

UCB is "optimism in the face of uncertainty." It treats each arm as if it might be as good as its upper confidence bound. Arms with few observations have wide confidence intervals and get explored. Arms with many observations and low means get ignored.
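UCB1 itself is a direct transcription of the rule (a sketch; an initialization pass pulls each arm once, since the bonus is undefined at $N_t(a) = 0$):

```python
import math

def ucb1_action(q, n, t, c=math.sqrt(2)):
    """Return the arm maximizing Q(a) + c * sqrt(log t / N(a)).
    Unpulled arms have an effectively infinite bonus, so pull them first."""
    for a, count in enumerate(n):
        if count == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
```

A rarely pulled arm can win on its bonus even with a lower mean estimate, which is exactly the optimism principle at work.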

Main Theorems

Theorem

UCB1 Regret Bound

Statement

The UCB1 algorithm with $c = \sqrt{2}$ achieves expected regret at most:

$$\mathbb{E}[\text{Regret}(T)] \leq \sum_{a: \mu_a < \mu^*} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a=1}^K \Delta_a$$

where $\Delta_a = \mu^* - \mu_a$ is the gap of arm $a$. The dominant term is $O\left(\sum_a \log T / \Delta_a\right)$.

Intuition

Each suboptimal arm $a$ is pulled approximately $\log T / \Delta_a^2$ times: enough to determine that its mean is at least $\Delta_a$ below the best arm. Arms with small gaps are harder to distinguish and require more pulls. The total regret from arm $a$ is then $\Delta_a \cdot O(\log T / \Delta_a^2) = O(\log T / \Delta_a)$.
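Plugging concrete numbers into the bound shows how the small-gap arm dominates (a sketch using the theorem's constants; the bandit instance is illustrative):

```python
import math

def ucb1_regret_bound(means, T):
    """Evaluate the UCB1 regret upper bound: sum over suboptimal arms of
    8 log T / gap, plus the (1 + pi^2/3) * sum-of-gaps constant term."""
    best = max(means)
    gaps = [best - m for m in means]
    dominant = sum(8 * math.log(T) / g for g in gaps if g > 0)
    constant = (1 + math.pi ** 2 / 3) * sum(gaps)
    return dominant + constant

# Gaps of 0.1 and 0.4: the near-optimal arm contributes most of the bound.
print(ucb1_regret_bound([0.5, 0.6, 0.2], T=10_000))
```

Doubling $T$ only adds a constant to the bound (logarithmic growth), while halving the smallest gap roughly doubles it.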

Proof Sketch

Decompose the number of times arm $a$ is pulled as $N_T(a) = \sum_{t=1}^T \mathbf{1}(A_t = a)$. Arm $a$ is pulled at time $t$ only if its UCB exceeds the best arm's UCB. Using Hoeffding's inequality, the probability that the best arm's mean is underestimated shrinks as $1/t^4$, and the probability that arm $a$'s mean is overestimated by more than $\Delta_a$ shrinks similarly after $O(\log T / \Delta_a^2)$ pulls. Sum the contributions.

Why It Matters

The $O(\log T)$ regret scaling is optimal up to constants: Lai and Robbins (1985) proved that no consistent algorithm can achieve $o(\log T)$ regret on every instance. UCB achieves the optimal rate without knowing the gaps $\Delta_a$ in advance. This is the gold standard for bandit algorithms.

Failure Mode

The bound depends on $1/\Delta_a$, which blows up when arms have nearly equal means. In the worst case over gaps, summing gives $O(\sqrt{KT \log T})$ (the minimax rate up to log factors). UCB also assumes stationarity; for non-stationary bandits, the confidence intervals must be modified (e.g., sliding-window UCB or discounted UCB).

Exploration Strategies Compared

| Strategy | Directs exploration? | Uses uncertainty? | Regret | Strengths | Weaknesses |
| --- | --- | --- | --- | --- | --- |
| Epsilon-greedy | No (uniform random) | No | Can be sublinear with decay | Simple, easy to implement, universal | Wastes exploration on known-bad arms |
| Boltzmann/Softmax | Partially (favors high-value) | No | Depends on temperature schedule | Smoother than epsilon-greedy | Temperature requires tuning; ignores uncertainty |
| UCB1 | Yes (high uncertainty) | Yes (confidence intervals) | $O(\sum_a \log T / \Delta_a)$ | Near-optimal, no tuning needed, deterministic | Assumes bounded rewards, stationary environment |
| Thompson sampling | Yes (samples from posterior) | Yes (Bayesian posterior) | $O(\sum_a \log T / \Delta_a)$ | Often empirically superior, extends to structured problems | Requires maintaining posterior, harder to analyze |
| Count-based (MDPs) | Yes (low visit count) | Indirectly (counts proxy for uncertainty) | Near-optimal in tabular MDPs | Principled, provable guarantees | Counts infeasible in large or continuous state spaces |
| Curiosity/RND | Yes (prediction error) | Indirectly (model error as proxy) | No formal guarantees | Scales to deep RL, large state spaces | "Noisy TV" problem; can get stuck on stochastic transitions |

Exploration in MDPs

Bandits have $K$ arms. MDPs have $|\mathcal{S}| \times |\mathcal{A}|$ state-action pairs, and visiting one state reveals information about transitions to other states. This makes exploration much harder.

Count-based exploration maintains visit counts $N(s, a)$ and adds an exploration bonus to the reward:

$$r^+(s, a) = r(s, a) + \frac{\beta}{\sqrt{N(s,a)}}$$

This encourages visiting state-action pairs that have been seen less often. For tabular MDPs, count-based bonuses achieve near-optimal regret bounds.
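In the tabular case the bonus is just a dictionary of counts (a minimal sketch; `beta` and the class name are illustrative):

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus: r+ = r + beta / sqrt(N(s, a))."""
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)   # visit counts N(s, a)

    def shaped_reward(self, s, a, r):
        """Increment the count for (s, a) and return the bonused reward."""
        self.counts[(s, a)] += 1
        return r + self.beta / math.sqrt(self.counts[(s, a)])
```

The bonus for a pair decays as $1/\sqrt{N}$, so frequently visited pairs contribute almost nothing while novel pairs stay attractive.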

Intrinsic motivation (curiosity) replaces visit counts with prediction error. If the agent cannot predict the outcome of an action well, that action is "interesting" and worth exploring. Formally, the exploration bonus is:

$$r^+(s, a) = \|f(s, a) - \hat{f}(s, a)\|$$

where $\hat{f}$ is a learned forward model. This scales to large state spaces where explicit counts are infeasible.

The problem: in stochastic environments, unpredictable transitions generate high prediction error forever, trapping the agent in "noisy TV" states that are unpredictable but uninformative. Random Network Distillation (RND) partially addresses this by measuring novelty through a fixed random network rather than through forward prediction.
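The RND idea can be caricatured with linear maps: a frozen random "target" embedding and a trained "predictor", where the bonus is their disagreement, which decays as a state is revisited. This is a toy sketch (real RND uses deep networks; all names and dimensions here are illustrative):

```python
import numpy as np

class RND:
    """Toy Random Network Distillation: a frozen random linear 'target'
    embedding and a trainable linear 'predictor'. The bonus is the squared
    prediction error, which shrinks for states visited repeatedly."""
    def __init__(self, state_dim, embed_dim=16, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(state_dim, embed_dim))  # never trained
        self.pred = np.zeros((state_dim, embed_dim))           # trained online
        self.lr = lr

    def bonus(self, s):
        """Exploration bonus for state s, plus one predictor update step."""
        err = s @ self.target - s @ self.pred    # embedding disagreement
        self.pred += self.lr * np.outer(s, err)  # gradient step on squared error
        return float(np.sum(err ** 2))
```

Because the target is fixed, the bonus measures how well the predictor has fit each state, not how predictable the environment is, which is what sidesteps the noisy-TV trap.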

Hard Exploration Problems

Not all environments are equally difficult to explore. The following taxonomy clarifies when exploration is the binding constraint.

Dense reward environments (e.g., CartPole, continuous control with shaped rewards) are easy to explore. Even epsilon-greedy works because the agent receives frequent feedback about action quality. The exploration problem is mild; the challenge is optimization.

Sparse reward environments (e.g., Montezuma's Revenge, goal-conditioned navigation) are hard because the agent must perform a long sequence of correct actions before receiving any reward signal. Random exploration is exponentially unlikely to find the first reward. The agent needs structured exploration: either intrinsic motivation to seek novel states, or demonstration data to bootstrap learning.

Deceptive reward environments are the hardest case. The environment has a local optimum that is easy to find and a global optimum that requires passing through a region of low reward. Greedy exploitation locks onto the local optimum. Even count-based exploration may not help if the path to the global optimum is narrow in state space. Hierarchical RL and option-based exploration address this by operating at multiple temporal scales: high-level options explore macro-strategies while low-level policies exploit within a chosen strategy.

The information-theoretic view of exploration frames the problem as maximizing information gain about the environment's dynamics or reward function. Each action's exploration value is the expected reduction in entropy of the agent's beliefs. This connects exploration to experimental design in statistics and to Bayesian estimation: the agent is running an adaptive experiment where each action is a query and each observation is data.

Connection to Active Learning

Exploration in RL is analogous to active learning in supervised learning. In active learning, the learner chooses which data points to label. The goal is the same: gather information efficiently by querying uncertain regions rather than sampling randomly.

The key difference: in RL, exploration has a cost (suboptimal actions reduce cumulative reward). In pool-based active learning, querying any point has the same cost. This makes RL exploration harder because you must balance information gain against immediate reward loss.

Common Confusions

Watch Out

Epsilon-greedy is not the same as random exploration

Epsilon-greedy is $1 - \epsilon$ greedy and $\epsilon$ random. With $\epsilon = 0.1$ and 10 actions, each non-greedy action gets probability $0.01$, which may be too low to discover a good arm quickly. The randomness is uniform over all actions and does not target uncertain actions.
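The arithmetic is worth spelling out (a quick illustrative computation; note the random $\epsilon$ mass is spread over all actions, including the greedy one):

```python
def epsilon_greedy_probs(epsilon, num_actions, greedy_action=0):
    """Per-action selection probability under epsilon-greedy."""
    uniform = epsilon / num_actions   # each action's share of the random mass
    return [uniform + (1.0 - epsilon if a == greedy_action else 0.0)
            for a in range(num_actions)]

probs = epsilon_greedy_probs(0.1, 10)
print(probs[0], probs[1])  # greedy arm gets 0.91, every other arm 0.01
```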

Watch Out

Thompson sampling vs UCB

Thompson sampling maintains a posterior distribution over each arm's mean and samples from it. UCB constructs a deterministic upper bound. Both achieve $O(\log T)$ regret. Thompson sampling is often empirically better and extends more naturally to structured problems, but its theoretical analysis is more complex. Both are "optimistic" in different senses: UCB uses a worst-case confidence bound, Thompson sampling uses a random sample that is optimistic on average.
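For Bernoulli rewards, Thompson sampling reduces to sampling from Beta posteriors (a minimal sketch assuming uniform Beta(1, 1) priors; names are illustrative):

```python
import random

def thompson_action(successes, failures, rng=random):
    """Beta-Bernoulli Thompson sampling: draw a plausible mean for each arm
    from its Beta(1 + s, 1 + f) posterior and play the largest draw."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```

Arms with little data have wide posteriors, so they occasionally produce the largest sample and get explored; well-estimated bad arms almost never do.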

Watch Out

Exploration is not just about bandits

In MDPs, a single exploratory action can reveal information about many states (by reaching a new part of the state space). This sequential structure makes MDP exploration qualitatively different from bandit exploration. Algorithms that treat each state independently (like per-state epsilon-greedy) can require exponentially more exploration than algorithms that plan to explore (like R-MAX or UCRL).

Exercises

ExerciseCore

Problem

An agent uses epsilon-greedy with $\epsilon = 0.1$ in a 4-armed bandit. The true arm means are $\mu = (0.2, 0.8, 0.5, 0.3)$. If the agent's current estimates correctly identify arm 2 as the best, what is the expected reward per step?

ExerciseAdvanced

Problem

Prove that a purely greedy algorithm (always choose $\arg\max_a Q_t(a)$, breaking ties randomly) can suffer linear regret. Construct a specific 2-armed bandit instance where this happens.

ExerciseCore

Problem

In UCB1, arm $a$ has been pulled $N_t(a) = 10$ times with average reward $Q_t(a) = 0.7$, and the current round is $t = 100$. Arm $b$ has been pulled $N_t(b) = 50$ times with $Q_t(b) = 0.75$. With $c = \sqrt{2}$, which arm does UCB1 select?

References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 2 and 8
  • Lattimore & Szepesvari, Bandit Algorithms (2020), Chapters 7-8

Current:

  • Auer, Cesa-Bianchi, Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002) for UCB1
  • Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation" (NeurIPS 2016)
  • Burda et al., "Exploration by Random Network Distillation" (ICLR 2019)

Last reviewed: April 2026
