Q-Learning

Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.

Why This Matters

Value iteration requires knowing the transition model $P(s'|s,a)$: you need to sum over all possible next states. In most real problems, you do not have this model. You interact with the environment and observe transitions one at a time.

Q-learning solves this: it learns the optimal action-value function $Q^*$ from individual transitions $(s, a, r, s')$, without ever building a model of the environment. It is the most important model-free RL algorithm and the foundation of Deep Q-Networks (DQN), which achieved human-level performance on Atari games and launched the deep RL era.

Mental Model

Value iteration updates $V(s)$ by taking a max over all actions and summing over all next states weighted by their transition probabilities. Q-learning replaces this model-based expectation with a single sample: observe one transition $(s, a, r, s')$ and update $Q(s,a)$ toward $r + \gamma \max_{a'} Q(s', a')$.

Over many samples, the sample averages converge to the true expectations, and $Q$ converges to $Q^*$. The key requirement: you must visit every state-action pair infinitely often (exploration) and decrease the step size appropriately (learning rate schedule).

Formal Setup and Notation

We work in a finite MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ but assume $P$ is unknown. The agent observes transitions $(s_t, a_t, r_t, s_{t+1})$ by interacting with the environment.

Definition

Q-Learning Update Rule

Given a transition $(s, a, r, s')$, the Q-learning update is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha_t \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s,a) \right]$$

where $\alpha_t \in (0,1]$ is the learning rate (step size) at time $t$. The term in brackets is called the TD error (temporal difference error).
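The update rule is one line of code. A minimal tabular sketch (the 2×2 Q-table and the toy transition are illustrative assumptions):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q[s, a] toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # the TD error
    Q[s, a] += alpha * td_error
    return Q

# Toy usage: 2 states, 2 actions, Q initialized to zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(Q[0, 1])  # 0.1 * (1 + 0.9*0 - 0) = 0.1
```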

Definition

Off-Policy Learning

Q-learning is off-policy: the policy used to select actions during training (the behavior policy) can differ from the policy being learned (the target policy, which is greedy with respect to $Q$). This is because the update uses $\max_{a'} Q(s', a')$, which does not depend on which action was actually taken in state $s'$.

Definition

Robbins-Monro Conditions

The Robbins-Monro conditions on the step sizes require:

$$\sum_{t=0}^{\infty} \alpha_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$$

The first condition ensures the steps are large enough to overcome any initial error. The second ensures the noise from stochastic updates eventually dies out. A common choice is $\alpha_t = 1/t$.
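A quick numeric check of the $\alpha_t = 1/t$ schedule: the partial sums of $\alpha_t$ keep growing (like $\log T$), while the partial sums of $\alpha_t^2$ level off near $\pi^2/6 \approx 1.645$ (the horizon cutoffs below are arbitrary):

```python
# Partial sums for alpha_t = 1/t: the first diverges slowly,
# the second converges to pi^2/6.
for T in (10, 1000, 100000):
    s1 = sum(1.0 / t for t in range(1, T + 1))      # sum of alpha_t
    s2 = sum(1.0 / t**2 for t in range(1, T + 1))   # sum of alpha_t^2
    print(f"T={T:>6}  sum alpha_t = {s1:7.3f}  sum alpha_t^2 = {s2:.4f}")
```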

Main Theorems

Theorem

Q-Learning Convergence

Statement

Under the stated conditions, the Q-learning iterates converge to the optimal action-value function with probability 1:

$$Q_t(s,a) \xrightarrow{a.s.} Q^*(s,a) \quad \text{for all } (s,a)$$

where $Q^*$ is the unique fixed point of the Bellman optimality operator on Q-values: $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$.

Intuition

The Q-learning update is a stochastic approximation to the Bellman optimality update. At each step, the sample $r + \gamma \max_{a'} Q(s', a')$ is a noisy estimate of $(\mathcal{T}Q)(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s', a')$. Since $\mathcal{T}$ is a $\gamma$-contraction, the stochastic approximation theory of Robbins-Monro guarantees convergence to the fixed point as long as the noise decays (step sizes shrink) and every state-action pair is updated infinitely often.

Proof Sketch

The proof applies the general stochastic approximation theorem. Write the Q-learning update as:

$$Q_{t+1}(s,a) = (1 - \alpha_t) Q_t(s,a) + \alpha_t \left[ (\mathcal{T}Q_t)(s,a) + w_t \right]$$

where $w_t = [r + \gamma \max_{a'} Q_t(s',a')] - (\mathcal{T}Q_t)(s,a)$ is zero-mean noise (conditioned on the history). The operator $\mathcal{T}$ is a $\gamma$-contraction. By the ODE method for stochastic approximation, this converges to the fixed point of $\mathcal{T}$ under the Robbins-Monro conditions on $\alpha_t$ and the infinite-visitation assumption.

Why It Matters

This theorem is the theoretical foundation of model-free RL. It says you can learn optimal behavior without knowing the environment dynamics, just by trying things and observing results. The price you pay is the need to explore every state-action pair infinitely often, a condition that becomes impossible to satisfy in large state spaces.

Failure Mode

The convergence guarantee requires visiting every $(s,a)$ pair infinitely often. In practice, with epsilon-greedy exploration in a large state space, many pairs are visited rarely or never. More critically, this theorem applies only to the tabular case. It says nothing about function approximation.

From Tabular to Deep: DQN

For large or continuous state spaces, maintaining a table $Q(s,a)$ for every state-action pair is impossible. Deep Q-Networks (DQN) approximate $Q^*$ with a neural network $Q_\theta(s,a)$.

The naive approach of simply replacing the table with a network fails catastrophically. DQN introduced two stabilization techniques:

Experience Replay. Store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. At each training step, sample a random minibatch from $\mathcal{D}$ and update. This breaks the correlation between consecutive samples (which violates the i.i.d. assumption of SGD) and reuses data efficiently.
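A replay buffer of this kind can be sketched in a few lines (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer; uniform sampling breaks temporal correlation."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)

# Usage: store a few dummy transitions, then draw a minibatch.
buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
print(len(buf), len(batch))  # 100 32
```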

Target Network. Maintain a separate target network $Q_{\theta^-}$ that is updated slowly (e.g., copied from $Q_\theta$ every $C$ steps). The training target becomes:

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$

This prevents the moving-target problem where the update target changes with every gradient step.

The DQN loss for a minibatch is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ (y - Q_\theta(s,a))^2 \right]$$
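A sketch of one DQN training step, with a linear Q-function standing in for the neural network (parameter names, shapes, the learning rate, and the fake minibatch are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 4, 3, 0.99

theta = rng.normal(size=(n_features, n_actions))  # online parameters
theta_target = theta.copy()                       # frozen target copy

def q_values(phi, params):
    """Linear Q stands in for the network: Q(s, .) = phi(s)^T params."""
    return phi @ params

def td_targets(r, phi_next, done, params):
    """y = r + gamma * max_a' Q_target(s', a'), cut off at terminal states."""
    return r + gamma * (1.0 - done) * q_values(phi_next, params).max(axis=1)

# One minibatch of fake transitions.
B = 32
phi = rng.normal(size=(B, n_features))
a = rng.integers(0, n_actions, size=B)
r = rng.normal(size=B)
phi_next = rng.normal(size=(B, n_features))
done = rng.integers(0, 2, size=B).astype(float)

y = td_targets(r, phi_next, done, theta_target)  # target uses theta_target
q_sa = q_values(phi, theta)[np.arange(B), a]     # online Q(s, a)
loss = np.mean((y - q_sa) ** 2)                  # DQN squared-error loss

# Semi-gradient step: y is treated as a constant, only q_sa is differentiated.
grad = np.zeros_like(theta)
for i in range(B):
    grad[:, a[i]] += 2.0 * (q_sa[i] - y[i]) * phi[i] / B
theta -= 0.01 * grad
```

The "semi-gradient" comment is the key design point: differentiating through $y$ as well would change the algorithm, so the target is held fixed during the step.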

The Deadly Triad

Proposition

The Deadly Triad

Statement

The combination of function approximation, bootstrapping, and off-policy learning can cause Q-value estimates to diverge. Specifically, there exist MDPs and linear function approximators where semi-gradient Q-learning with off-policy data produces unbounded Q-values:

$$\|Q_t\|_\infty \to \infty \quad \text{as } t \to \infty$$

Any two of the three components are safe; it is the combination of all three that creates instability.

Intuition

Bootstrapping means the update target depends on the current estimates (a moving target). Function approximation means updating one state-action pair affects the values of other pairs (generalization). Off-policy data means the distribution of updates does not match the distribution the current policy would generate. Together, these can create a feedback loop where overestimation in one region propagates and amplifies through the function approximator.

Why It Matters

DQN uses all three components. The target network and experience replay are engineering solutions that mitigate (but do not eliminate) instability. The deadly triad explains why DQN training can be brittle and why many subsequent algorithms (Double DQN, Dueling DQN, distributional RL) focus on stabilization. It also explains the appeal of on-policy methods like PPO, which avoid the off-policy component entirely.

DQN Variants

Double DQN. The max operator in $\max_{a'} Q_\theta(s',a')$ causes systematic overestimation because $\mathbb{E}[\max_i X_i] \geq \max_i \mathbb{E}[X_i]$. Double DQN decouples action selection from evaluation:

$$y = r + \gamma \, Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$$
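The difference between the two targets shows up on a toy minibatch (numpy arrays stand in for the two networks' outputs; all values are illustrative):

```python
import numpy as np

gamma = 0.99

def dqn_target(r, q_next_target):
    """Standard DQN: the target net both selects and evaluates the action."""
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_online, q_next_target):
    """Double DQN: the online net selects a*, the target net evaluates it."""
    a_star = q_next_online.argmax(axis=1)
    return r + gamma * q_next_target[np.arange(len(r)), a_star]

# Toy case where the targets differ: the online net prefers action 1,
# but the target net's own max is at action 0.
r = np.array([0.0])
q_online = np.array([[1.0, 2.0]])
q_tgt = np.array([[3.0, 1.0]])
print(dqn_target(r, q_tgt))                   # 0 + 0.99 * 3.0
print(double_dqn_target(r, q_online, q_tgt))  # 0 + 0.99 * 1.0
```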

Dueling DQN. Decompose the Q-network into value and advantage streams: $Q_\theta(s,a) = V_\theta(s) + A_\theta(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A_\theta(s,a')$. This helps when the value of a state matters more than the relative value of actions.

Q-Learning vs. SARSA

Q-learning is off-policy: it always updates toward $\max_{a'} Q(s', a')$. SARSA is on-policy: it updates toward $Q(s', a')$ where $a'$ is the action actually taken:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right]$$

SARSA converges to $Q^\pi$ (the value of the current policy) rather than $Q^*$. It is safer in practice: a SARSA agent with epsilon-greedy exploration learns to avoid dangerous states, while a Q-learning agent learns the optimal policy ignoring exploration costs.
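Side by side, the two updates differ only in the target term. A toy sketch (the Q-table values are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the max over next actions."""
    return Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a' actually taken in s'."""
    return Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# The two agree only when a' happens to be the greedy action in s'.
Q = np.array([[0.0, 0.0], [1.0, 5.0]])
print(q_learning_update(Q, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9))       # uses Q[1].max() = 5
print(sarsa_update(Q, 0, 0, 1.0, 1, a_next=0, alpha=0.5, gamma=0.9))  # uses Q[1, 0] = 1
```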

Common Confusions

Watch Out

Q-learning is not the same as DQN

Q-learning is a tabular algorithm with convergence guarantees. DQN is Q-learning combined with neural network function approximation, experience replay, and target networks. The convergence guarantee of tabular Q-learning does not transfer to DQN. The deadly triad means DQN can diverge, and its success is empirical rather than theoretically guaranteed.

Watch Out

The max in Q-learning is not the same as the max in value iteration

In value iteration, $\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$ is computed exactly over all next states. In Q-learning, $r + \gamma \max_{a'} Q(s', a')$ uses a single sampled next state $s'$. The max over actions is still exact (in the tabular case), but the expectation over transitions is replaced by a single sample.

Watch Out

Exploration is not solved by Q-learning

Q-learning convergence assumes all state-action pairs are visited infinitely often. It does not specify how to achieve this. Epsilon-greedy is the simplest strategy (take random actions with probability $\epsilon$), but it is inefficient in large or sparse-reward environments. Efficient exploration (UCB, Thompson sampling, curiosity-driven methods) is a separate research area.
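A minimal epsilon-greedy selector, assuming the Q-values for the current state are given as a list:

```python
import random

def epsilon_greedy(Q_row, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_row))
    return max(range(len(Q_row)), key=lambda a: Q_row[a])

# With epsilon = 0 the choice is always greedy.
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```

Note that even with $\epsilon > 0$ every action keeps nonzero probability, which is what the infinite-visitation assumption needs (in a finite MDP where every state remains reachable).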

Summary

  • Q-learning update: $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
  • Converges to $Q^*$ under Robbins-Monro step sizes and infinite exploration
  • Off-policy: the behavior policy and target policy can differ
  • DQN adds function approximation, experience replay, and target networks
  • The deadly triad: function approximation + bootstrapping + off-policy can diverge
  • Double DQN fixes overestimation; dueling DQN improves value/advantage decomposition

Exercises

ExerciseCore

Problem

In a deterministic MDP with two states $\{s_1, s_2\}$ and one action per state, $R(s_1) = 1$, $R(s_2) = 0$, $s_1 \to s_2$, $s_2 \to s_2$, $\gamma = 0.9$, and $\alpha = 0.1$. Starting from $Q_0 = 0$, compute $Q_1(s_1)$ after one observed transition from $s_1$.

ExerciseCore

Problem

Why does experience replay help DQN training? Give two distinct reasons.

ExerciseAdvanced

Problem

Explain why the deadly triad does not arise in tabular Q-learning. Which of the three components is missing, and why does its absence prevent divergence?

References

Canonical:

  • Watkins & Dayan, "Q-Learning" (1992): the original convergence proof
  • Tsitsiklis, "Asynchronous Stochastic Approximation and Q-Learning" (1994)

Current:

  • Mnih et al., "Human-Level Control through Deep Reinforcement Learning" (Nature, 2015): the DQN paper
  • van Hasselt et al., "Deep Reinforcement Learning with Double Q-Learning" (2016)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 6

Next Topics

The natural next step from Q-learning:

  • Policy gradient theorem: when value-based methods struggle (continuous actions, high-dimensional spaces), optimize the policy directly

Last reviewed: April 2026
