Q-Learning

Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.

Why This Matters

Value iteration requires knowing the transition model $P(s'|s,a)$: you need to sum over all possible next states. In most real problems, you do not have this model. You interact with the environment and observe transitions one at a time.

Q-learning solves this: it learns the optimal action-value function $Q^*$ from individual transitions $(s, a, r, s')$, without ever building a model of the environment. It is the most important model-free RL algorithm and the foundation of Deep Q-Networks (DQN), which achieved human-level performance on Atari games and launched the deep RL era.

Mental Model

Value iteration updates $V(s)$ by taking a max over all actions and summing over all next states weighted by their transition probabilities. Q-learning replaces this model-based expectation with a single sample: observe one transition $(s, a, r, s')$ and update $Q(s,a)$ toward $r + \gamma \max_{a'} Q(s', a')$.

Over many samples, the sample averages converge to the true expectations, and $Q$ converges to $Q^*$. The key requirement: you must visit every state-action pair infinitely often (exploration) and decrease the step size appropriately (learning rate schedule).

Formal Setup and Notation

We work in a finite MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ but assume $P$ is unknown. The agent observes transitions $(s_t, a_t, r_t, s_{t+1})$ by interacting with the environment.

Definition

Q-Learning Update Rule

Given a transition $(s, a, r, s')$, the Q-learning update is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha_t \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s,a) \right]$$

where $\alpha_t \in (0,1]$ is the learning rate (step size) at time $t$. The term in brackets is called the TD error (temporal difference error).
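The update rule is one line of code. A minimal tabular sketch (the 2×2 Q-table and the toy transition are illustrative assumptions):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q[s, a] toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # the TD error
    Q[s, a] += alpha * td_error
    return Q

# Toy usage: 2 states, 2 actions, Q initialized to zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(Q[0, 1])  # 0.1 * (1 + 0.9*0 - 0) = 0.1
```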

Definition

Off-Policy Learning

Q-learning is off-policy: the policy used to select actions during training (the behavior policy) can differ from the policy being learned (the target policy, which is greedy with respect to $Q$). This is because the update uses $\max_{a'} Q(s', a')$, which does not depend on which action was actually taken in state $s'$.

Definition

Robbins-Monro Conditions

The Robbins-Monro conditions on the step sizes require:

$$\sum_{t=0}^{\infty} \alpha_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$$

The first condition ensures the steps are large enough to overcome any initial error. The second ensures the noise from stochastic updates eventually dies out. A common choice is $\alpha_t = 1/t$.
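A quick numeric check of the $\alpha_t = 1/t$ schedule: the partial sums of $\alpha_t$ keep growing (like $\log T$), while the partial sums of $\alpha_t^2$ level off near $\pi^2/6 \approx 1.645$ (the horizon cutoffs below are arbitrary):

```python
# Partial sums for alpha_t = 1/t: the first diverges slowly,
# the second converges to pi^2/6.
for T in (10, 1000, 100000):
    s1 = sum(1.0 / t for t in range(1, T + 1))      # sum of alpha_t
    s2 = sum(1.0 / t**2 for t in range(1, T + 1))   # sum of alpha_t^2
    print(f"T={T:>6}  sum alpha_t = {s1:7.3f}  sum alpha_t^2 = {s2:.4f}")
```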

Main Theorems

Theorem

Q-Learning Convergence

Statement

Under the stated conditions, the Q-learning iterates converge to the optimal action-value function with probability 1:

$$Q_t(s,a) \xrightarrow{a.s.} Q^*(s,a) \quad \text{for all } (s,a)$$

where $Q^*$ is the unique fixed point of the Bellman optimality operator on Q-values: $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$.

Intuition

The Q-learning update is a stochastic approximation to the Bellman optimality update. At each step, the sample $r + \gamma \max_{a'} Q(s', a')$ is a noisy estimate of $(\mathcal{T}Q)(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s', a')$. Since $\mathcal{T}$ is a $\gamma$-contraction, the stochastic approximation theory of Robbins-Monro guarantees convergence to the fixed point as long as the noise decays (step sizes shrink) and every state-action pair is updated infinitely often.

Proof Sketch

The proof applies the general stochastic approximation theorem. Write the Q-learning update as:

$$Q_{t+1}(s,a) = (1 - \alpha_t) Q_t(s,a) + \alpha_t \left[ (\mathcal{T}Q_t)(s,a) + w_t \right]$$

where $w_t = [r + \gamma \max_{a'} Q_t(s',a')] - (\mathcal{T}Q_t)(s,a)$ is zero-mean noise (conditioned on the history). The operator $\mathcal{T}$ is a $\gamma$-contraction. By the ODE method for stochastic approximation, this converges to the fixed point of $\mathcal{T}$ under the Robbins-Monro conditions on $\alpha_t$ and the infinite-visitation assumption.

Why It Matters

This theorem is the theoretical foundation of model-free RL. It says you can learn optimal behavior without knowing the environment dynamics, just by trying things and observing results. The price you pay is the need to explore every state-action pair infinitely often, a condition that becomes impossible to satisfy in large state spaces.

Failure Mode

The convergence guarantee requires visiting every $(s,a)$ pair infinitely often. In practice, with epsilon-greedy exploration in a large state space, many pairs are visited rarely or never. More critically, this theorem applies only to the tabular case. It says nothing about function approximation.

From Tabular to Deep: DQN

For large or continuous state spaces, maintaining a table $Q(s,a)$ for every state-action pair is impossible. Deep Q-Networks (DQN) approximate $Q^*$ with a neural network $Q_\theta(s,a)$.

The naive approach of simply replacing the table with a network fails catastrophically. DQN introduced two stabilization techniques:

Experience Replay. Store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. At each training step, sample a random minibatch from $\mathcal{D}$ and update. This breaks the correlation between consecutive samples (which violates the i.i.d. assumption of SGD) and reuses data efficiently.
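A replay buffer of this kind can be sketched in a few lines (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer; uniform sampling breaks temporal correlation."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)

# Usage: store a few dummy transitions, then draw a minibatch.
buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
print(len(buf), len(batch))  # 100 32
```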

Target Network. Maintain a separate target network $Q_{\theta^-}$ that is updated slowly (e.g., copied from $Q_\theta$ every $C$ steps). The training target becomes:

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$

This prevents the moving-target problem where the update target changes with every gradient step.

The DQN loss for a minibatch is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ (y - Q_\theta(s,a))^2 \right]$$
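A sketch of one DQN training step, with a linear Q-function standing in for the neural network (parameter names, shapes, the learning rate, and the fake minibatch are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 4, 3, 0.99

theta = rng.normal(size=(n_features, n_actions))  # online parameters
theta_target = theta.copy()                       # frozen target copy

def q_values(phi, params):
    """Linear Q stands in for the network: Q(s, .) = phi(s)^T params."""
    return phi @ params

def td_targets(r, phi_next, done, params):
    """y = r + gamma * max_a' Q_target(s', a'), cut off at terminal states."""
    return r + gamma * (1.0 - done) * q_values(phi_next, params).max(axis=1)

# One minibatch of fake transitions.
B = 32
phi = rng.normal(size=(B, n_features))
a = rng.integers(0, n_actions, size=B)
r = rng.normal(size=B)
phi_next = rng.normal(size=(B, n_features))
done = rng.integers(0, 2, size=B).astype(float)

y = td_targets(r, phi_next, done, theta_target)  # target uses theta_target
q_sa = q_values(phi, theta)[np.arange(B), a]     # online Q(s, a)
loss = np.mean((y - q_sa) ** 2)                  # DQN squared-error loss

# Semi-gradient step: y is treated as a constant, only q_sa is differentiated.
grad = np.zeros_like(theta)
for i in range(B):
    grad[:, a[i]] += 2.0 * (q_sa[i] - y[i]) * phi[i] / B
theta -= 0.01 * grad
```

The "semi-gradient" comment is the key design point: differentiating through $y$ as well would change the algorithm, so the target is held fixed during the step.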

The Deadly Triad

Proposition

The Deadly Triad

Statement

The combination of function approximation, bootstrapping, and off-policy learning can cause Q-value estimates to diverge. Specifically, there exist MDPs and linear function approximators where semi-gradient Q-learning with off-policy data produces unbounded Q-values:

$$\|Q_t\|_\infty \to \infty \quad \text{as } t \to \infty$$

Any two of the three components are safe; it is the combination of all three that creates instability.

Intuition

Bootstrapping means the update target depends on the current estimates (a moving target). Function approximation means updating one state-action pair affects the values of other pairs (generalization). Off-policy data means the distribution of updates does not match the distribution the current policy would generate. Together, these can create a feedback loop where overestimation in one region propagates and amplifies through the function approximator.

Why It Matters

DQN uses all three components. The target network and experience replay are engineering solutions that mitigate (but do not eliminate) instability. The deadly triad explains why DQN training can be brittle and why many subsequent algorithms (Double DQN, Dueling DQN, distributional RL) focus on stabilization. It also explains the appeal of on-policy methods like PPO, which avoid the off-policy component entirely.

DQN Variants

Double DQN. The max operator in $\max_{a'} Q_\theta(s',a')$ causes systematic overestimation because $\mathbb{E}[\max_i X_i] \geq \max_i \mathbb{E}[X_i]$. Double DQN decouples action selection from evaluation:

$$y = r + \gamma \, Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$$
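The difference between the two targets shows up on a toy minibatch (numpy arrays stand in for the two networks' outputs; all values are illustrative):

```python
import numpy as np

gamma = 0.99

def dqn_target(r, q_next_target):
    """Standard DQN: the target net both selects and evaluates the action."""
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_online, q_next_target):
    """Double DQN: the online net selects a*, the target net evaluates it."""
    a_star = q_next_online.argmax(axis=1)
    return r + gamma * q_next_target[np.arange(len(r)), a_star]

# Toy case where the targets differ: the online net prefers action 1,
# but the target net's own max is at action 0.
r = np.array([0.0])
q_online = np.array([[1.0, 2.0]])
q_tgt = np.array([[3.0, 1.0]])
print(dqn_target(r, q_tgt))                   # 0 + 0.99 * 3.0
print(double_dqn_target(r, q_online, q_tgt))  # 0 + 0.99 * 1.0
```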

Dueling DQN. Decompose the Q-network into value and advantage streams: $Q_\theta(s,a) = V_\theta(s) + A_\theta(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A_\theta(s,a')$. This helps when the value of a state matters more than the relative value of actions.

Q-Learning vs. SARSA

Q-learning is off-policy: it always updates toward $\max_{a'} Q(s', a')$. SARSA is on-policy: it updates toward $Q(s', a')$ where $a'$ is the action actually taken:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right]$$

SARSA converges to $Q^\pi$ (the value of the current policy) rather than $Q^*$. It is safer in practice: a SARSA agent with epsilon-greedy exploration learns to avoid dangerous states, while a Q-learning agent learns the optimal policy ignoring exploration costs.
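Side by side, the two updates differ only in the target term. A toy sketch (the Q-table values are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the max over next actions."""
    return Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a' actually taken in s'."""
    return Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# The two agree only when a' happens to be the greedy action in s'.
Q = np.array([[0.0, 0.0], [1.0, 5.0]])
print(q_learning_update(Q, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9))       # uses Q[1].max() = 5
print(sarsa_update(Q, 0, 0, 1.0, 1, a_next=0, alpha=0.5, gamma=0.9))  # uses Q[1, 0] = 1
```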

Common Confusions

Watch Out

Q-learning is not the same as DQN

Q-learning is a tabular algorithm with convergence guarantees. DQN is Q-learning combined with neural network function approximation, experience replay, and target networks. The convergence guarantee of tabular Q-learning does not transfer to DQN. The deadly triad means DQN can diverge, and its success is empirical rather than theoretically guaranteed.

Watch Out

The max in Q-learning is not the same as the max in value iteration

In value iteration, $\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$ is computed exactly over all next states. In Q-learning, $r + \gamma \max_{a'} Q(s', a')$ uses a single sampled next state $s'$. The max over actions is still exact (in the tabular case), but the expectation over transitions is replaced by a single sample.

Watch Out

Exploration is not solved by Q-learning

Q-learning convergence assumes all state-action pairs are visited infinitely often. It does not specify how to achieve this. Epsilon-greedy is the simplest strategy (take random actions with probability $\epsilon$), but it is inefficient in large or sparse-reward environments. Efficient exploration (UCB, Thompson sampling, curiosity-driven methods) is a separate research area.
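A minimal epsilon-greedy selector, assuming the Q-values for the current state are given as a list:

```python
import random

def epsilon_greedy(Q_row, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_row))
    return max(range(len(Q_row)), key=lambda a: Q_row[a])

# With epsilon = 0 the choice is always greedy.
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```

Note that even with $\epsilon > 0$ every action keeps nonzero probability, which is what the infinite-visitation assumption needs (in a finite MDP where every state remains reachable).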

Summary

  • Q-learning update: $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
  • Converges to $Q^*$ under Robbins-Monro step sizes and infinite exploration
  • Off-policy: the behavior policy and target policy can differ
  • DQN adds function approximation, experience replay, and target networks
  • The deadly triad: function approximation + bootstrapping + off-policy can diverge
  • Double DQN fixes overestimation; dueling DQN improves value/advantage decomposition

Exercises

ExerciseCore

Problem

In a deterministic MDP with two states $\{s_1, s_2\}$ and one action per state, $R(s_1) = 1$, $R(s_2) = 0$, $s_1 \to s_2$, $s_2 \to s_2$, $\gamma = 0.9$, and $\alpha = 0.1$. Starting from $Q_0 = 0$, compute $Q_1(s_1)$ after one observed transition from $s_1$.

ExerciseCore

Problem

Why does experience replay help DQN training? Give two distinct reasons.

ExerciseAdvanced

Problem

Explain why the deadly triad does not arise in tabular Q-learning. Which of the three components is missing, and why does its absence prevent divergence?

References

Canonical:

  • Watkins & Dayan, "Q-Learning" (1992): the original convergence proof
  • Tsitsiklis, "Asynchronous Stochastic Approximation and Q-Learning" (1994)

Current:

  • Mnih et al., "Human-Level Control through Deep Reinforcement Learning" (Nature, 2015): the DQN paper
  • van Hasselt et al., "Deep Reinforcement Learning with Double Q-Learning" (2016)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 6

Next Topics

The natural next step from Q-learning:

  • Policy gradient theorem: when value-based methods struggle (continuous actions, high-dimensional spaces), optimize the policy directly

Last reviewed: April 2026
