

Stochastic Approximation Theory

The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.


Why This Matters

Stochastic gradient descent, Q-learning, temporal difference learning, and dozens of other iterative algorithms in ML and RL share a common skeleton: an update rule that moves toward a target using noisy observations. Stochastic approximation is the theory that explains why these algorithms converge despite the noise, and how fast.

If you understand stochastic approximation, you understand the convergence of SGD, Q-learning, and TD-learning as special cases of one theorem. Without it, each convergence proof looks like a separate miracle.

Mental Model

You want to find a root $\theta^*$ of an equation $h(\theta) = 0$, but you cannot evaluate $h$ directly. Instead, at each step you observe $h(\theta_t) + \xi_t$ where $\xi_t$ is zero-mean noise. The question: under what conditions on the step sizes $\{\eta_t\}$ and the noise $\{\xi_t\}$ does the sequence $\theta_t$ converge to $\theta^*$?

The answer is the Robbins-Monro theorem: if the step sizes decay at the right rate (slow enough to sum to infinity, fast enough for their squares to be summable) and the noise is controlled, then $\theta_t \to \theta^*$ almost surely.

Formal Setup

Consider the iteration:

$$\theta_{t+1} = \theta_t + \eta_t \left[h(\theta_t) + \xi_{t+1}\right]$$

where $\theta_t \in \mathbb{R}^d$ is the parameter, $\eta_t > 0$ is the step size, $h: \mathbb{R}^d \to \mathbb{R}^d$ is the mean field (the function whose root we seek), and $\xi_{t+1}$ is a noise term.
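A minimal simulation makes the setup concrete. The sketch below is illustrative (the function `robbins_monro` and the toy mean field are invented for this example, not taken from any library): it runs the iteration on a one-dimensional mean field with Gaussian observation noise and the canonical $c/t$ step size.

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(h, theta0, n_steps, c=1.0, noise_std=1.0):
    """Iterate theta_{t+1} = theta_t + eta_t * (h(theta_t) + noise), eta_t = c/t."""
    theta = theta0
    for t in range(1, n_steps + 1):
        xi = noise_std * rng.normal()         # zero-mean observation noise
        theta = theta + (c / t) * (h(theta) + xi)
    return theta

# Toy mean field h(theta) = -(theta - 3): unique root theta* = 3.
theta_hat = robbins_monro(lambda th: -(th - 3.0), theta0=0.0, n_steps=50_000)
print(theta_hat)   # close to 3
```

For this particular $h$ and $c = 1$, the iterate is exactly the running average of the noisy observations of $\theta^*$, which is why the estimate is so accurate.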

Definition

Robbins-Monro Conditions

The step size sequence $\{\eta_t\}_{t \geq 0}$ satisfies the Robbins-Monro conditions if:

$$\sum_{t=0}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=0}^{\infty} \eta_t^2 < \infty$$

The first condition ensures the iterates can reach any target (the steps do not decay too fast). The second ensures the accumulated noise variance is finite (the steps decay fast enough to average out noise).

The canonical choice is $\eta_t = c/t$ for some $c > 0$, which satisfies both conditions: the harmonic series diverges ($\sum 1/t = \infty$) while the squared series converges ($\sum 1/t^2 = \pi^2/6 < \infty$).

Watch Out

Why not constant step size?

A constant step size $\eta_t = \eta$ satisfies $\sum \eta_t = \infty$ but violates $\sum \eta_t^2 < \infty$. The iterates keep bouncing around $\theta^*$ with variance proportional to $\eta$: they converge to a neighborhood of $\theta^*$, not to $\theta^*$ itself. This is what practitioners actually use in SGD (a constant learning rate with eventual decay), and the theory predicts exactly the stationary distribution the iterates reach.
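The contrast can be checked numerically. In this illustrative sketch (toy mean field, invented helper `run`), a decaying schedule drives the variance of the final iterate toward zero, while a constant step leaves the iterate in a noise ball whose variance scales with $\eta$.

```python
import numpy as np

rng = np.random.default_rng(1)

def run(step_fn, n_steps=5000, theta_star=3.0):
    """Iterate theta += eta_t * (-(theta - theta*) + noise); return final theta."""
    theta = 0.0
    for t in range(1, n_steps + 1):
        theta += step_fn(t) * (-(theta - theta_star) + rng.normal())
    return theta

decaying = [run(lambda t: 1.0 / t) for _ in range(50)]   # Robbins-Monro schedule
constant = [run(lambda t: 0.1) for _ in range(50)]       # fixed learning rate
print(np.var(decaying), np.var(constant))   # constant steps leave a noise ball
```

Across the 50 runs, the spread of the final iterate under the constant step stays at a fixed level, while under the decaying schedule it shrinks with the horizon.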

The Robbins-Monro Theorem

Theorem

Robbins-Monro Convergence

Statement

Assume the Robbins-Monro step size conditions, martingale-difference noise with bounded conditional variance, and the stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$ for all $\theta \neq \theta^*$. Then the stochastic approximation iterates $\theta_{t+1} = \theta_t + \eta_t[h(\theta_t) + \xi_{t+1}]$ satisfy:

$$\theta_t \to \theta^* \quad \text{almost surely as } t \to \infty$$

where $\theta^*$ is the unique root of $h$.

Intuition

The noise averages out because the step sizes decay (second Robbins-Monro condition). The signal accumulates because the step sizes decay slowly enough (first condition). The stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$ ensures that the mean field always pushes the iterate back toward the root.

Proof Sketch

The standard proof uses a supermartingale argument. Define $V_t = \|\theta_t - \theta^*\|^2$ and expand:

$$V_{t+1} = V_t + 2\eta_t \langle \theta_t - \theta^*, h(\theta_t) \rangle + 2\eta_t \langle \theta_t - \theta^*, \xi_{t+1} \rangle + \eta_t^2 \|h(\theta_t) + \xi_{t+1}\|^2$$

Taking conditional expectation: the cross term with $\xi_{t+1}$ vanishes (martingale property). The inner product $\langle \theta_t - \theta^*, h(\theta_t) \rangle$ is negative by the stability condition. The $\eta_t^2$ term is summable. By the Robbins-Siegmund supermartingale convergence theorem, $V_t$ converges a.s., and since the negative drift is not summable unless $\theta_t \to \theta^*$, we conclude convergence.

Why It Matters

This single theorem implies the convergence of:

  • SGD: set $h(\theta) = -\nabla F(\theta)$ and $\xi_{t+1} = \nabla F(\theta_t) - \nabla f_{i_t}(\theta_t)$
  • Q-learning: set $h(Q) = T^*Q - Q$ where $T^*$ is the Bellman optimality operator
  • TD(0): set $h(V) = T^\pi V - V$ where $T^\pi$ is the Bellman evaluation operator

All are stochastic approximation with different mean fields $h$.
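To make the SGD case concrete, here is a small least-squares sketch (the problem data and step offset are invented for illustration): the stochastic gradient is an unbiased estimate of $\nabla F$, so the update is exactly a Robbins-Monro iteration with mean field $h(\theta) = -\nabla F(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Least-squares: F(theta) = E[(y - x.theta)^2]/2 with y = x.theta* + noise.
theta_star = np.array([1.0, -2.0])

theta = np.zeros(2)
for t in range(1, 50_000 + 1):
    x = rng.normal(size=2)
    y = x @ theta_star + 0.1 * rng.normal()
    stoch_grad = -(y - x @ theta) * x          # gradient of (y - x.theta)^2 / 2
    # Robbins-Monro form: h(theta) = -grad F(theta), xi = grad F - stoch_grad.
    theta -= (1.0 / (t + 10)) * stoch_grad     # offset keeps early steps modest
print(theta)   # close to theta_star
```

The small offset in the step size is a common practical tweak: it leaves $\sum \eta_t = \infty$ intact while avoiding wild early updates.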

Failure Mode

If the noise variance grows faster than the stability condition can control (e.g., $\mathbb{E}[\|\xi_{t+1}\|^2 \mid \mathcal{F}_t]$ grows polynomially with $\|\theta_t\|$), the iterates can diverge. This is exactly what happens with Q-learning under function approximation (the "deadly triad"): the mean field $h$ is no longer a contraction in the right norm, and the variance grows with the approximation error.

The ODE Method

The ODE method, due to Ljung (1977) and refined by Borkar and Meyn (2000), gives a more powerful framework for analyzing stochastic approximation.

Definition

Associated ODE

For the stochastic approximation iteration $\theta_{t+1} = \theta_t + \eta_t[h(\theta_t) + \xi_{t+1}]$, the associated ordinary differential equation is:

$$\frac{d\theta}{dt} = h(\theta)$$

The ODE method says: if the ODE has a globally asymptotically stable equilibrium $\theta^*$, then the stochastic approximation iterates track the ODE solution and converge to $\theta^*$.

Theorem

ODE Method for Stochastic Approximation

Statement

Under the Robbins-Monro step size conditions, with martingale-difference noise and almost surely bounded iterates, $\theta_t \to \theta^*$ almost surely. The interpolated process (piecewise linear through the iterates, with time rescaled by the step sizes) converges to the trajectory of the ODE $\dot{\theta} = h(\theta)$.

Intuition

Over short time windows, the noise averages out (law of large numbers). What remains is the deterministic drift $h(\theta)$, which is exactly the ODE. The stochastic iterates "track" the ODE solution, with the noise causing fluctuations that shrink as the step size decreases.

Proof Sketch

  1. Interpolate the discrete iterates into a continuous trajectory $\bar{\theta}(t)$ whose time increments are $\eta_t$.
  2. Show the noise contribution $\sum_{k=m}^{n} \eta_k \xi_{k+1}$ converges to zero over any fixed time window as $m \to \infty$ (by the martingale strong law).
  3. The remaining trajectory satisfies an integral equation that approximates the ODE.
  4. By Gronwall's inequality, the deviation from the ODE solution is controlled by the step size and noise residual.
  5. Conclude that $\bar{\theta}(t)$ converges to the ODE's stable equilibrium.
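The tracking statement can be checked numerically. In this illustrative sketch (toy one-dimensional problem, not from the references), we accumulate the step sizes as "ODE time" and compare the iterates against the closed-form ODE solution $\theta(s) = 3(1 - e^{-s})$ started from $\theta(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Mean field h(theta) = -(theta - 3); associated ODE dtheta/ds = -(theta - 3),
# whose solution from theta(0) = 0 is theta(s) = 3 * (1 - exp(-s)).
theta, s = 0.0, 0.0                      # SA iterate and accumulated "ODE time"
sa_path, ode_times = [], []
for t in range(1, 5000 + 1):
    eta = 1.0 / (t + 10)
    theta += eta * (-(theta - 3.0) + rng.normal())
    s += eta                             # time rescaled by the step sizes
    sa_path.append(theta)
    ode_times.append(s)

ode_path = [3.0 * (1.0 - np.exp(-u)) for u in ode_times]
max_dev = max(abs(a - b) for a, b in zip(sa_path, ode_path))
print(max_dev)
```

The maximum deviation stays modest and the late iterates hug the ODE trajectory, since the fluctuations shrink with the step size.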

Why It Matters

The ODE method reduces stochastic convergence questions to deterministic stability questions. To prove Q-learning converges, you only need to show that the ODE $\dot{Q} = T^*Q - Q$ has a stable equilibrium (the optimal Q-function). The Bellman operator $T^*$ is a contraction in $\|\cdot\|_\infty$, so the ODE is globally stable, and convergence follows immediately.

Failure Mode

The ODE method requires the iterates to remain bounded a.s. (the "stability" condition). This is the hardest assumption to verify in practice. For linear stochastic approximation ($h(\theta) = A\theta + b$), stability holds when $A$ has eigenvalues with strictly negative real parts. For nonlinear problems, verifying stability often requires constructing a Lyapunov function.

Polyak-Ruppert Averaging

The Robbins-Monro iterates with $\eta_t = c/t$ converge at rate $O(1/\sqrt{t})$, but only when the constant $c$ is matched to the problem. Polyak (1990) and Ruppert (1988) independently discovered that averaging the iterates achieves the same optimal $O(1/\sqrt{t})$ rate with the optimal asymptotic constant, without requiring careful step size tuning.

Theorem

Polyak-Ruppert Averaging

Statement

Define the averaged iterate $\bar{\theta}_t = \frac{1}{t}\sum_{k=1}^{t} \theta_k$, where the underlying iterates use step sizes $\eta_t = c/t^\alpha$ with $\alpha \in (1/2, 1)$. Then:

$$\sqrt{t}\left(\bar{\theta}_t - \theta^*\right) \xrightarrow{d} \mathcal{N}\left(0, A^{-1}\Sigma (A^{-1})^\top\right)$$

where $A = \nabla h(\theta^*)$ is the Jacobian of the mean field at the root and $\Sigma$ is the noise covariance. This is the optimal asymptotic covariance: it matches the Cramér-Rao lower bound for the stochastic approximation problem.

Intuition

Each individual iterate $\theta_t$ is noisy and converges slowly. But the noise in consecutive iterates is nearly independent (because the step size is small). Averaging $t$ nearly-independent estimates reduces variance by a factor of $t$. The optimal asymptotic covariance $A^{-1}\Sigma(A^{-1})^\top$ is the best possible for any estimator that only observes the noisy function evaluations.

Why It Matters

Polyak-Ruppert averaging explains why "averaged SGD" works so well in practice: you can use a relatively large, slowly decaying step size (which gives fast initial progress) and then average to get optimal asymptotic variance. This is strictly better than carefully tuning the step size schedule. The result also connects stochastic approximation to classical statistics: the averaged iterate is asymptotically efficient in the sense of Cramér-Rao.
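A quick experiment (toy problem and constants invented for illustration) shows the effect: with the slowly decaying step $\eta_t = t^{-0.6}$, the last iterate stays noisy while its running average is far more accurate.

```python
import numpy as np

rng = np.random.default_rng(4)

# Last iterate vs Polyak-Ruppert average on h(theta) = -(theta - 3),
# with the slowly decaying step eta_t = t^(-0.6), alpha in (1/2, 1).
def one_run(n_steps=5000):
    theta, total = 0.0, 0.0
    for t in range(1, n_steps + 1):
        theta += t ** -0.6 * (-(theta - 3.0) + rng.normal())
        total += theta
    return theta, total / n_steps        # (last iterate, averaged iterate)

last, avg = zip(*(one_run() for _ in range(100)))
mse_last = np.mean((np.array(last) - 3.0) ** 2)
mse_avg = np.mean((np.array(avg) - 3.0) ** 2)
print(mse_last, mse_avg)   # averaging gives a much smaller MSE
```

The large step gives fast initial progress; the average then washes out the residual noise, exactly the regime the theorem describes.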

Failure Mode

The averaging result requires the step size exponent $\alpha \in (1/2, 1)$. If $\alpha = 1$ (the canonical $\eta_t = c/t$), the individual iterates already achieve the optimal rate (for the right constant $c$), but the optimal $c$ depends on the unknown Jacobian $A$. Averaging with $\alpha < 1$ avoids this sensitivity: you do not need to know $A$ to get the optimal rate. If $\alpha \leq 1/2$, the step sizes do not decay fast enough and the noise is not averaged out properly.

Linear Stochastic Approximation

The most tractable and well-understood case is when $h$ is linear: $h(\theta) = A\theta + b$ where $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$.

Definition

Linear Stochastic Approximation

The linear stochastic approximation iteration is:

$$\theta_{t+1} = \theta_t + \eta_t\left(A\theta_t + b + \xi_{t+1}\right)$$

where $A$ has all eigenvalues with strictly negative real parts, so $\theta^* = -A^{-1}b$ is the unique root of $h$. The associated ODE $\dot{\theta} = A\theta + b$ is a stable linear system.

This case is important because:

  • TD(0) with linear function approximation is linear SA with $A = \Phi^\top D(\gamma P^\pi \Phi - \Phi)$ and $b = \Phi^\top D r$
  • Least-squares SGD is linear SA with $A = -\mathbb{E}[x_t x_t^\top]$ and $b = \mathbb{E}[y_t x_t]$
  • For linear SA, the boundedness condition in the ODE method is automatic when $A$ is Hurwitz (all eigenvalues have strictly negative real parts)
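A toy linear SA run (matrix and offset invented for illustration; eigenvalues $-2$ and $-1$, so $A$ is Hurwitz) shows the iterate homing in on $\theta^* = -A^{-1}b$ without any explicit boundedness argument:

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear SA with an invented Hurwitz matrix (eigenvalues -2 and -1).
A = np.array([[-2.0, 1.0],
              [0.0, -1.0]])
b = np.array([1.0, 2.0])
theta_star = -np.linalg.solve(A, b)      # unique root theta* = -A^{-1} b

theta = np.zeros(2)
for t in range(1, 100_000 + 1):
    eta = 1.0 / (t + 100)
    theta += eta * (A @ theta + b + rng.normal(size=2))
print(theta, theta_star)   # the iterate approaches theta*
```

The same code diverges if an eigenvalue of $A$ is moved into the right half-plane, which is the numerical face of the Hurwitz requirement.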

Connections and Special Cases

| Algorithm | Mean field $h(\theta)$ | Noise $\xi_{t+1}$ | Contraction? |
| --- | --- | --- | --- |
| SGD | $-\nabla F(\theta)$ | Gradient noise | Yes (if strongly convex) |
| Q-learning | $T^*Q - Q$ | TD error | Yes ($\gamma$-contraction in $\ell^\infty$) |
| TD(0) | $T^\pi V - V$ | TD error | Yes ($\gamma$-contraction in $\ell^\infty$) |
| SARSA | $T^\pi Q - Q$ | TD error | Yes ($\gamma$-contraction) |
| Empirical risk | $P f - P_n f$ | Sampling noise | Problem-dependent |
Watch Out

Robbins-Monro vs Kiefer-Wolfowitz

The Robbins-Monro algorithm assumes you can observe $h(\theta) + \text{noise}$ directly (i.e., you have a noisy function evaluation). The Kiefer-Wolfowitz algorithm handles the case where you can only observe function values $F(\theta) + \text{noise}$, not gradients. It estimates the gradient by finite differences, adding an extra layer of approximation. SGD is Robbins-Monro (you observe stochastic gradients); gradient-free optimization is Kiefer-Wolfowitz.
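For contrast, here is a Kiefer-Wolfowitz sketch (toy objective and exponents chosen for illustration): only noisy values of $F$ are observed, and the gradient is approximated by a central difference with a shrinking width $c_t$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Kiefer-Wolfowitz: only noisy values of F are observable (toy objective).
def F_noisy(theta):
    return (theta - 2.0) ** 2 + 0.1 * rng.normal()   # minimizer theta* = 2

theta = 0.0
for t in range(1, 20_000 + 1):
    eta = 1.0 / t                  # Robbins-Monro step size
    c = 1.0 / t ** 0.25            # shrinking finite-difference width
    grad_est = (F_noisy(theta + c) - F_noisy(theta - c)) / (2.0 * c)
    theta -= eta * grad_est        # descend the estimated gradient
print(theta)   # close to the minimizer 2
```

Note the extra tension Kiefer-Wolfowitz introduces: shrinking $c_t$ kills the finite-difference bias but amplifies the noise in the gradient estimate, which is why its rates are slower than Robbins-Monro's.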

Watch Out

Step size conditions are sufficient, not necessary

The Robbins-Monro conditions $\sum \eta_t = \infty$, $\sum \eta_t^2 < \infty$ are sufficient for convergence but not necessary. For example, $\eta_t = c/t^\alpha$ with $\alpha \leq 1/2$ violates $\sum \eta_t^2 < \infty$, yet convergence can still hold in specific problems. The conditions are sharp for the general problem: there exist mean fields $h$ and noise sequences for which convergence fails if either condition is violated.

Key Results Timeline

  • 1951: Robbins and Monro prove the original convergence theorem for scalar root-finding with noisy observations.
  • 1952: Kiefer and Wolfowitz extend to optimization (gradient-free setting).
  • 1977: Ljung introduces the ODE method, connecting stochastic approximation to dynamical systems theory.
  • 1988-1990: Ruppert and Polyak independently discover iterate averaging achieves optimal rates without step size tuning.
  • 1994: Tsitsiklis proves Q-learning convergence using the ODE method.
  • 2000: Borkar and Meyn give the definitive treatment of the ODE method with weaker assumptions.

Exercises

ExerciseCore

Problem

Show that $\eta_t = c/t^\alpha$ satisfies the Robbins-Monro conditions if and only if $\alpha \in (1/2, 1]$. For each boundary case ($\alpha = 1/2$ and $\alpha = 1$), determine which condition holds and which fails.

ExerciseCore

Problem

Given a loss function $F(\theta) = \mathbb{E}_{z \sim P}[\ell(\theta, z)]$ and the SGD update $\theta_{t+1} = \theta_t - \eta_t \nabla \ell(\theta_t, z_t)$, identify the mean field $h(\theta)$ and the noise $\xi_{t+1}$ in the Robbins-Monro framework. What condition on $F$ ensures the stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$?

ExerciseAdvanced

Problem

Consider the one-dimensional problem: find the root of $h(\theta) = -\theta + 3$ (so $\theta^* = 3$). With noise $\xi_t \sim \mathcal{N}(0, 1)$ and step size $\eta_t = 1/t^{0.6}$, simulate 10,000 steps. Compare the MSE of $\theta_t$ vs the averaged $\bar{\theta}_t = \frac{1}{t}\sum_{k=1}^t \theta_k$ over 100 independent runs. What rate does each achieve? What is the asymptotic variance of $\bar{\theta}_t$ predicted by the Polyak-Ruppert theorem?

References

Canonical:

  • Robbins, H. and Monro, S., "A Stochastic Approximation Method," Annals of Mathematical Statistics 22(3), 1951. The original paper.
  • Borkar, V. S., Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008, Ch. 2-5. Modern reference for the ODE method.
  • Kushner, H. J. and Yin, G. G., Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed., Springer, 2003, Ch. 1-6, 8.
  • Benveniste, A., Metivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Springer, 1990, Ch. 1-3. Classical treatment with martingale arguments.
  • Polyak, B. T. and Juditsky, A. B., "Acceleration of Stochastic Approximation by Averaging," SIAM J. Control and Optimization 30(4), 1992. The averaging theorem.
  • Ljung, L., "Analysis of Recursive Stochastic Algorithms," IEEE Trans. Automatic Control 22(4), 1977. Original ODE method paper.

Current:

  • Szepesvari, C., Algorithms for Reinforcement Learning, Morgan and Claypool, 2010, Ch. 3. SA applied to RL.
  • Bottou, L., "Large-Scale Machine Learning with Stochastic Gradient Descent," COMPSTAT 2010, §3-4. Practical SGD perspective.
  • Bach, F. and Moulines, E., "Non-asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning," NeurIPS 2011, §3 on constant step-size behavior.
  • Srikant, R. and Ying, L., "Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning," COLT 2019, §3-4.

Frontier:

  • Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., Jordan, M. I., "On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-asymptotic Concentration," COLT 2020.
  • Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., Maguluri, S. T., "Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning," Automatica 146, 2022.

Last reviewed: April 2026
