

Stochastic Approximation Theory

The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.


Why This Matters

Stochastic gradient descent, Q-learning, temporal difference learning, and dozens of other iterative algorithms in ML and RL share a common skeleton: an update rule that moves toward a target using noisy observations. Stochastic approximation is the theory that explains why these algorithms converge despite the noise, and how fast.

If you understand stochastic approximation, you understand the convergence of SGD, Q-learning, and TD-learning as special cases of one theorem. Without it, each convergence proof looks like a separate miracle.

Mental Model

You want to find a root $\theta^*$ of an equation $h(\theta) = 0$, but you cannot evaluate $h$ directly. Instead, at each step you observe $h(\theta_t) + \xi_t$ where $\xi_t$ is zero-mean noise. The question: under what conditions on the step sizes $\{\eta_t\}$ and the noise $\{\xi_t\}$ does the sequence $\theta_t$ converge to $\theta^*$?

The answer is the Robbins-Monro theorem: if the step sizes decay at the right rate (slow enough to sum to infinity, fast enough for their squares to be summable) and the noise is controlled, then $\theta_t \to \theta^*$ almost surely.

Formal Setup

Consider the iteration:

$$\theta_{t+1} = \theta_t + \eta_t \left[h(\theta_t) + \xi_{t+1}\right]$$

where $\theta_t \in \mathbb{R}^d$ is the parameter, $\eta_t > 0$ is the step size, $h: \mathbb{R}^d \to \mathbb{R}^d$ is the mean field (the function whose root we seek), and $\xi_{t+1}$ is a noise term.
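A minimal simulation makes the setup concrete. The sketch below is illustrative (the function `robbins_monro` and the toy mean field are invented for this example, not taken from any library): it runs the iteration on a one-dimensional mean field with Gaussian observation noise and the canonical $c/t$ step size.

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(h, theta0, n_steps, c=1.0, noise_std=1.0):
    """Iterate theta_{t+1} = theta_t + eta_t * (h(theta_t) + noise), eta_t = c/t."""
    theta = theta0
    for t in range(1, n_steps + 1):
        xi = noise_std * rng.normal()         # zero-mean observation noise
        theta = theta + (c / t) * (h(theta) + xi)
    return theta

# Toy mean field h(theta) = -(theta - 3): unique root theta* = 3.
theta_hat = robbins_monro(lambda th: -(th - 3.0), theta0=0.0, n_steps=50_000)
print(theta_hat)   # close to 3
```

For this particular $h$ and $c = 1$, the iterate is exactly the running average of the noisy observations of $\theta^*$, which is why the estimate is so accurate.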

Definition

Robbins-Monro Conditions

The step size sequence $\{\eta_t\}_{t \geq 0}$ satisfies the Robbins-Monro conditions if:

$$\sum_{t=0}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=0}^{\infty} \eta_t^2 < \infty$$

The first condition ensures the iterates can reach any target (the steps do not decay too fast). The second ensures the accumulated noise variance is finite (the steps decay fast enough to average out noise).

The canonical choice is $\eta_t = c/t$ for some $c > 0$, which satisfies both conditions: the harmonic series diverges ($\sum 1/t = \infty$) while the squared series converges ($\sum 1/t^2 = \pi^2/6 < \infty$).

Watch Out

Why not constant step size?

A constant step size $\eta_t = \eta$ satisfies $\sum \eta_t = \infty$ but violates $\sum \eta_t^2 < \infty$. The iterates keep bouncing around $\theta^*$ with variance proportional to $\eta$: they converge to a neighborhood of $\theta^*$, not to $\theta^*$ itself. This is what practitioners actually use in SGD (a constant learning rate with eventual decay), and the theory predicts exactly the stationary distribution the iterates reach.
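The contrast can be checked numerically. In this illustrative sketch (toy mean field, invented helper `run`), a decaying schedule drives the variance of the final iterate toward zero, while a constant step leaves the iterate in a noise ball whose variance scales with $\eta$.

```python
import numpy as np

rng = np.random.default_rng(1)

def run(step_fn, n_steps=5000, theta_star=3.0):
    """Iterate theta += eta_t * (-(theta - theta*) + noise); return final theta."""
    theta = 0.0
    for t in range(1, n_steps + 1):
        theta += step_fn(t) * (-(theta - theta_star) + rng.normal())
    return theta

decaying = [run(lambda t: 1.0 / t) for _ in range(50)]   # Robbins-Monro schedule
constant = [run(lambda t: 0.1) for _ in range(50)]       # fixed learning rate
print(np.var(decaying), np.var(constant))   # constant steps leave a noise ball
```

Across the 50 runs, the spread of the final iterate under the constant step stays at a fixed level, while under the decaying schedule it shrinks with the horizon.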

The Robbins-Monro Theorem

Theorem

Robbins-Monro Convergence

Statement

Assume the Robbins-Monro step size conditions, martingale-difference noise with bounded conditional variance, and the stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$ for all $\theta \neq \theta^*$. Then the stochastic approximation iterates $\theta_{t+1} = \theta_t + \eta_t[h(\theta_t) + \xi_{t+1}]$ satisfy:

$$\theta_t \to \theta^* \quad \text{almost surely as } t \to \infty$$

where $\theta^*$ is the unique root of $h$.

Intuition

The noise averages out because the step sizes decay (second Robbins-Monro condition). The signal accumulates because the step sizes decay slowly enough (first condition). The stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$ ensures that the mean field always pushes the iterate back toward the root.

Proof Sketch

The standard proof uses a supermartingale argument. Define $V_t = \|\theta_t - \theta^*\|^2$ and expand:

$$V_{t+1} = V_t + 2\eta_t \langle \theta_t - \theta^*, h(\theta_t) \rangle + 2\eta_t \langle \theta_t - \theta^*, \xi_{t+1} \rangle + \eta_t^2 \|h(\theta_t) + \xi_{t+1}\|^2$$

Taking conditional expectation: the cross term with $\xi_{t+1}$ vanishes (martingale property). The inner product $\langle \theta_t - \theta^*, h(\theta_t) \rangle$ is negative by the stability condition. The $\eta_t^2$ term is summable. By the Robbins-Siegmund supermartingale convergence theorem, $V_t$ converges a.s., and since the negative drift is not summable unless $\theta_t \to \theta^*$, we conclude convergence.

Why It Matters

This single theorem implies the convergence of:

  • SGD: set $h(\theta) = -\nabla F(\theta)$ and $\xi_{t+1} = \nabla F(\theta_t) - \nabla f_{i_t}(\theta_t)$
  • Q-learning: set $h(Q) = T^*Q - Q$ where $T^*$ is the Bellman optimality operator
  • TD(0): set $h(V) = T^\pi V - V$ where $T^\pi$ is the Bellman evaluation operator

All are stochastic approximation with different mean fields $h$.
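To make the SGD case concrete, here is a small least-squares sketch (the problem data and step offset are invented for illustration): the stochastic gradient is an unbiased estimate of $\nabla F$, so the update is exactly a Robbins-Monro iteration with mean field $h(\theta) = -\nabla F(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Least-squares: F(theta) = E[(y - x.theta)^2]/2 with y = x.theta* + noise.
theta_star = np.array([1.0, -2.0])

theta = np.zeros(2)
for t in range(1, 50_000 + 1):
    x = rng.normal(size=2)
    y = x @ theta_star + 0.1 * rng.normal()
    stoch_grad = -(y - x @ theta) * x          # gradient of (y - x.theta)^2 / 2
    # Robbins-Monro form: h(theta) = -grad F(theta), xi = grad F - stoch_grad.
    theta -= (1.0 / (t + 10)) * stoch_grad     # offset keeps early steps modest
print(theta)   # close to theta_star
```

The small offset in the step size is a common practical tweak: it leaves $\sum \eta_t = \infty$ intact while avoiding wild early updates.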

Failure Mode

If the noise variance grows faster than the stability condition can control (e.g., $\mathbb{E}[\|\xi_{t+1}\|^2 \mid \mathcal{F}_t]$ grows polynomially with $\|\theta_t\|$), the iterates can diverge. This is exactly what happens with Q-learning under function approximation (the "deadly triad"): the mean field $h$ is no longer a contraction in the right norm, and the variance grows with the approximation error.

The ODE Method

The ODE method, due to Ljung (1977) and refined by Borkar and Meyn (2000), gives a more powerful framework for analyzing stochastic approximation.

Definition

Associated ODE

For the stochastic approximation iteration $\theta_{t+1} = \theta_t + \eta_t[h(\theta_t) + \xi_{t+1}]$, the associated ordinary differential equation is:

$$\frac{d\theta}{dt} = h(\theta)$$

The ODE method says: if the ODE has a globally asymptotically stable equilibrium $\theta^*$, then the stochastic approximation iterates track the ODE solution and converge to $\theta^*$.

Theorem

ODE Method for Stochastic Approximation

Statement

Under the Robbins-Monro step size conditions, with martingale-difference noise and almost surely bounded iterates, $\theta_t \to \theta^*$ almost surely. The interpolated process (piecewise linear through the iterates, with time rescaled by the step sizes) converges to the trajectory of the ODE $\dot{\theta} = h(\theta)$.

Intuition

Over short time windows, the noise averages out (law of large numbers). What remains is the deterministic drift $h(\theta)$, which is exactly the ODE. The stochastic iterates "track" the ODE solution, with the noise causing fluctuations that shrink as the step size decreases.

Proof Sketch

  1. Interpolate the discrete iterates into a continuous trajectory $\bar{\theta}(t)$ whose time increments are $\eta_t$.
  2. Show the noise contribution $\sum_{k=m}^{n} \eta_k \xi_{k+1}$ converges to zero over any fixed time window as $m \to \infty$ (by the martingale strong law).
  3. The remaining trajectory satisfies an integral equation that approximates the ODE.
  4. By Gronwall's inequality, the deviation from the ODE solution is controlled by the step size and noise residual.
  5. Conclude that $\bar{\theta}(t)$ converges to the ODE's stable equilibrium.
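The tracking statement can be checked numerically. In this illustrative sketch (toy one-dimensional problem, not from the references), we accumulate the step sizes as "ODE time" and compare the iterates against the closed-form ODE solution $\theta(s) = 3(1 - e^{-s})$ started from $\theta(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Mean field h(theta) = -(theta - 3); associated ODE dtheta/ds = -(theta - 3),
# whose solution from theta(0) = 0 is theta(s) = 3 * (1 - exp(-s)).
theta, s = 0.0, 0.0                      # SA iterate and accumulated "ODE time"
sa_path, ode_times = [], []
for t in range(1, 5000 + 1):
    eta = 1.0 / (t + 10)
    theta += eta * (-(theta - 3.0) + rng.normal())
    s += eta                             # time rescaled by the step sizes
    sa_path.append(theta)
    ode_times.append(s)

ode_path = [3.0 * (1.0 - np.exp(-u)) for u in ode_times]
max_dev = max(abs(a - b) for a, b in zip(sa_path, ode_path))
print(max_dev)
```

The maximum deviation stays modest and the late iterates hug the ODE trajectory, since the fluctuations shrink with the step size.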

Why It Matters

The ODE method reduces stochastic convergence questions to deterministic stability questions. To prove Q-learning converges, you only need to show that the ODE $\dot{Q} = T^*Q - Q$ has a stable equilibrium (the optimal Q-function). The Bellman operator $T^*$ is a contraction in $\|\cdot\|_\infty$, so the ODE is globally stable, and convergence follows immediately.

Failure Mode

The ODE method requires the iterates to remain bounded a.s. (the "stability" condition). This is the hardest assumption to verify in practice. For linear stochastic approximation ($h(\theta) = A\theta + b$), stability holds when $A$ has eigenvalues with strictly negative real parts. For nonlinear problems, verifying stability often requires constructing a Lyapunov function.

Polyak-Ruppert Averaging

The Robbins-Monro iterates with $\eta_t = c/t$ converge at rate $O(1/\sqrt{t})$, but only when the constant $c$ is matched to the problem. Polyak (1990) and Ruppert (1988) independently discovered that averaging the iterates achieves the same optimal $O(1/\sqrt{t})$ rate with the optimal asymptotic constant, without requiring careful step size tuning.

Theorem

Polyak-Ruppert Averaging

Statement

Define the averaged iterate $\bar{\theta}_t = \frac{1}{t}\sum_{k=1}^{t} \theta_k$, where the underlying iterates use step sizes $\eta_t = c/t^\alpha$ with $\alpha \in (1/2, 1)$. Then:

$$\sqrt{t}\left(\bar{\theta}_t - \theta^*\right) \xrightarrow{d} \mathcal{N}\left(0, A^{-1}\Sigma (A^{-1})^\top\right)$$

where $A = \nabla h(\theta^*)$ is the Jacobian of the mean field at the root and $\Sigma$ is the noise covariance. This is the optimal asymptotic covariance: it matches the Cramér-Rao lower bound for the stochastic approximation problem.

Intuition

Each individual iterate $\theta_t$ is noisy and converges slowly. But the noise in consecutive iterates is nearly independent (because the step size is small). Averaging $t$ nearly-independent estimates reduces variance by a factor of $t$. The optimal asymptotic covariance $A^{-1}\Sigma(A^{-1})^\top$ is the best possible for any estimator that only observes the noisy function evaluations.

Why It Matters

Polyak-Ruppert averaging explains why "averaged SGD" works so well in practice: you can use a relatively large, slowly decaying step size (which gives fast initial progress) and then average to get optimal asymptotic variance. This is strictly better than carefully tuning the step size schedule. The result also connects stochastic approximation to classical statistics: the averaged iterate is asymptotically efficient in the sense of Cramér-Rao.
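A quick experiment (toy problem and constants invented for illustration) shows the effect: with the slowly decaying step $\eta_t = t^{-0.6}$, the last iterate stays noisy while its running average is far more accurate.

```python
import numpy as np

rng = np.random.default_rng(4)

# Last iterate vs Polyak-Ruppert average on h(theta) = -(theta - 3),
# with the slowly decaying step eta_t = t^(-0.6), alpha in (1/2, 1).
def one_run(n_steps=5000):
    theta, total = 0.0, 0.0
    for t in range(1, n_steps + 1):
        theta += t ** -0.6 * (-(theta - 3.0) + rng.normal())
        total += theta
    return theta, total / n_steps        # (last iterate, averaged iterate)

last, avg = zip(*(one_run() for _ in range(100)))
mse_last = np.mean((np.array(last) - 3.0) ** 2)
mse_avg = np.mean((np.array(avg) - 3.0) ** 2)
print(mse_last, mse_avg)   # averaging gives a much smaller MSE
```

The large step gives fast initial progress; the average then washes out the residual noise, exactly the regime the theorem describes.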

Failure Mode

The averaging result requires the step size exponent $\alpha \in (1/2, 1)$. If $\alpha = 1$ (the canonical $\eta_t = c/t$), the individual iterates already achieve the optimal rate (for the right constant $c$), but the optimal $c$ depends on the unknown Jacobian $A$. Averaging with $\alpha < 1$ avoids this sensitivity: you do not need to know $A$ to get the optimal rate. If $\alpha \leq 1/2$, the step sizes do not decay fast enough and the noise is not averaged out properly.

Linear Stochastic Approximation

The most tractable and well-understood case is when $h$ is linear: $h(\theta) = A\theta + b$ where $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$.

Definition

Linear Stochastic Approximation

The linear stochastic approximation iteration is:

$$\theta_{t+1} = \theta_t + \eta_t\left(A\theta_t + b + \xi_{t+1}\right)$$

where $A$ has all eigenvalues with strictly negative real parts, so $\theta^* = -A^{-1}b$ is the unique root of $h$. The associated ODE $\dot{\theta} = A\theta + b$ is a stable linear system.

This case is important because:

  • TD(0) with linear function approximation is linear SA with $A = \Phi^\top D(\gamma P^\pi \Phi - \Phi)$ and $b = \Phi^\top D r$
  • Least-squares SGD is linear SA with $A = -\mathbb{E}[x_t x_t^\top]$ and $b = \mathbb{E}[y_t x_t]$
  • For linear SA, the boundedness condition in the ODE method is automatic when $A$ is Hurwitz (all eigenvalues have strictly negative real parts)
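A toy linear SA run (matrix and offset invented for illustration; eigenvalues $-2$ and $-1$, so $A$ is Hurwitz) shows the iterate homing in on $\theta^* = -A^{-1}b$ without any explicit boundedness argument:

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear SA with an invented Hurwitz matrix (eigenvalues -2 and -1).
A = np.array([[-2.0, 1.0],
              [0.0, -1.0]])
b = np.array([1.0, 2.0])
theta_star = -np.linalg.solve(A, b)      # unique root theta* = -A^{-1} b

theta = np.zeros(2)
for t in range(1, 100_000 + 1):
    eta = 1.0 / (t + 100)
    theta += eta * (A @ theta + b + rng.normal(size=2))
print(theta, theta_star)   # the iterate approaches theta*
```

The same code diverges if an eigenvalue of $A$ is moved into the right half-plane, which is the numerical face of the Hurwitz requirement.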

Connections and Special Cases

| Algorithm | Mean field $h(\theta)$ | Noise $\xi_{t+1}$ | Contraction? |
| --- | --- | --- | --- |
| SGD | $-\nabla F(\theta)$ | Gradient noise | Yes (if strongly convex) |
| Q-learning | $T^*Q - Q$ | TD error | Yes ($\gamma$-contraction in $\ell^\infty$) |
| TD(0) | $T^\pi V - V$ | TD error | Yes ($\gamma$-contraction in $\ell^\infty$) |
| SARSA | $T^\pi Q - Q$ | TD error | Yes ($\gamma$-contraction) |
| Empirical risk | $P f - P_n f$ | Sampling noise | Problem-dependent |
Watch Out

Robbins-Monro vs Kiefer-Wolfowitz

The Robbins-Monro algorithm assumes you can observe $h(\theta) + \text{noise}$ directly (i.e., you have a noisy function evaluation). The Kiefer-Wolfowitz algorithm handles the case where you can only observe function values $F(\theta) + \text{noise}$, not gradients. It estimates the gradient by finite differences, adding an extra layer of approximation. SGD is Robbins-Monro (you observe stochastic gradients); gradient-free optimization is Kiefer-Wolfowitz.
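For contrast, here is a Kiefer-Wolfowitz sketch (toy objective and exponents chosen for illustration): only noisy values of $F$ are observed, and the gradient is approximated by a central difference with a shrinking width $c_t$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Kiefer-Wolfowitz: only noisy values of F are observable (toy objective).
def F_noisy(theta):
    return (theta - 2.0) ** 2 + 0.1 * rng.normal()   # minimizer theta* = 2

theta = 0.0
for t in range(1, 20_000 + 1):
    eta = 1.0 / t                  # Robbins-Monro step size
    c = 1.0 / t ** 0.25            # shrinking finite-difference width
    grad_est = (F_noisy(theta + c) - F_noisy(theta - c)) / (2.0 * c)
    theta -= eta * grad_est        # descend the estimated gradient
print(theta)   # close to the minimizer 2
```

Note the extra tension Kiefer-Wolfowitz introduces: shrinking $c_t$ kills the finite-difference bias but amplifies the noise in the gradient estimate, which is why its rates are slower than Robbins-Monro's.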

Watch Out

Step size conditions are sufficient, not necessary

The Robbins-Monro conditions $\sum \eta_t = \infty$, $\sum \eta_t^2 < \infty$ are sufficient for convergence but not necessary. For example, $\eta_t = c/t^\alpha$ with $\alpha \leq 1/2$ violates $\sum \eta_t^2 < \infty$, yet convergence can still hold in specific problems. The conditions are sharp for the general problem: there exist mean fields $h$ and noise sequences for which convergence fails if either condition is violated.

Key Results Timeline

  • 1951: Robbins and Monro prove the original convergence theorem for scalar root-finding with noisy observations.
  • 1952: Kiefer and Wolfowitz extend to optimization (gradient-free setting).
  • 1977: Ljung introduces the ODE method, connecting stochastic approximation to dynamical systems theory.
  • 1988-1990: Ruppert and Polyak independently discover iterate averaging achieves optimal rates without step size tuning.
  • 1994: Tsitsiklis proves Q-learning convergence using the ODE method.
  • 2000: Borkar and Meyn give the definitive treatment of the ODE method with weaker assumptions.

Exercises

ExerciseCore

Problem

Show that $\eta_t = c/t^\alpha$ satisfies the Robbins-Monro conditions if and only if $\alpha \in (1/2, 1]$. For each boundary case ($\alpha = 1/2$ and $\alpha = 1$), determine which condition holds and which fails.

ExerciseCore

Problem

Given a loss function $F(\theta) = \mathbb{E}_{z \sim P}[\ell(\theta, z)]$ and the SGD update $\theta_{t+1} = \theta_t - \eta_t \nabla \ell(\theta_t, z_t)$, identify the mean field $h(\theta)$ and the noise $\xi_{t+1}$ in the Robbins-Monro framework. What condition on $F$ ensures the stability condition $\langle \theta - \theta^*, h(\theta) \rangle < 0$?

ExerciseAdvanced

Problem

Consider the one-dimensional problem: find the root of $h(\theta) = -\theta + 3$ (so $\theta^* = 3$). With noise $\xi_t \sim \mathcal{N}(0, 1)$ and step size $\eta_t = 1/t^{0.6}$, simulate 10,000 steps. Compare the MSE of $\theta_t$ vs the averaged $\bar{\theta}_t = \frac{1}{t}\sum_{k=1}^t \theta_k$ over 100 independent runs. What rate does each achieve? What is the asymptotic variance of $\bar{\theta}_t$ predicted by the Polyak-Ruppert theorem?

References

Canonical:

  • Robbins, H. and Monro, S., "A Stochastic Approximation Method," Annals of Mathematical Statistics 22(3), 1951. The original paper.
  • Borkar, V. S., Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008, Ch. 2-5. Modern reference for the ODE method.
  • Kushner, H. J. and Yin, G. G., Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed., Springer, 2003, Ch. 1-6, 8.
  • Benveniste, A., Metivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Springer, 1990, Ch. 1-3. Classical treatment with martingale arguments.
  • Polyak, B. T. and Juditsky, A. B., "Acceleration of Stochastic Approximation by Averaging," SIAM J. Control and Optimization 30(4), 1992. The averaging theorem.
  • Ljung, L., "Analysis of Recursive Stochastic Algorithms," IEEE Trans. Automatic Control 22(4), 1977. Original ODE method paper.

Current:

  • Szepesvari, C., Algorithms for Reinforcement Learning, Morgan and Claypool, 2010, Ch. 3. SA applied to RL.
  • Bottou, L., "Large-Scale Machine Learning with Stochastic Gradient Descent," COMPSTAT 2010, §3-4. Practical SGD perspective.
  • Bach, F. and Moulines, E., "Non-asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning," NeurIPS 2011, §3 on constant step-size behavior.
  • Srikant, R. and Ying, L., "Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning," COLT 2019, §3-4.

Frontier:

  • Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., Jordan, M. I., "On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-asymptotic Concentration," COLT 2020.
  • Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., Maguluri, S. T., "Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning," Automatica 146, 2022.

Last reviewed: April 2026
