
Mathematical Infrastructure

Hamilton–Jacobi–Bellman Equation

The PDE characterizing the value function of a continuous-time stochastic optimal control problem. The continuous-time analog of the discrete Bellman equation, the fully nonlinear PDE that nonlinear Feynman–Kac inverts via BSDEs, and the equation Deep BSDE solves numerically in high dimensions.

Advanced · Tier 2 · Stable · ~50 min

Why This Matters

The Hamilton–Jacobi–Bellman equation is the continuous-time Bellman equation: the PDE that the value function of a stochastic control problem must satisfy. The discrete Bellman recursion $V_t(x) = \min_a \{c(x, a) + \mathbb{E}[V_{t+1}(X')]\}$ becomes, after a Taylor expansion of the expectation in the time step, a nonlinear second-order PDE in $V$. Every result that holds for the discrete dynamic-programming equation — optimal substructure, the policy read off from the Bellman operator, the contraction-mapping convergence of value iteration — has a continuous-time analog phrased in HJB language.
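The discrete recursion is easy to sketch in code. Below is a minimal finite-horizon value iteration on a toy deterministic chain (so the expectation drops out); the states, actions, and costs are illustrative, not from the text:

```python
import numpy as np

n_states, horizon = 5, 10
actions = (-1, 0, 1)                       # move left / stay / move right (clipped)

def cost(x, a):
    # running cost c(x, a): distance from target state 2 plus control effort
    return (x - 2) ** 2 + 0.1 * a ** 2

V = np.array([(x - 2) ** 2 for x in range(n_states)], dtype=float)  # terminal cost
policy = np.zeros((horizon, n_states), dtype=int)

for t in reversed(range(horizon)):         # backward in time, as in the recursion
    V_next = V.copy()
    for x in range(n_states):
        q = [cost(x, a) + V_next[min(max(x + a, 0), n_states - 1)] for a in actions]
        policy[t, x] = int(np.argmin(q))
        V[x] = min(q)

print(V)          # V_0: cost-to-go from each state
print(policy[0])  # index into `actions` of the optimal first move per state
```

The backward loop is the discrete analog of solving HJB from the terminal condition back to time zero.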

HJB is also the canonical fully nonlinear parabolic PDE that arises from probability. The nonlinearity sits inside an infimum (or supremum) over the control variable, and that infimum is exactly the Hamiltonian of the problem. The Feynman–Kac formula handles the linear case (no control, drift fixed); the BSDE machinery of Pardoux and Peng (1992) extends to semilinear PDEs; HJB sits at the top of this hierarchy as the fully nonlinear PDE that BSDEs with a control-dependent driver solve in the most general setting.

In modern ML, HJB is the equation that continuous-time reinforcement learning, optimal stopping, optimal execution in finance, and robotic control problems all reduce to. The grid-based curse of dimensionality makes classical HJB solvers useless above $d \approx 6$, which is the entire reason the Deep BSDE method and DGM / PINN-style PDE solvers exist: to approximate $V$ in regimes where finite differences cannot.

A useful slogan: Fokker–Planck moves densities forward, Feynman–Kac moves linear value functions backward, HJB moves optimized value functions backward. The three together cover the standard PDE-SDE dictionary for stochastic control.

Mental Model

The principle of optimality says: an optimal trajectory from $(t, x)$ is also optimal on every sub-interval $[s, T]$ for $s > t$, given the state $X_s$ reached at time $s$. Apply this principle infinitesimally. For a small time step $dt$, the optimal cost from $(t, x)$ equals the running cost incurred over $[t, t + dt]$ plus the value at $(t + dt, X_{t + dt})$, minimized over the control choice on that interval. Taylor-expand both sides in $dt$, take the limit $dt \to 0$, and the result is a PDE: the infimum over controls of (running cost plus generator of $V$) equals $-\partial_t V$. That PDE is HJB.
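Written out, the expansion is (a sketch, assuming $V$ is smooth enough for an Itô–Taylor expansion):

```latex
% Infinitesimal dynamic programming:
V(t, x) = \inf_{u}\Big\{ f(x, u)\,dt
        + \mathbb{E}\big[V(t + dt, X_{t+dt})\big] \Big\} + o(dt),
% Ito-Taylor expansion of the expectation under control u:
\mathbb{E}\big[V(t + dt, X_{t+dt})\big]
  = V(t, x) + \Big(\partial_t V + b(x, u)\cdot\nabla V
  + \tfrac{1}{2}\operatorname{tr}\big(\sigma\sigma^\top(x, u)\,\nabla^2 V\big)\Big)\,dt + o(dt).
% Subtract V(t, x), divide by dt, let dt -> 0:
0 = \partial_t V + \inf_{u \in U}\Big\{ b\cdot\nabla V
  + \tfrac{1}{2}\operatorname{tr}\big(\sigma\sigma^\top \nabla^2 V\big) + f \Big\}.
```

The last line is exactly the HJB equation stated formally below.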

The supremum (or infimum) over controls in HJB is the dynamic-programming analog of the $\max_a$ in the discrete Bellman operator. Reading off the control that achieves the infimum gives the optimal feedback policy $u^*(t, x)$.

Formal Statement

Definition

Hamilton–Jacobi–Bellman Equation

Fix a horizon $T > 0$ and an admissible control set $U \subseteq \mathbb{R}^m$. Consider the controlled SDE $dX_s = b(X_s, u_s)\,ds + \sigma(X_s, u_s)\,dB_s$ on $[t, T]$ with $X_t = x$, where $u: [0, T] \to U$ is a progressively measurable control process. The cost functional is

$$J(t, x; u) = \mathbb{E}\!\left[\int_t^T f(X_s, u_s)\,ds + g(X_T) \,\Big|\, X_t = x\right],$$

with running cost $f$ and terminal cost $g$. The value function is $V(t, x) = \inf_u J(t, x; u)$, where the infimum is over admissible controls.

Under regularity, $V \in C^{1,2}([0, T) \times \mathbb{R}^d) \cap C([0, T] \times \mathbb{R}^d)$ satisfies the HJB equation

$$\partial_t V(t, x) + \inf_{u \in U}\!\left\{ b(x, u)\cdot \nabla V(t, x) + \tfrac{1}{2}\,\operatorname{tr}\!\big(\sigma \sigma^\top(x, u)\, \nabla^2 V(t, x)\big) + f(x, u) \right\} = 0,$$

with terminal condition $V(T, x) = g(x)$. The bracketed expression is the Hamiltonian $H(x, p, M, u)$ evaluated at $p = \nabla V$, $M = \nabla^2 V$. Maximization (rather than minimization) gives the same PDE with $\sup_u$ replacing $\inf_u$, used in reward-maximizing formulations.

The equation is fully nonlinear: the infimum over $u$ couples drift, diffusion, and running cost in a way that is not affine in $\nabla V$ or $\nabla^2 V$. This is what distinguishes HJB from the linear backward Kolmogorov equation that Feynman–Kac inverts.

The Verification Theorem

The HJB equation is a necessary condition on the value function under regularity. The verification theorem is the converse: a smooth solution of HJB whose infimum is attained by a measurable feedback control $u^*(t, x)$ is the value function, and $u^*$ is optimal. This is the workhorse result that turns "find $V$ satisfying a PDE" into "you have just solved the control problem."

Theorem

HJB Verification Theorem

Statement

Let $W \in C^{1,2}([0, T) \times \mathbb{R}^d) \cap C([0, T] \times \mathbb{R}^d)$ with polynomial growth solve the HJB equation with terminal condition $W(T, x) = g(x)$, and suppose the infimum in the Hamiltonian is attained by a measurable feedback $u^*(t, x)$. Then $W(t, x) = V(t, x)$ for all $(t, x) \in [0, T] \times \mathbb{R}^d$, and the feedback control $u^*_s = u^*(s, X^*_s)$, where $X^*$ solves the closed-loop SDE $dX^*_s = b(X^*_s, u^*(s, X^*_s))\,ds + \sigma(X^*_s, u^*(s, X^*_s))\,dB_s$ with $X^*_t = x$, is optimal: $J(t, x; u^*) = V(t, x)$.

Intuition

Apply Itô's formula to $W(s, X_s)$ along an arbitrary admissible control $u$ on $[t, T]$. The drift of $W(s, X_s)$ is $\partial_s W + b \cdot \nabla W + \tfrac{1}{2}\operatorname{tr}(\sigma \sigma^\top \nabla^2 W)$, which the HJB inequality bounds below by $-f(X_s, u_s)$ for every choice of $u$. Integrating gives $W(t, x) \le J(t, x; u)$, so $W \le V$. For the optimal feedback $u^*$, the HJB equation holds with equality and the bound becomes tight, giving $W = V$.

Proof Sketch

For arbitrary admissible $u$, apply Itô to $W(s, X_s)$ on $[t, T]$:

$$W(T, X_T) - W(t, X_t) = \int_t^T \!\big(\partial_s W + b(X_s, u_s) \cdot \nabla W + \tfrac{1}{2}\operatorname{tr}(\sigma \sigma^\top \nabla^2 W)\big)(s, X_s)\,ds + \int_t^T (\nabla W)^\top \sigma\,dB_s.$$

The HJB equation gives $\partial_s W + b \cdot \nabla W + \tfrac{1}{2} \operatorname{tr}(\sigma \sigma^\top \nabla^2 W) + f \ge 0$ pointwise (since the infimum over $u$ is the smallest value), with equality at $u = u^*(s, X_s)$. Take expectations; the stochastic integral is a martingale (polynomial growth plus BDG), so

$$\mathbb{E}[g(X_T)] - W(t, x) \ge -\mathbb{E}\!\int_t^T f(X_s, u_s)\,ds,$$

which rearranges to $W(t, x) \le J(t, x; u)$. Equality holds along $u^*$, so $W = V$ and $u^*$ achieves the infimum.

Why It Matters

This is the bridge between PDE analysis and control. Solve the HJB PDE analytically or numerically; read the optimal feedback $u^*(t, x)$ off the argmin in the Hamiltonian; the resulting closed-loop SDE is guaranteed optimal among all admissible controls. Without verification, the HJB equation would just be a necessary condition and you would still need a separate optimality proof; with verification, the PDE is the optimality certificate: $W = V$ (the value function), and $u^*(t, X_t)$ is an optimal control.

Failure Mode

The smoothness assumption $W \in C^{1,2}$ fails for many problems of practical interest: optimal stopping (where $V$ has a free boundary and $\nabla^2 V$ jumps), singular control, problems with state constraints, and degenerate diffusions where $\sigma \sigma^\top$ is rank-deficient. In all these cases the classical verification theorem does not apply directly, and one needs viscosity solutions or a regularization argument. A second failure mode: the infimum may not be attained inside $U$ (e.g., if $U$ is open or unbounded), in which case the candidate feedback $u^*(t, x)$ is undefined and the closed-loop SDE has no strong solution.

Viscosity Solutions

For most realistic stochastic control problems the value function is not $C^{1,2}$ and the classical verification theorem does not apply. The right notion of "solution" is the viscosity solution of Crandall and Lions (1983), extended to second-order PDEs by Crandall, Ishii, and Lions (1992).

The idea: replace pointwise differentiation of $V$ with a test-function inequality. $V$ is a viscosity sub-solution if, for every smooth test function $\varphi$ such that $V - \varphi$ attains a local maximum at $(t_0, x_0)$, the HJB operator applied to $\varphi$ is non-positive at $(t_0, x_0)$. A super-solution satisfies the dual inequality at local minima. A viscosity solution is both. This sidesteps the need for $V$ to be twice differentiable: the test function carries the derivatives, and the inequality only constrains $V$ at points where smooth functions can "touch" it.

Two facts make this framework load-bearing for HJB. First, under mild assumptions (continuity of $b, \sigma, f, g$, polynomial growth) the value function $V$ is the unique continuous viscosity solution of HJB. Second, viscosity solutions are stable under uniform convergence, so numerical schemes that approximate the operator (monotone finite differences, semi-Lagrangian methods, BSDE schemes) converge to the viscosity solution under Barles–Souganidis-style consistency conditions. Ishii's lemma is the key technical tool for the comparison principle that gives uniqueness.

Connection to Feynman–Kac and BSDEs

Strip the control out of HJB. With $b$ and $\sigma$ fixed and the infimum dropped, the equation becomes the linear backward PDE

$$\partial_t V + b \cdot \nabla V + \tfrac{1}{2}\operatorname{tr}(\sigma \sigma^\top \nabla^2 V) + f(x) = 0, \quad V(T, x) = g(x),$$

which is exactly what the Feynman–Kac formula inverts: $V(t, x) = \mathbb{E}[g(X_T) + \int_t^T f(X_s)\,ds \mid X_t = x]$. So the linear, no-control HJB is Feynman–Kac. The expectation representation is the value function of the trivial control problem with no decisions to make.
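This representation is easy to sanity-check by simulation. A hedged sketch: take the control-free case $dX = \sigma\,dB$ with $f = 0$ and $g(x) = x^2$, where the backward PDE has the closed-form solution $V(t, x) = x^2 + \sigma^2 (T - t)$, and compare a Monte Carlo estimate against it (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, t0, x0 = 0.5, 1.0, 0.0, 1.0
n_paths, n_steps = 200_000, 100
dt = (T - t0) / n_steps

# Euler-Maruyama for dX = sigma dB; exact in distribution here since b = 0
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X = X + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

v_mc = np.mean(X ** 2)                     # Feynman-Kac: E[g(X_T) | X_{t0} = x0]
v_exact = x0 ** 2 + sigma ** 2 * (T - t0)  # closed-form solution: 1.25
print(v_mc, v_exact)
```

The agreement to Monte Carlo accuracy is the Feynman–Kac identity in action: the PDE solution is a conditional expectation over forward sample paths.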

Restore the control, and the equation becomes nonlinear in $(\nabla V, \nabla^2 V)$ through the infimum. The nonlinear Feynman–Kac formula of Pardoux and Peng (1992) extends the representation: the value function is now $V(t, x) = Y_t$ where $(Y_t, Z_t)$ solve a backward SDE

$$Y_t = g(X_T) + \int_t^T H^*(X_s, Z_s)\,ds - \int_t^T Z_s^\top dB_s,$$

with driver $H^*(x, z) = \inf_u \{f(x, u) + b(x, u) \cdot \sigma^{-\top} z + \dots\}$ (the precise form depends on whether the control enters the diffusion). The BSDE pair $(Y, Z)$ encodes both $V$ and $\sigma^\top \nabla V$ along sample paths of $X$, and this structure is what the Deep BSDE method exploits to solve HJB in $d = 100$: parameterize $Z_t \approx \phi_\theta(X_t)$ with a neural network at each time step.

Worked Example: Linear-Quadratic-Gaussian Control

Take linear dynamics, quadratic cost, additive Gaussian noise:

$$dX_s = (A X_s + B u_s)\,ds + \sigma\,dB_s, \quad J = \mathbb{E}\!\left[\int_t^T (X_s^\top Q X_s + u_s^\top R u_s)\,ds + X_T^\top S X_T\right],$$

with $Q, S$ symmetric positive semidefinite, $R$ symmetric positive definite, and $\sigma$ a constant matrix (control-independent diffusion). Guess that the value function is quadratic in $x$: $V(t, x) = x^\top P(t) x + r(t)$ for some matrix $P(t)$ and scalar $r(t)$ to be determined.

Compute $\nabla V = 2 P(t) x$ and $\nabla^2 V = 2 P(t)$. Substitute into HJB:

$$x^\top \dot P x + \dot r + \inf_{u}\!\big\{2 x^\top P (A x + B u) + \operatorname{tr}(\sigma \sigma^\top P) + x^\top Q x + u^\top R u\big\} = 0.$$

The minimand is an unconstrained quadratic in $u$; setting the gradient $2 B^\top P x + 2 R u$ to zero gives $u^* = -R^{-1} B^\top P x$, a linear feedback of the state. Plug back in:

$$x^\top \dot P x + \dot r + 2 x^\top P A x - x^\top P B R^{-1} B^\top P x + \operatorname{tr}(\sigma \sigma^\top P) + x^\top Q x = 0.$$

Symmetrizing $2 x^\top P A x = x^\top (P A + A^\top P) x$ and matching the $x^\top (\cdot)\, x$ and constant terms separately gives the matrix Riccati ODE

$$\dot P(t) + P A + A^\top P - P B R^{-1} B^\top P + Q = 0, \quad P(T) = S,$$

and $\dot r(t) + \operatorname{tr}(\sigma \sigma^\top P(t)) = 0$, $r(T) = 0$. The Riccati equation is what the HJB PDE collapses to under the LQG ansatz: a finite-dimensional ODE in the matrix $P(t)$, solvable by standard ODE integrators in any dimension where you can store $P$.
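As a sketch of that last claim, the Riccati ODE can be integrated backward from $P(T) = S$ with a fixed-step RK4 scheme. The double-integrator coefficients below are illustrative, not from the text:

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])    # double-integrator dynamics
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                              # state cost x^T Q x
R = np.array([[1.0]])                      # control cost u^T R u
S = np.zeros((2, 2))                       # terminal cost X_T^T S X_T
T, n_steps = 10.0, 2000
dt = T / n_steps
R_inv = np.linalg.inv(R)

def rhs(P):
    # In tau = T - t, the ODE  P' + PA + A^T P - P B R^{-1} B^T P + Q = 0
    # becomes dP/dtau = PA + A^T P - P B R^{-1} B^T P + Q, run forward in tau.
    return P @ A + A.T @ P - P @ B @ R_inv @ B.T @ P + Q

P = S.copy()
for _ in range(n_steps):                   # tau: 0 -> T, i.e. t: T -> 0
    k1 = rhs(P)
    k2 = rhs(P + 0.5 * dt * k1)
    k3 = rhs(P + 0.5 * dt * k2)
    k4 = rhs(P + dt * k3)
    P = P + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

K = R_inv @ B.T @ P                        # optimal feedback u*(0, x) = -K x
# For this long horizon, P(0) is close to the steady-state CARE solution
# [[sqrt(3), 1], [1, sqrt(3)]], so K is close to [1, sqrt(3)].
print(P)
print(K)
```

The gain row `K` is the finite-horizon LQR feedback at $t = 0$; for long horizons it settles to the infinite-horizon gain, which is why time-invariant LQR controllers work in practice.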

Two consequences worth flagging. First, the optimal control is linear in the state with gain $K(t) = R^{-1} B^\top P(t)$ — this is the classical LQR result, and it is the reason LQG / iLQR / DDP underlie so much of model-based RL and trajectory optimization. Second, the noise $\sigma$ enters $V$ only through the additive scalar $r(t)$ and not through $P(t)$: the optimal feedback is certainty-equivalent — solve the deterministic problem, ignore the noise, and you get the same controller. Certainty equivalence is special to LQG and breaks immediately when the diffusion depends on the state or control (multiplicative noise) or when costs are not quadratic.

Common Confusions

Watch Out

HJB is for the value function, not the optimal policy directly

The equation solves for $V(t, x)$. The optimal feedback $u^*(t, x)$ is read off as the argmin (or argmax) inside the Hamiltonian: $u^*(t, x) = \operatorname{argmin}_u \{b(x, u) \cdot \nabla V + \tfrac{1}{2}\operatorname{tr}(\sigma \sigma^\top \nabla^2 V) + f(x, u)\}$. You cannot solve for $u^*$ without first having $V$ (or a parametric guess for $V$, as in the LQG example), which is why "policy iteration in continuous time" alternates between solving a linear PDE for $V$ given $u$ and updating $u$ from the argmin. The HJB equation itself is the fixed point of this alternation.

Watch Out

HJB runs backward in time; Fokker–Planck runs forward

HJB has a terminal condition $V(T, x) = g(x)$ and is solved backward from $t = T$ to $t = 0$. Its dual, Fokker–Planck, has an initial condition $p(0, x) = p_0(x)$ and is solved forward from $t = 0$ to $t = T$. They use the generator $\mathcal{L}$ and its adjoint $\mathcal{L}^*$ respectively. Confusing the time direction is a common implementation bug: the Euler-step update for HJB has the opposite sign on the time derivative compared to forward parabolic solvers.
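A minimal illustration of that sign convention, for the control-free linear backward PDE $\partial_t V + \tfrac{1}{2}\sigma^2 V_{xx} = 0$ with $g(x) = x^2$, where the exact solution $x^2 + \sigma^2(T - t)$ is available to check against (all grid parameters illustrative):

```python
import numpy as np

sigma, T = 0.5, 1.0
x = np.linspace(-3.0, 3.0, 121)            # spatial grid, dx = 0.05
dx = x[1] - x[0]
n_steps = 200
dt = T / n_steps                           # explicit stability: sigma^2/2 * dt/dx^2 <= 1/2

V = x ** 2                                 # terminal condition V(T, x) = g(x)
t = T
for _ in range(n_steps):
    Vxx = np.zeros_like(V)
    Vxx[1:-1] = (V[2:] - 2.0 * V[1:-1] + V[:-2]) / dx ** 2
    # Backward-in-time step: V(t - dt) = V(t) + dt * (sigma^2/2) V_xx.
    # Note the PLUS sign -- the opposite update of a forward parabolic solver.
    V = V + dt * 0.5 * sigma ** 2 * Vxx
    t -= dt
    # Dirichlet boundaries pinned to the known exact solution for this check
    V[0] = x[0] ** 2 + sigma ** 2 * (T - t)
    V[-1] = x[-1] ** 2 + sigma ** 2 * (T - t)

i0 = int(np.argmin(np.abs(x - 1.0)))       # grid point at x = 1
print(V[i0])                               # exact solution there: 1 + 0.25 = 1.25
```

Flipping the plus to a minus here is exactly the bug described above: it turns the scheme into an (unstable) forward solve of a backward equation.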

Watch Out

Classical HJB grid solvers blow up exponentially in dimension

A finite-difference grid for $V(t, x)$ with $n$ points per axis costs $n^d$ memory; for $d = 100$ this is hopeless. This is the entire motivation for Deep BSDE, DGM, PINNs in control, and policy-gradient methods in continuous-time RL: they sidestep the grid by sampling $X$ trajectories (Monte Carlo, polynomial in $d$) and parameterizing $V$ or $\nabla V$ with neural networks. The trade-off is approximation error in $V$ versus exponential blow-up in storage; for $d \gtrsim 6$ the trade-off favors approximation every time.
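The arithmetic behind this claim, with illustrative sizes:

```python
# Back-of-envelope storage for a full value-function grid (float64 values)
n = 100                       # grid points per axis
for d in (2, 6, 100):
    grid_bytes = 8 * n ** d   # 8 bytes per stored value
    print(d, grid_bytes, "bytes")

# A Monte Carlo / sampling alternative stores paths instead: linear in d
paths, steps, d = 10 ** 5, 100, 100
mc_bytes = 8 * paths * steps * d
print("MC:", mc_bytes, "bytes")   # 8 GB in d = 100, vs 8 * 100^100 for the grid
```

Already at $d = 6$ the grid needs 8 TB; at $d = 100$ the number of grid values exceeds the number of atoms in the observable universe, while path storage stays in the gigabytes.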

Exercises

ExerciseCore

Problem

Specialize the LQG worked example to the scalar case: $dX_s = (a X_s + b u_s)\,ds + \sigma\,dB_s$ with running cost $q X_s^2 + r u_s^2$ and terminal cost $s X_T^2$, all coefficients positive scalars. Derive the scalar Riccati ODE for $P(t)$ and the optimal feedback $u^*(t, x)$ from HJB by direct substitution.

ExerciseAdvanced

Problem

Show that the HJB equation reduces to the linear backward PDE that Feynman–Kac inverts when there is no control: take $U = \{u_0\}$ a single point, drift $b(x, u_0) = b_0(x)$, diffusion $\sigma(x, u_0) = \sigma_0(x)$, running cost $f(x, u_0) = f_0(x)$, and verify that the Feynman–Kac representation $V(t, x) = \mathbb{E}[g(X_T) + \int_t^T f_0(X_s)\,ds \mid X_t = x]$ recovers exactly the value function defined by the cost integral.

References

No canonical references provided.


Last reviewed: April 18, 2026

