
Mathematical Infrastructure

Backward Stochastic Differential Equations

The Pardoux–Peng framework: an SDE with a terminal condition and an adapted solution pair (Y_t, Z_t). Linear BSDEs reduce to Feynman–Kac; nonlinear BSDEs are dual to Hamilton–Jacobi–Bellman PDEs and are the mathematical object that the deep BSDE method approximates.


Why This Matters

A linear parabolic PDE has a Feynman–Kac representation as an expectation over SDE trajectories. A nonlinear parabolic PDE — one whose lower-order term depends on the solution $u$ and its gradient $\nabla u$ themselves — does not. The clean expectation breaks because the integrand inside the expectation depends on the unknown solution, and you cannot Monte Carlo what you do not yet know. The fix is the backward stochastic differential equation of Pardoux and Peng (1990): an SDE that you solve backward from a terminal condition, with an extra adapted process $Z_t$ that enforces measurability.

BSDEs are the natural mathematical object on the bridge between Feynman–Kac (linear case) and Hamilton–Jacobi–Bellman (fully nonlinear case). The driver $f$ in a BSDE plays the role of the nonlinear lower-order term in a semilinear parabolic PDE; the pair $(Y_t, Z_t)$ encodes both the solution and its diffusion-weighted gradient along an SDE path. When $f$ is linear in $(y, z)$, the BSDE collapses to classical Feynman–Kac. When $f$ is convex in $z$, the BSDE represents a stochastic-control value function and recovers the HJB equation.

The historical path is older than 1990. Bismut (1973) wrote down a linear BSDE as the adjoint equation in stochastic control. Pardoux and Peng's contribution was the existence-uniqueness theorem in the nonlinear Lipschitz setting, which made BSDEs an autonomous object rather than a side-equation in optimization. El Karoui, Peng, and Quenez (1997) then turned BSDEs into a working tool for mathematical finance: pricing under constraints, recursive utility, $g$-expectations.

The reason BSDEs matter for ML is downstream. The deep BSDE method of Han, Jentzen, and E (2018) parameterizes the $Z$ process with a neural network and solves the BSDE by forward shooting under a terminal loss. It is one of the few methods that handle semilinear parabolic PDEs in $d = 100$ dimensions with a few percent error. Every line of the algorithm is a discretization of the Pardoux–Peng BSDE and inherits its existence guarantees.

Mental Model

A forward SDE specifies an initial condition and runs forward in time. A BSDE specifies a terminal condition $Y_T = \xi$ and asks for a process $Y_t$ that lands on $\xi$ at time $T$. The catch: $Y_t$ must be adapted to the forward Brownian filtration. You cannot just solve $Y_t = \mathbb{E}[\xi + \int_t^T f(s, Y_s, Z_s)\,ds \mid \mathcal{F}_t]$ pointwise, because that requires knowing the future $Y_s$ for $s > t$ to compute the integrand.

The trick is to add a second unknown process $Z_t$ — a "control" or "martingale-representation" coefficient — and solve the system jointly. The pair $(Y, Z)$ is what makes backward solvability work. The role of $Z$ becomes transparent in the Markovian case: when $Y_t = u(t, X_t)$ for some function $u$ and forward state $X_t$, the martingale representation theorem forces $Z_t = \sigma^\top(X_t)\,\nabla u(t, X_t)$. So $Z$ is the diffusion-weighted gradient of the value function along the path. This is exactly what the deep BSDE method parameterizes.

Formal Statement

Definition

Backward Stochastic Differential Equation

Fix a horizon $T > 0$, a $d$-dimensional Brownian motion $B_t$ on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, \mathbb{P})$, and a terminal random variable $\xi \in L^2(\mathcal{F}_T)$. A driver is a measurable function $f: [0, T] \times \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}$. A solution to the BSDE with terminal value $\xi$ and driver $f$ is a pair of adapted processes $(Y_t, Z_t) \in \mathbb{R} \times \mathbb{R}^d$ satisfying

$$Y_t = \xi + \int_t^T f(s, Y_s, Z_s)\,ds - \int_t^T Z_s^\top\,dB_s, \qquad t \in [0, T],$$

or equivalently in differential form $-dY_t = f(t, Y_t, Z_t)\,dt - Z_t^\top\,dB_t$, with terminal condition $Y_T = \xi$. The solution lives in $S^2 \times H^2$: $Y \in S^2$ means $\mathbb{E}[\sup_{t \le T} |Y_t|^2] < \infty$, and $Z \in H^2$ means $\mathbb{E}[\int_0^T |Z_t|^2\,dt] < \infty$.

The vector-valued generalization replaces $Y_t \in \mathbb{R}$ by $Y_t \in \mathbb{R}^k$ and $Z_t$ by a $k \times d$ matrix; the same theory carries through with obvious notational changes.

The minus sign on the stochastic integral is the convention that makes $\int_t^T Z_s^\top\,dB_s$ a forward Itô integral; the integration variable $s$ runs forward in time even though the equation is "solved backward" from the terminal condition. This is not pathwise time reversal; the filtration is still the forward Brownian filtration. The "backward" in BSDE refers strictly to the direction in which the boundary condition is imposed.

Pardoux–Peng Existence and Uniqueness

Theorem

Pardoux–Peng Existence and Uniqueness Theorem

Statement

Assume the driver $f$ is uniformly Lipschitz in $(y, z)$ and satisfies $\mathbb{E}[\int_0^T |f(t, 0, 0)|^2\,dt] < \infty$. Then the BSDE $Y_t = \xi + \int_t^T f(s, Y_s, Z_s)\,ds - \int_t^T Z_s^\top\,dB_s$ has a unique solution $(Y, Z) \in S^2 \times H^2$. Moreover, the solution depends continuously on the data $(\xi, f)$ in the natural $L^2$-norms, and a comparison principle holds for scalar $Y$: if $\xi_1 \le \xi_2$ almost surely and $f_1 \le f_2$ pointwise, then $Y^{(1)}_t \le Y^{(2)}_t$ almost surely for every $t$.

Intuition

The driver $f$ is Lipschitz, which makes the operator that maps a candidate $(Y, Z)$ to the next iterate (defined via conditional expectation against the terminal value) a contraction in a suitable weighted norm. Banach fixed-point gives existence and uniqueness. The role of $Z$ is forced by the martingale representation theorem: any square-integrable martingale on the Brownian filtration is a stochastic integral against $B$, and $Z$ is the integrand that makes $Y$ an Itô process.

Proof Sketch

Define the map $\Phi: H^2 \times H^2 \to H^2 \times H^2$ as follows. Given a candidate $(y, z)$, set $M_t = \mathbb{E}[\xi + \int_0^T f(s, y_s, z_s)\,ds \mid \mathcal{F}_t]$. By the martingale representation theorem there is a unique $Z' \in H^2$ with $M_t = M_0 + \int_0^t Z_s'^\top\,dB_s$. Define $Y'_t = M_t - \int_0^t f(s, y_s, z_s)\,ds$, equivalently $Y'_t = \xi + \int_t^T f(s, y_s, z_s)\,ds - \int_t^T Z_s'^\top\,dB_s$. Then $\Phi(y, z) = (Y', Z')$. Equip $H^2 \times H^2$ with the weighted norm $\|(y, z)\|_\beta^2 = \mathbb{E}[\int_0^T e^{\beta t}(|y_t|^2 + |z_t|^2)\,dt]$. Itô's formula applied to $e^{\beta t}|Y'^{(1)}_t - Y'^{(2)}_t|^2$, together with the Lipschitz hypothesis on $f$, gives a contraction estimate $\|\Phi(y_1, z_1) - \Phi(y_2, z_2)\|_\beta \le \rho(\beta)\,\|(y_1, z_1) - (y_2, z_2)\|_\beta$ with $\rho(\beta) < 1$ for $\beta$ large enough. Banach fixed-point closes the proof.
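A toy numerical check of the Picard scheme, in the simplest degenerate case: a deterministic driver $f(s, y, z) = a y + b$ with constant terminal value, so $Z \equiv 0$, the conditional expectations are trivial, and each sweep of $\Phi$ is just a backward integral. The iterates converge geometrically to the closed form $Y_t = e^{a(T-t)}\xi + \tfrac{b}{a}(e^{a(T-t)} - 1)$. This is a sketch under those simplifying assumptions, not the full stochastic iteration.

```python
import numpy as np

def picard_bsde(a=0.5, b=1.0, xi=2.0, T=1.0, N=1000, iters=12):
    """Picard iteration for the deterministic BSDE
    Y_t = xi + \\int_t^T (a*Y_s + b) ds   (Z = 0, no noise).
    Each sweep integrates the previous iterate backward on a time grid
    and records the sup-norm error against the closed-form solution."""
    t = np.linspace(0.0, T, N + 1)
    dt = T / N
    y = np.full(N + 1, xi)                         # iterate 0: constant guess
    exact = np.exp(a * (T - t)) * xi + (b / a) * (np.exp(a * (T - t)) - 1.0)
    errs = []
    for _ in range(iters):
        integrand = a * y + b
        # cumulative trapezoid: tail[i] = \int_0^{t_i} integrand ds
        tail = np.concatenate(([0.0],
                np.cumsum((integrand[:-1] + integrand[1:]) * dt / 2)))
        y = xi + (tail[-1] - tail)                 # \int_t^T = \int_0^T - \int_0^t
        errs.append(np.max(np.abs(y - exact)))
    return errs

errs = picard_bsde()
print(errs[0], errs[-1])  # sup-norm error shrinks geometrically across sweeps
```

The contraction factor here is roughly $aT$; with $aT = 0.5$ a dozen sweeps already push the iterate down to the grid's discretization floor.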

Why It Matters

This is the foundational well-posedness result. Without it, BSDEs would be formal manipulations with no guarantee that the equations they appear in have meaning. Three downstream consequences. First, the nonlinear Feynman–Kac representation (next theorem) inherits existence-uniqueness from this result; without it, the connection to semilinear PDEs would be one-sided. Second, the comparison principle is the BSDE analog of the maximum principle for parabolic PDEs and is the foundation of $g$-expectations and BSDE-based risk measures. Third, the contraction estimate is what licenses Picard iteration as a numerical scheme, and in the deep BSDE method it is what ties the forward-shooting loss to the unique true solution once the gradient network is expressive enough. In one line: there exists a unique pair $(Y, Z) \in S^2 \times H^2$ solving the BSDE.

Failure Mode

The Lipschitz hypothesis on $f$ is essential for the contraction step. Drivers with quadratic growth in $z$ (e.g., $f(t, y, z) = \tfrac{1}{2}|z|^2 + g(t, y)$, arising in exponential utility maximization) fall outside Pardoux–Peng. Existence in that regime requires a different proof technique due to Kobylanski (2000), based on an a priori sup-norm bound for $Y$ and an exponential transformation. Uniqueness is harder still and was settled only later (Briand and Hu 2008, Delbaen et al. 2011). The clean BSDE theory is the Lipschitz case; quadratic BSDEs are a separate and substantially more involved chapter.

Nonlinear Feynman–Kac

The reason BSDEs matter for PDE theory is the Markovian case, in which the terminal value and driver are functions of a forward SDE. Pardoux and Peng (1992) proved that the BSDE solution is then exactly the solution of a semilinear parabolic PDE.

Theorem

Nonlinear Feynman–Kac (Pardoux–Peng 1992)

Statement

Let $X = X^{t,x}$ solve the forward SDE $dX_s = b(s, X_s)\,ds + \sigma(s, X_s)\,dB_s$ on $[t, T]$ with $X_t = x$, and let $(Y^{t,x}, Z^{t,x})$ solve the associated BSDE on $[t, T]$ with terminal value $g(X_T)$ and driver $h(s, X_s, Y_s, Z_s)$. Define $u(t, x) = Y^{t,x}_t$. Then $u$ is the unique viscosity solution of the semilinear parabolic PDE

$$\partial_t u + \mathcal{L} u + h(t, x, u, \sigma^\top \nabla u) = 0, \qquad u(T, x) = g(x),$$

where $\mathcal{L} u = b \cdot \nabla u + \tfrac{1}{2}\operatorname{Tr}(\sigma \sigma^\top \nabla^2 u)$ is the generator of $X$. When $u \in C^{1,2}$, the BSDE solution along the path admits the Markovian representation $Y^{t,x}_s = u(s, X_s)$ and $Z^{t,x}_s = \sigma^\top(s, X_s)\,\nabla u(s, X_s)$ for $s \in [t, T]$.

Intuition

Apply Itô's formula to $u(s, X_s)$. The drift bracket $\partial_s u + \mathcal{L} u$ is forced by the PDE to equal $-h(s, X_s, u, \sigma^\top \nabla u)$, and the diffusion bracket $\sigma^\top \nabla u$ is the integrand against $dB_s$. Comparing with the BSDE $dY_s = -h(s, X_s, Y_s, Z_s)\,ds + Z_s^\top\,dB_s$ identifies $Y_s = u(s, X_s)$ and $Z_s = \sigma^\top(s, X_s)\,\nabla u(s, X_s)$. The terminal condition $u(T, X_T) = g(X_T) = \xi$ matches automatically.
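Written out, the Itô computation looks like this (assuming $u \in C^{1,2}$ and $X$ driven by drift $b$ and diffusion $\sigma$):

```latex
% Itô's formula along X_s for u \in C^{1,2}:
du(s, X_s) = \bigl(\partial_s u + \mathcal{L} u\bigr)(s, X_s)\,ds
           + \nabla u(s, X_s)^\top \sigma(s, X_s)\,dB_s

% Substitute the PDE  \partial_s u + \mathcal{L} u = -h(s, x, u, \sigma^\top \nabla u):
du(s, X_s) = -h\bigl(s, X_s, u(s, X_s), \sigma^\top \nabla u(s, X_s)\bigr)\,ds
           + \bigl(\sigma^\top(s, X_s)\,\nabla u(s, X_s)\bigr)^\top dB_s

% Matching term by term with  dY_s = -h(s, X_s, Y_s, Z_s)\,ds + Z_s^\top dB_s:
Y_s = u(s, X_s), \qquad Z_s = \sigma^\top(s, X_s)\,\nabla u(s, X_s).
```

Both the drift and diffusion brackets must match separately, which is what pins down $Y$ and $Z$ simultaneously.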

Proof Sketch

The forward direction (PDE solution gives BSDE solution) is the Itô calculation just sketched. For the reverse direction (BSDE solution gives PDE viscosity solution), one shows that $u(t, x) = Y^{t,x}_t$ inherits continuity in $(t, x)$ from the BSDE's continuous dependence on initial data, then verifies the viscosity sub- and super-solution inequalities via a Markov-property argument and the comparison principle for BSDEs. Pardoux and Peng (1992) handle the smooth case; Pardoux (1999) and Barles, Buckdahn, and Pardoux (1997) extend to viscosity solutions and to PDEs with reflection or jumps.

Why It Matters

This is the nonlinear Feynman–Kac formula. It generalizes the linear formula $u(t, x) = \mathbb{E}[g(X_T) \mid X_t = x]$ to PDEs whose lower-order term depends on $u$ and $\nabla u$ themselves. The expectation representation breaks (you cannot integrate against an unknown $u$), and the BSDE replaces it with an implicit fixed-point representation. When $h$ is convex in $z$ (the typical situation in stochastic control), the PDE above is the Hamilton–Jacobi–Bellman equation of a control problem and the BSDE is its dual. This is the mathematical content of the duality between forward stochastic control and backward representation that Bismut (1973) first identified. In one line: the BSDE solution is $Y_t = u(t, X_t)$ and $Z_t = \sigma^\top(t, X_t)\,\nabla u(t, X_t)$, where $u$ solves the semilinear parabolic PDE $\partial_t u + \mathcal{L} u + h(t, x, u, \sigma^\top \nabla u) = 0$ with terminal condition $u(T, x) = g(x)$.

Failure Mode

Fully nonlinear PDEs (those involving $\nabla^2 u$ inside the nonlinearity, like Monge–Ampère or the second-order HJB with controlled diffusion) are not covered. The natural representation there is the second-order BSDE of Cheridito, Soner, Touzi, and Victoir (2007) and Soner, Touzi, and Zhang (2012), which adds a third process $\Gamma$ representing the Hessian. The clean Pardoux–Peng theory is restricted to semilinear PDEs where the second-order term is fixed by the forward diffusion.

The Y/Z Decomposition

In the Markovian case, the BSDE solution decomposes cleanly: $Y_t = u(t, X_t)$ is the value function evaluated along the path, and $Z_t = \sigma^\top(t, X_t)\,\nabla u(t, X_t)$ is the diffusion-weighted gradient of the value function. The two processes carry complementary information: $Y$ tracks the level of $u$ along the trajectory, and $Z$ tracks the slope.

This decomposition is what the deep BSDE method exploits. The algorithm parameterizes $Y_0$ as a single trainable scalar (the unknown PDE solution at the initial point) and $Z_{t_k}$ as a neural network $\phi_{\theta_k}: \mathbb{R}^d \to \mathbb{R}^d$ at each time step $t_k$. The forward Euler update $Y_{t_{k+1}} = Y_{t_k} - f(t_k, Y_{t_k}, Z_{t_k})\,\Delta t + Z_{t_k}^\top \Delta B_k$ propagates the candidate trajectory, and a terminal $L^2$ loss $\mathbb{E}[|Y_{t_N} - g(X_{t_N})|^2]$ drives optimization. The $d$-dependence enters polynomially through the network input dimension, never through a spatial grid.
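The rollout itself can be sketched without any training, by plugging in an exact $Y_0$ and $Z$ where the trained network would sit. The test case below is an assumption chosen for checkability: the $d$-dimensional heat equation $\partial_t u + \tfrac{1}{2}\Delta u = 0$ with $g(x) = |x|^2$, $\sigma = I$, and driver $f = 0$, where $u(t, x) = |x|^2 + d(T - t)$ and hence $Z_t = \nabla u = 2 X_t$ exactly. With the exact processes, the terminal loss collapses to pure Euler discretization noise.

```python
import numpy as np

def forward_euler_bsde(rng, d=10, T=1.0, N=100, M=4096):
    """Deep-BSDE forward Euler rollout with the exact Y_0 and Z substituted
    for the trained network, for the heat equation u_t + 0.5*Lap(u) = 0 with
    g(x) = |x|^2. Exact solution: u(t, x) = |x|^2 + d*(T - t), grad u = 2x."""
    dt = T / N
    x = np.zeros((M, d))                  # M forward paths, all started at X_0 = 0
    y = np.full(M, d * T)                 # exact Y_0 = u(0, 0) = d*T
    for _ in range(N):
        dB = rng.normal(0.0, np.sqrt(dt), size=(M, d))
        z = 2.0 * x                       # exact Z_{t_k} = grad u(t_k, X_{t_k})
        y = y + np.sum(z * dB, axis=1)    # Euler update; driver f = 0 here
        x = x + dB                        # forward SDE: dX = dB (sigma = I, b = 0)
    g = np.sum(x**2, axis=1)              # terminal condition g(X_T)
    return np.mean((y - g)**2)            # empirical terminal L2 loss

rng = np.random.default_rng(0)
loss = forward_euler_bsde(rng)
print(loss)  # small relative to the solution scale u(0,0)^2 = (d*T)^2
```

What survives in the loss is exactly the per-step Euler error $|\Delta B_k|^2 - d\,\Delta t$, a mean-zero quantity of variance $O(\Delta t)$ in total; in the real algorithm this same loss is what gradient descent drives down over $Y_0$ and the network weights.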

Worked Example: Linear BSDE Recovers Discounted Feynman–Kac

Take a driver linear in $y$ and independent of $z$: $f(s, y, z) = -c(s) y - h(s)$ for deterministic functions $c, h$. The BSDE is

$$-dY_s = \bigl(-c(s) Y_s - h(s)\bigr)\,ds - Z_s^\top\,dB_s, \qquad Y_T = \xi.$$

This is a stochastic linear ODE in $Y$ with random terminal condition. Apply the integrating factor $e^{\int_t^s c(r)\,dr}$ and rearrange to get $Y_t = \mathbb{E}\bigl[\Lambda_{t,T}\,\xi + \int_t^T \Lambda_{t,s}\, h(s)\,ds \bigm| \mathcal{F}_t\bigr]$, where $\Lambda_{t,s} = \exp(-\int_t^s c(r)\,dr)$ is a discount factor. In Markovian form with $\xi = g(X_T)$ and all coefficients depending on $X$, this is exactly the discounted Feynman–Kac formula of the Feynman–Kac topic page: $Y_t = u(t, X_t)$ where $u$ solves $\partial_t u + \mathcal{L} u - c\,u + h = 0$ with $u(T, x) = g(x)$. The $Z$ process is recovered as $Z_t = \sigma^\top(t, X_t)\,\nabla u(t, X_t)$, the integrand in the martingale representation of $Y_t$ against $B_t$.

The lesson: the linear BSDE is exactly the discounted Feynman–Kac formula in disguise. The BSDE machinery is non-trivial only when $f$ depends on $y$ or $z$ in a genuinely nonlinear way.
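A quick Monte Carlo sanity check of the discounted representation, under illustrative assumptions: constant $c$ and $h$, one dimension, and terminal value $\xi = B_T^2$ (so $\mathbb{E}[\xi] = T$ and the closed form is elementary).

```python
import numpy as np

# Linear BSDE with driver f(s, y, z) = -c*y - h (c, h constants) and
# terminal value xi = B_T^2. The integrating-factor formula gives
#   Y_0 = E[ e^{-cT} * xi ] + h * (1 - e^{-cT}) / c,
# and E[B_T^2] = T makes the right-hand side explicit.
c, h, T = 0.3, 1.0, 1.0
rng = np.random.default_rng(42)
B_T = rng.normal(0.0, np.sqrt(T), size=1_000_000)   # Brownian endpoints

mc = np.mean(np.exp(-c * T) * B_T**2) + h * (1.0 - np.exp(-c * T)) / c
closed = np.exp(-c * T) * T + h * (1.0 - np.exp(-c * T)) / c
print(mc, closed)   # the two agree to Monte Carlo accuracy
```

Nothing here needed the $Z$ process explicitly, which is the point: for a linear driver the BSDE is an expectation in disguise, and $Z$ is only the martingale-representation integrand recovered after the fact.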

Common Confusions

Watch Out

The 'backward' in BSDE refers to the terminal condition, not pathwise time reversal

A BSDE is not a forward SDE run with the time direction flipped. The filtration is still the forward Brownian filtration $\mathcal{F}_t = \sigma(B_s : s \le t)$, and the stochastic integral $\int_t^T Z_s^\top\,dB_s$ is a forward Itô integral. What is "backward" is the location of the boundary condition: instead of an initial condition $Y_0 = y_0$, the equation imposes a terminal condition $Y_T = \xi$. This is closer in spirit to a parabolic PDE solved backward from a Cauchy datum at time $T$ than to a time-reversed SDE in the Anderson sense. The two notions are unrelated.

Watch Out

The Z process is not a free parameter; it is forced by martingale representation

The unknowns of a BSDE are both $Y$ and $Z$, but they are not independent. Once $Y$ is required to be adapted and to satisfy the integral equation with terminal value $\xi$, the martingale representation theorem forces $Z$ to be the integrand of the martingale $M_t = \mathbb{E}[\xi + \int_0^T f\,ds \mid \mathcal{F}_t]$ against $B$. The pair $(Y, Z)$ is jointly determined; you do not get to choose $Z$ separately. This is also why the BSDE is well-posed: the extra unknown $Z$ is exactly compensated by the extra structural constraint that $Y$ be adapted.

Watch Out

Quadratic-growth drivers are genuinely outside Pardoux–Peng

Pardoux–Peng requires $f$ to be uniformly Lipschitz in $(y, z)$. Drivers with quadratic growth in $z$, common in entropic risk measures and exponential utility, violate the Lipschitz hypothesis and require the separate theory of Kobylanski (2000). The trick there is an exponential transformation $\tilde{Y}_t = \exp(\eta Y_t)$ that linearizes the quadratic term, plus an a priori sup-norm bound on $Y$ from the boundedness of $\xi$. Existence holds; uniqueness is much harder and depends on additional structural assumptions on $f$. Treating quadratic BSDEs as a "slight extension" of Lipschitz BSDEs underestimates the difficulty.

Exercises

Exercise (Core)

Problem

Solve the linear BSDE $-dY_t = (a Y_t + b)\,dt - Z_t\,dB_t$ on $[0, T]$ with terminal condition $Y_T = \xi$, where $a, b \in \mathbb{R}$ are constants and $\xi \in L^2(\mathcal{F}_T)$. Give explicit formulas for $Y_t$ and $Z_t$.

Exercise (Advanced)

Problem

Prove the contraction step in Pardoux–Peng. Let $\Phi: H^2 \times H^2 \to H^2 \times H^2$ be the map defined in the proof sketch above. Equip the codomain with the $\beta$-weighted norm $\|(Y, Z)\|_\beta^2 = \mathbb{E}[\int_0^T e^{\beta t}(|Y_t|^2 + |Z_t|^2)\,dt]$. Show that for $\beta$ large enough (depending on the Lipschitz constant $K$ of $f$), $\Phi$ is a strict contraction in this norm.


Next Topics

  • Deep BSDE Method: the Han–Jentzen–E neural-network solver that parameterizes $Z_{t_k}$ with a network at each time step and minimizes a terminal-condition $L^2$ loss.
  • Hamilton–Jacobi–Bellman Equation: the PDE that arises when the BSDE driver $f$ is a control Hamiltonian; the canonical setting where BSDE duality replaces dynamic programming.
  • Feynman–Kac Formula: the linear case that BSDEs generalize; the BSDE collapses to the discounted Feynman–Kac expectation when the driver is linear.
  • Stochastic Differential Equations: the forward equation whose path the BSDE is solved along in the Markovian case.
  • Itô's Lemma: the chain rule that produces the Markovian BSDE from a $C^{1,2}$ solution of the associated semilinear PDE.

Last reviewed: April 18, 2026
