

Policy Representations

How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.


Why This Matters

A policy maps states to actions. The way you represent this mapping determines what gradients look like, how exploration happens, and what algorithms are applicable. Choosing a categorical distribution for discrete actions or a Gaussian for continuous actions is not merely a software decision. It shapes the optimization landscape, the variance of gradient estimates, and whether the policy can explore effectively.

Mental Model

Think of a policy as a conditional probability distribution $\pi(a \mid s)$. For discrete actions, this is a categorical distribution: a vector of probabilities over finitely many choices. For continuous actions, this is typically a Gaussian: output a mean and standard deviation, then sample. The choice of distribution family determines how the policy explores (through its stochasticity) and how gradient information flows from returns back to parameters.

Deterministic vs. Stochastic Policies

Definition

Deterministic Policy

A deterministic policy maps each state to a single action: $a = \mu(s)$. There is no randomness. Used in DPG (deterministic policy gradient) and DDPG. Requires separate exploration mechanisms (e.g., additive noise) since the policy itself does not explore.

Definition

Stochastic Policy

A stochastic policy outputs a probability distribution over actions. Actions are sampled: $a \sim \pi(\cdot \mid s)$. The stochasticity serves dual purposes: exploration (trying different actions) and enabling the policy gradient theorem (which requires differentiable log-probabilities).

Stochastic policies are standard for policy gradient methods. Deterministic policies are used when the action space is continuous and you want to avoid the variance introduced by sampling.

Categorical Policies (Discrete Actions)

For an action space $A = \{1, 2, \ldots, K\}$, a categorical policy outputs logits $z(s) \in \mathbb{R}^K$ from a neural network, then applies softmax:

$$\pi(a = k \mid s) = \frac{\exp(z_k(s))}{\sum_{j=1}^{K} \exp(z_j(s))}$$

The log-probability is $\log \pi(a = k \mid s) = z_k(s) - \log \sum_j \exp(z_j(s))$, which is differentiable with respect to the network parameters. Actions are sampled from the resulting categorical distribution.

Temperature scaling. Dividing the logits by a temperature $\tau > 0$ before the softmax controls the entropy of the policy. As $\tau \to 0$, the policy becomes deterministic (greedy); as $\tau \to \infty$, it becomes uniform. This provides a knob for trading off exploration and exploitation.
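The categorical pieces above (temperature-scaled softmax, log-probability, and score function) fit in a few lines of NumPy. This is a minimal sketch; the function names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over logits z (tau > 0)."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                      # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_prob(z, k, tau=1.0):
    """log pi(a = k | s) = z_k / tau - logsumexp(z / tau)."""
    z = np.asarray(z, dtype=float) / tau
    m = z.max()
    return z[k] - (m + np.log(np.exp(z - m).sum()))

def score(z, k, tau=1.0):
    """Gradient of log pi(a = k | s) w.r.t. the logits: (one_hot(k) - pi) / tau."""
    p = softmax(z, tau)
    g = -p / tau
    g[k] += 1.0 / tau
    return g
```

Note that the score vector always sums to zero: increasing the probability of the sampled action necessarily decreases the others.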

Gaussian Policies (Continuous Actions)

For continuous action spaces $A \subseteq \mathbb{R}^d$, the standard parameterization is a diagonal Gaussian:

$$\pi(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \operatorname{diag}(\sigma_\theta(s)^2)\right)$$

where $\mu_\theta(s) \in \mathbb{R}^d$ and $\sigma_\theta(s) \in \mathbb{R}^d_{>0}$ are outputs of a neural network with parameters $\theta$. The log-probability is:

$$\log \pi(a \mid s) = -\frac{1}{2} \sum_{i=1}^{d} \left[ \frac{(a_i - \mu_i(s))^2}{\sigma_i(s)^2} + \log\!\left(2\pi \sigma_i(s)^2\right) \right]$$

State-dependent vs. fixed variance. Some implementations output only $\mu_\theta(s)$ and use a fixed or state-independent $\sigma$ (a separate learnable parameter not conditioned on $s$). State-dependent variance allows the policy to be confident in familiar states and exploratory in unfamiliar ones.

Squashed Gaussian. For bounded action spaces $[a_{\min}, a_{\max}]$, sample $u \sim \mathcal{N}(\mu, \sigma^2)$ and apply $a = \tanh(u)$; a fixed affine rescaling then maps $[-1, 1]$ to $[a_{\min}, a_{\max}]$. The log-probability requires a change-of-variables correction: $\log \pi(a \mid s) = \log \mathcal{N}(u \mid \mu, \sigma^2) - \sum_i \log(1 - \tanh^2(u_i))$. This is standard in SAC (Soft Actor-Critic).
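The change-of-variables correction can be sanity-checked numerically: after squashing, the density must still integrate to one over $(-1, 1)$. A minimal NumPy sketch of the squashed-Gaussian log-probability (the function name and the small `eps` clamp are illustrative conventions, as in common SAC implementations):

```python
import numpy as np

def squashed_gaussian_log_prob(a, mu, sigma, eps=1e-6):
    """log pi(a | s) for a = tanh(u), u ~ N(mu, sigma^2), diagonal/elementwise.

    Applies the tanh change-of-variables correction
    log pi(a) = log N(u | mu, sigma^2) - sum_i log(1 - tanh^2(u_i)).
    """
    a = np.clip(a, -1 + eps, 1 - eps)          # keep arctanh finite at the boundary
    u = np.arctanh(a)                           # invert the squashing
    log_normal = -0.5 * (((u - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))
    correction = np.log(1 - np.tanh(u) ** 2 + eps)
    return np.sum(log_normal - correction)
```

Integrating $\exp(\log \pi(a))$ on a fine grid over $(-1, 1)$ gives mass $\approx 1$, confirming the Jacobian term is right; omitting the correction would not.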

The Reparameterization Trick

Definition

Reparameterization Trick

Instead of sampling $a \sim \pi_\theta(\cdot \mid s)$ directly, write $a = g_\theta(s, \epsilon)$ where $\epsilon \sim p(\epsilon)$ is a fixed noise distribution independent of $\theta$. For a Gaussian policy: $a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This makes $a$ a deterministic, differentiable function of $\theta$ for a given $\epsilon$, enabling low-variance gradient estimation via backpropagation through the sampling operation.

The reparameterization trick is critical for continuous policies. Without it, gradients must use the score-function estimator (REINFORCE), which has high variance. With reparameterization, the gradient flows through $g_\theta$ directly, often reducing variance by orders of magnitude.
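For a Gaussian, $g_\theta$ is just an affine map of the noise, so its pathwise derivatives are trivial: $\partial a / \partial \mu = 1$ and $\partial a / \partial \sigma = \epsilon$. A minimal sketch, verified with finite differences (no autodiff framework assumed):

```python
import numpy as np

def sample_reparam(mu, sigma, eps):
    """Reparameterized Gaussian sample: a = g_theta(eps) = mu + sigma * eps."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
eps = rng.standard_normal()          # fixed noise draw, independent of parameters
mu, sigma, h = 0.3, 0.8, 1e-6

# For this fixed eps, a is an ordinary differentiable function of (mu, sigma);
# central finite differences recover da/dmu = 1 and da/dsigma = eps.
dmu = (sample_reparam(mu + h, sigma, eps) - sample_reparam(mu - h, sigma, eps)) / (2 * h)
dsig = (sample_reparam(mu, sigma + h, eps) - sample_reparam(mu, sigma - h, eps)) / (2 * h)
```

In a deep RL library the same effect comes from backpropagating through the sampling expression (e.g. an `rsample`-style op) rather than treating the sample as a constant.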

Policy Gradient and Representation

Theorem

Score Function Policy Gradient

Statement

For a stochastic policy $\pi_\theta(a \mid s)$, the gradient of the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right]$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$

where $Q^{\pi_\theta}(s, a)$ is the action-value function under $\pi_\theta$.

Intuition

The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ (the score function) points in the direction that increases the probability of action $a$. Weighting by $Q^{\pi_\theta}(s, a)$ means: increase the probability of actions that lead to high returns, decrease it for those that lead to low returns. The policy representation determines the shape of the score function and therefore of the gradient signal.

Proof Sketch

By the log-derivative trick, $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$. Write $J(\theta)$ as an expectation over trajectories and differentiate under the expectation (valid under standard regularity assumptions). The probability of a trajectory factorizes into a product of transition probabilities and policy probabilities; only the policy terms depend on $\theta$, giving the sum of score functions weighted by future returns.

Why It Matters

This theorem is the foundation of REINFORCE, PPO, A2C, and all policy gradient methods. The variance of the gradient estimate depends directly on the policy representation: categorical policies produce discrete score functions, Gaussian policies produce continuous ones, and the reparameterization trick provides an alternative lower-variance estimator for the Gaussian case.

Failure Mode

The score function estimator has high variance, especially for long horizons or large action spaces. Without baselines or variance reduction, the gradient signal is drowned in noise. For continuous actions, the reparameterization trick (when applicable) is strongly preferred over the score function estimator.
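The theorem can be checked numerically in the simplest possible setting, a one-step bandit, where the exact gradient is available in closed form. For a softmax policy over two arms with rewards $r_k$, $J(z) = \sum_k \pi_k r_k$ and $\partial J / \partial z_j = \pi_j (r_j - J)$. A Monte Carlo sketch (the setup is mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([0.5, -0.5])           # logits for a 2-armed bandit policy
r = np.array([1.0, 0.0])            # deterministic reward for each arm

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(z)
J = pi @ r                          # expected return of the policy
analytic = pi * (r - J)             # exact gradient: dJ/dz_j = pi_j (r_j - J)

# Score-function (REINFORCE) estimate: average of grad log pi(a) * r(a)
n = 200_000
actions = rng.choice(2, size=n, p=pi)
scores = np.eye(2)[actions] - pi    # score of each sampled action w.r.t. logits
estimate = (scores * r[actions, None]).mean(axis=0)
```

With enough samples the Monte Carlo estimate matches the closed-form gradient; with few samples it is dominated by noise, which is exactly the variance problem the Failure Mode describes.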

How Representation Affects Gradient Variance

Categorical policies: The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ changes sign depending on which action was sampled. With $K$ actions, most of the probability mass may sit on one action, making gradient estimates for rarely sampled actions noisy. Entropy regularization helps by keeping the policy from becoming too peaked.

Gaussian policies with REINFORCE: The score function is $(a - \mu)/\sigma^2$ for the mean and $((a - \mu)^2 - \sigma^2)/\sigma^3$ for the standard deviation. High-magnitude actions produce large score values, inflating gradient variance.

Gaussian policies with reparameterization: The gradient $\nabla_\theta Q(s, \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)$ is computed by backpropagation. Its variance depends on the smoothness of $Q$ rather than on the score function, and is typically much lower.
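The variance gap is easy to measure on a toy objective. Take $J(\mu) = \mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}[Q(a)]$ with $Q(a) = a^2$ (my choice for illustration), so the true gradient is $dJ/d\mu = 2\mu$. Both estimators below are unbiased, but for $\mu = 0$, $\sigma = 1$ the score-function estimator has variance $\mathbb{E}[a^6] = 15$ while the pathwise estimator has variance $4$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 100_000

# Score-function (REINFORCE) estimator: (d/dmu log pi(a)) * Q(a)
a = rng.normal(mu, sigma, n)
score_est = (a - mu) / sigma**2 * a**2

# Pathwise (reparameterized) estimator: d/dmu Q(mu + sigma * eps) = 2 (mu + sigma * eps)
eps = rng.standard_normal(n)
reparam_est = 2.0 * (mu + sigma * eps)

# Both sample means are near the true gradient (0 here), but the score-function
# estimator's variance is several times larger.
```

For smoother or flatter $Q$ the gap widens further, which is why reparameterization is preferred whenever the action is a differentiable function of the noise.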

Common Confusions

Watch Out

Softmax temperature is not the same as entropy regularization

Temperature scaling ($z/\tau$) changes the entropy of the policy at sampling time. Entropy regularization adds $\alpha H(\pi(\cdot \mid s))$ to the objective. They have similar effects (preventing premature convergence to deterministic policies) but operate differently: temperature is a hyperparameter applied to the logits, while entropy regularization adds a gradient signal that pushes the policy toward stochasticity.
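The temperature side of this distinction is purely a change of the sampling distribution: for fixed logits, entropy grows monotonically with $\tau$ toward the uniform maximum $\log K$, with no gradient signal involved. A small numerical check (helper name is mine):

```python
import numpy as np

def entropy_at_temperature(z, tau):
    """Entropy H(pi) of the policy pi = softmax(z / tau)."""
    y = z / tau
    p = np.exp(y - y.max())          # stable softmax
    p /= p.sum()
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.5])
# Entropy rises with tau and approaches log K = log 3 as tau -> infinity.
H = [entropy_at_temperature(z, t) for t in (0.1, 1.0, 10.0)]
```

Entropy regularization, by contrast, would add $\alpha \nabla_\theta H$ to the policy update so that the logits themselves are trained toward higher entropy.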

Watch Out

The reparameterization trick does not apply to categorical policies

Categorical distributions are not reparameterizable because the sample is discrete and the mapping from noise to action is not differentiable. Gumbel-Softmax provides a continuous relaxation that approximates reparameterization for discrete actions, but it is biased and requires a temperature parameter. For exact discrete policy gradients, the score function estimator is necessary.
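A minimal sketch of the Gumbel-Softmax relaxation makes the temperature dependence concrete: the relaxed sample always lies on the probability simplex, and for the same noise draw it sharpens toward a one-hot vector as $\tau \to 0$ (the function name and setup are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, g, tau):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau).

    Lies on the simplex; approaches one-hot as tau -> 0, and is
    differentiable in the logits for any fixed noise g and tau > 0.
    """
    y = (logits + g) / tau
    y = y - y.max()                  # numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.5])
g = -np.log(-np.log(rng.uniform(size=3)))   # one Gumbel(0, 1) noise draw
soft = gumbel_softmax(z, g, tau=1.0)         # smooth relaxed sample
hard = gumbel_softmax(z, g, tau=0.01)        # nearly one-hot, same noise
```

The bias mentioned above comes from using this relaxed sample in place of a true discrete sample; only in the $\tau \to 0$ limit does it recover the exact categorical draw, at the cost of exploding gradients.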

Key Takeaways

  • Categorical policies (softmax over logits) are standard for discrete action spaces
  • Diagonal Gaussian policies (learned mean and variance) are standard for continuous action spaces
  • Squashed Gaussians handle bounded action spaces via tanh transformation
  • The reparameterization trick reduces gradient variance for continuous policies by making sampling differentiable
  • Policy representation directly affects gradient variance, exploration behavior, and algorithm choice
  • Temperature and entropy regularization prevent premature convergence to deterministic policies

Exercises

ExerciseCore

Problem

A categorical policy over 3 actions outputs logits $z = (2.0, 1.0, 0.5)$. Compute the action probabilities and the score function $\nabla_z \log \pi(a = 1 \mid s)$.

ExerciseAdvanced

Problem

For a 1D Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a \mid \mu, \sigma^2)$ with $\mu = 0$ and $\sigma = 1$, compute the variance of the score-function estimator $\nabla_\mu \log \pi(a \mid s) \cdot a$ where $a \sim \pi$. Compare it to the variance of the reparameterization gradient $\nabla_\mu Q(\mu + \epsilon)$ where $Q(a) = a$ and $\epsilon \sim \mathcal{N}(0, 1)$.

References

Canonical:

  • Sutton and Barto, Reinforcement Learning: An Introduction (2018), Sections 13.1-13.4
  • Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning (1992)

Current:

  • Haarnoja et al., Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor, ICML (2018)
  • Schulman et al., Proximal Policy Optimization Algorithms, arXiv (2017)
  • Kingma and Welling, Auto-Encoding Variational Bayes, ICLR (2014), Section 2.4 (reparameterization trick)


Last reviewed: April 2026
