Policy Representations
How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.
Why This Matters
A policy maps states to actions. The way you represent this mapping determines what gradients look like, how exploration happens, and what algorithms are applicable. Choosing a categorical distribution for discrete actions or a Gaussian for continuous actions is not merely a software decision. It shapes the optimization landscape, the variance of gradient estimates, and whether the policy can explore effectively.
Mental Model
Think of a policy as a conditional probability distribution $\pi_\theta(a \mid s)$. For discrete actions, this is a categorical distribution: a vector of probabilities over finitely many choices. For continuous actions, this is typically a Gaussian: output a mean and standard deviation, then sample. The choice of distribution family determines how the policy explores (through its stochasticity) and how gradient information flows from returns back to parameters.
Deterministic vs. Stochastic Policies
Deterministic Policy
A deterministic policy maps each state to a single action: $a = \mu_\theta(s)$. There is no randomness. Used in DPG (deterministic policy gradient) and DDPG. Requires a separate exploration mechanism (e.g., additive noise) since the policy itself does not explore.
Stochastic Policy
A stochastic policy outputs a probability distribution over actions. Actions are sampled: $a \sim \pi_\theta(\cdot \mid s)$. The stochasticity serves two purposes: exploration (trying different actions) and enabling the policy gradient theorem (which requires differentiable log-probabilities).
Stochastic policies are standard for policy gradient methods. Deterministic policies are used when the action space is continuous and you want to avoid the variance introduced by sampling.
Categorical Policies (Discrete Actions)
For an action space $\mathcal{A} = \{1, \dots, K\}$, a categorical policy outputs a logit vector $z_\theta(s) \in \mathbb{R}^K$ from a neural network, then applies softmax:

$$\pi_\theta(a \mid s) = \frac{\exp(z_\theta(s)_a)}{\sum_{a'=1}^{K} \exp(z_\theta(s)_{a'})}$$
The log-probability is $\log \pi_\theta(a \mid s) = z_\theta(s)_a - \log \sum_{a'} \exp(z_\theta(s)_{a'})$, which is differentiable with respect to the network parameters. Actions are sampled from the resulting categorical distribution.
Temperature scaling. Dividing logits by a temperature $\tau > 0$ before softmax controls the entropy of the policy. As $\tau \to 0$, the policy becomes deterministic (greedy). As $\tau \to \infty$, the policy becomes uniform. This provides a knob for trading off exploration and exploitation.
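A minimal numpy sketch of a temperature-scaled categorical policy (the logit values here are made up for illustration):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a max-shifted softmax
    # for numerical stability.
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical network outputs

probs = softmax(logits)                    # tau = 1: standard softmax
cold  = softmax(logits, temperature=0.1)   # tau -> 0: near-greedy
hot   = softmax(logits, temperature=10.0)  # tau -> inf: near-uniform

# Low temperature concentrates mass on the argmax; high temperature
# flattens the distribution toward uniform.
assert cold[0] > probs[0] > hot[0]
```

Sampling an action is then a single categorical draw, e.g. `np.random.default_rng().choice(3, p=probs)`.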
Gaussian Policies (Continuous Actions)
For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^d$, the standard parameterization is a diagonal Gaussian:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \operatorname{diag}(\sigma_\theta(s)^2)\big)$$
where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are outputs of a neural network with parameters $\theta$. The log-probability is:

$$\log \pi_\theta(a \mid s) = -\sum_{i=1}^{d} \left[ \frac{(a_i - \mu_{\theta,i}(s))^2}{2\,\sigma_{\theta,i}(s)^2} + \log \sigma_{\theta,i}(s) \right] - \frac{d}{2} \log 2\pi$$
State-dependent vs. fixed variance. Some implementations output only $\mu_\theta(s)$ and use a fixed or state-independent $\sigma$ (a separate learnable parameter not conditioned on $s$). State-dependent variance allows the policy to be confident in familiar states and exploratory in unfamiliar ones.
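The log-probability formula above can be checked with a few lines of numpy (the mean, standard deviation, and action values are made up for illustration):

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma):
    # Diagonal Gaussian log-density: sum over action dimensions of
    # -0.5*((a - mu)/sigma)^2 - log(sigma) - 0.5*log(2*pi).
    return np.sum(
        -0.5 * ((a - mu) / sigma) ** 2
        - np.log(sigma)
        - 0.5 * np.log(2 * np.pi)
    )

mu = np.array([0.5, -0.2])    # hypothetical network mean output
sigma = np.array([0.3, 0.3])  # hypothetical state-independent std
a = np.array([0.4, -0.1])     # a sampled 2-D action

lp = gaussian_log_prob(a, mu, sigma)
# The density peaks at a = mu, so any other action has lower log-prob.
assert gaussian_log_prob(mu, mu, sigma) > lp
```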
Squashed Gaussian. For bounded action spaces $\mathcal{A} = [-1, 1]^d$, sample $u \sim \mathcal{N}(\mu_\theta(s), \operatorname{diag}(\sigma_\theta(s)^2))$ and apply $a = \tanh(u)$ elementwise. The log-probability requires a change-of-variables correction: $\log \pi_\theta(a \mid s) = \log \mathcal{N}(u; \mu_\theta(s), \sigma_\theta(s)^2) - \sum_{i=1}^{d} \log\big(1 - \tanh^2(u_i)\big)$. This is standard in SAC (Soft Actor-Critic).
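A sketch of the squashed-Gaussian sampling path with its change-of-variables correction (parameter values are made up; the small epsilon inside the log is a common numerical-stability guard, not part of the math):

```python
import numpy as np

def squashed_gaussian_sample(mu, sigma, rng):
    # Sample u ~ N(mu, sigma^2), squash with tanh, and correct the
    # log-probability by the log-Jacobian of tanh:
    # log pi(a|s) = log N(u; mu, sigma) - sum_i log(1 - tanh(u_i)^2)
    u = mu + sigma * rng.standard_normal(mu.shape)
    a = np.tanh(u)
    log_prob_u = np.sum(
        -0.5 * ((u - mu) / sigma) ** 2
        - np.log(sigma)
        - 0.5 * np.log(2 * np.pi)
    )
    correction = np.sum(np.log(1.0 - a**2 + 1e-8))  # Jacobian of tanh
    return a, log_prob_u - correction

rng = np.random.default_rng(0)
mu, sigma = np.array([0.0]), np.array([1.0])
a, lp = squashed_gaussian_sample(mu, sigma, rng)
assert -1.0 < a[0] < 1.0  # tanh keeps the action strictly inside (-1, 1)
```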
The Reparameterization Trick
Reparameterization Trick
Instead of sampling $a \sim \pi_\theta(\cdot \mid s)$ directly, write $a = f_\theta(s, \epsilon)$ where $\epsilon \sim p(\epsilon)$ is a fixed noise distribution independent of $\theta$. For a Gaussian policy: $a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This makes $a$ a deterministic, differentiable function of $\theta$ for a given $\epsilon$, enabling low-variance gradient estimation via backpropagation through the sampling operation.
The reparameterization trick is critical for continuous policies. Without it, gradients must use the score function estimator (REINFORCE), which has high variance. With reparameterization, the gradient flows through $f_\theta$ directly, often reducing variance by orders of magnitude.
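The variance gap can be seen on a toy problem: estimate $\nabla_\mu \mathbb{E}[f(a)]$ for $a \sim \mathcal{N}(\mu, \sigma^2)$ with the illustrative choice $f(a) = a^2$, whose exact gradient is $2\mu$ (since $\mathbb{E}[a^2] = \mu^2 + \sigma^2$). The numbers below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 100_000

eps = rng.standard_normal(n)
a = mu + sigma * eps  # reparameterized sample: deterministic in mu given eps

# Pathwise (reparameterization) estimator: df/da * da/dmu = 2a * 1
reparam_grad = np.mean(2 * a)

# Score-function (REINFORCE) estimator: f(a) * d log N(a; mu, sigma) / dmu
score_grad = np.mean(a**2 * (a - mu) / sigma**2)

# Both estimators are unbiased for 2*mu = 3.0, but the per-sample
# variance of the pathwise estimator is far lower.
assert np.var(2 * a) < np.var(a**2 * (a - mu) / sigma**2)
```

Here the pathwise per-sample variance is $4\sigma^2 = 4$, while the score-function version exceeds 50 for these parameter values.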
Policy Gradient and Representation
Score Function Policy Gradient
Statement
For a stochastic policy $\pi_\theta(a \mid s)$, the gradient of the expected return $J(\theta)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi,\; a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\big]$$
where $Q^\pi(s, a)$ is the action-value function under $\pi_\theta$.
Intuition
The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ (the score function) points in the direction that increases the probability of action $a$. Weighting by $Q^\pi(s, a)$ means: increase the probability of actions that lead to high returns, decrease it for actions that lead to low returns. The policy representation determines the shape of the score function and therefore the gradient signal.
Proof Sketch
By the log-derivative trick, $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$. Write $J(\theta)$ as an expectation over trajectories. Differentiate under the expectation (valid by regularity assumptions). The probability of a trajectory factorizes as a product of transition probabilities and policy probabilities. Only the policy terms depend on $\theta$, giving the sum of score functions weighted by future returns.
Why It Matters
This theorem is the foundation of REINFORCE, PPO, A2C, and all policy gradient methods. The variance of the gradient estimate depends directly on the policy representation: categorical policies produce discrete score functions, Gaussian policies produce continuous ones, and the reparameterization trick provides an alternative lower-variance estimator for the Gaussian case.
Failure Mode
The score function estimator has high variance, especially for long horizons or large action spaces. Without baselines or variance reduction, the gradient signal is drowned in noise. For continuous actions, the reparameterization trick (when applicable) is strongly preferred over the score function estimator.
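The effect of a baseline can be checked empirically on a toy one-state, two-action bandit (the probabilities and rewards below are made up for illustration): subtracting a baseline leaves the gradient's expectation unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.7, 0.3])    # hypothetical softmax policy
rewards = np.array([1.0, 1.2])  # hypothetical per-action returns
n = 200_000

actions = rng.choice(2, size=n, p=probs)
q = rewards[actions]

# Score function for the logit of action 0 under a softmax policy:
# d log pi(a) / d z0 = 1{a = 0} - pi(0)
score = (actions == 0).astype(float) - probs[0]

baseline = probs @ rewards  # expected return under the policy
g_plain = score * q               # plain REINFORCE term
g_base = score * (q - baseline)   # baseline-subtracted term

assert abs(g_plain.mean() - g_base.mean()) < 0.01  # same expectation
assert g_base.var() < g_plain.var()                # lower variance
```

The baseline term has zero expectation because $\mathbb{E}_{a \sim \pi}[\nabla \log \pi(a)] = 0$, so it can only change the variance, not the bias.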
How Representation Affects Gradient Variance
Categorical policies: The score function changes sign depending on which action was sampled. With $K$ actions, most probability mass may be on one action, making gradients for rarely sampled actions noisy. Entropy regularization helps by keeping the policy from becoming too peaked.
Gaussian policies with REINFORCE: The score function is $\nabla_\mu \log \pi = (a - \mu)/\sigma^2$ for the mean and $\nabla_\sigma \log \pi = \big((a - \mu)^2 - \sigma^2\big)/\sigma^3$ for the standard deviation. High-magnitude actions produce large score function values, inflating gradient variance.
Gaussian policies with reparameterization: The gradient is computed by backpropagation through $a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$. Variance depends on the smoothness of the objective as a function of $a$ rather than on the score function. This typically yields much lower variance.
Common Confusions
Softmax temperature is not the same as entropy regularization
Temperature scaling (dividing logits by $\tau$) changes the entropy of the policy at sampling time. Entropy regularization adds a bonus term $\beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)$ to the objective. They have similar effects (preventing premature convergence to deterministic policies) but operate differently: temperature is a hyperparameter applied to the logits, while entropy regularization adds a gradient signal that pushes the policy toward stochasticity.
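The entropy bonus is just the Shannon entropy of the action distribution; a quick numpy check (probability vectors made up for illustration) confirms that a uniform policy maximizes it:

```python
import numpy as np

def entropy(probs):
    # Shannon entropy H(pi) = -sum_a pi(a) log pi(a); the regularized
    # objective adds beta * H(pi) to the return.
    return -np.sum(probs * np.log(probs))

peaked = np.array([0.98, 0.01, 0.01])  # nearly deterministic policy
uniform = np.ones(3) / 3               # maximum-entropy policy

assert entropy(uniform) > entropy(peaked)
# For K actions, the uniform distribution attains H = log(K).
assert abs(entropy(uniform) - np.log(3)) < 1e-9
```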
The reparameterization trick does not apply to categorical policies
Categorical distributions are not reparameterizable because the sample is discrete and the mapping from noise to action is not differentiable. Gumbel-Softmax provides a continuous relaxation that approximates reparameterization for discrete actions, but it is biased and requires a temperature parameter. For exact discrete policy gradients, the score function estimator is necessary.
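A sketch of Gumbel-Softmax sampling under these caveats (logit values are made up; this is the relaxation, not an exact categorical sample):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    # Perturb logits with Gumbel(0, 1) noise, then apply a
    # temperature-controlled softmax. As temperature -> 0 the output
    # approaches a one-hot vector, but the gradient estimator is
    # biased for any positive temperature.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + gumbel) / temperature
    z = z - z.max()  # max-shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
hard = gumbel_softmax_sample(logits, temperature=0.05, rng=rng)

# Both are valid probability vectors; the low-temperature one is
# (usually) concentrated near a single vertex of the simplex.
assert abs(soft.sum() - 1.0) < 1e-9 and abs(hard.sum() - 1.0) < 1e-9
```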
Key Takeaways
- Categorical policies (softmax over logits) are standard for discrete action spaces
- Diagonal Gaussian policies (learned mean and variance) are standard for continuous action spaces
- Squashed Gaussians handle bounded action spaces via tanh transformation
- The reparameterization trick reduces gradient variance for continuous policies by making sampling differentiable
- Policy representation directly affects gradient variance, exploration behavior, and algorithm choice
- Temperature and entropy regularization prevent premature convergence to deterministic policies
Exercises
Problem
A categorical policy over 3 actions outputs logits $z = (z_1, z_2, z_3)$. Compute the action probabilities $\pi(a) = \operatorname{softmax}(z)_a$ and the score function $\nabla_z \log \pi(a)$ for each action $a$.
Problem
For a 1D Gaussian policy with mean $\mu$ and standard deviation $\sigma$, compute the variance of the score function estimator $\hat{g} = f(a)\, \nabla_\mu \log \pi(a)$, where $a \sim \mathcal{N}(\mu, \sigma^2)$. Compare this to the variance of the reparameterization gradient $\hat{g} = f'(a)$, where $a = \mu + \sigma\epsilon$ and $\epsilon \sim \mathcal{N}(0, 1)$.
References
Canonical:
- Sutton and Barto, Reinforcement Learning: An Introduction (2018), Chapters 13.1-13.4
- Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning (1992)
Current:
- Haarnoja et al., Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor, ICML (2018)
- Schulman et al., Proximal Policy Optimization Algorithms, arXiv (2017)
- Kingma and Welling, Auto-Encoding Variational Bayes, ICLR (2014), Section 2.4 (reparameterization trick)
Next Topics
- Options and temporal abstraction: policies over extended time scales
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)