

Policy Representations

How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.


Why This Matters

A policy maps states to actions. The way you represent this mapping determines what gradients look like, how exploration happens, and what algorithms are applicable. Choosing a categorical distribution for discrete actions or a Gaussian for continuous actions is not merely a software decision. It shapes the optimization landscape, the variance of gradient estimates, and whether the policy can explore effectively.

Mental Model

Think of a policy as a conditional probability distribution $\pi(a \mid s)$. For discrete actions, this is a categorical distribution: a vector of probabilities over finitely many choices. For continuous actions, this is typically a Gaussian: output a mean and standard deviation, then sample. The choice of distribution family determines how the policy explores (through its stochasticity) and how gradient information flows from returns back to parameters.

Deterministic vs. Stochastic Policies

Definition

Deterministic Policy

A deterministic policy maps each state to a single action: $a = \mu(s)$. There is no randomness. Used in DPG (deterministic policy gradient) and DDPG. Requires separate exploration mechanisms (e.g., additive noise) since the policy itself does not explore.

Definition

Stochastic Policy

A stochastic policy outputs a probability distribution over actions. Actions are sampled: $a \sim \pi(\cdot \mid s)$. The stochasticity serves dual purposes: exploration (trying different actions) and enabling the policy gradient theorem (which requires differentiable log-probabilities).

Stochastic policies are standard for policy gradient methods. Deterministic policies are used when the action space is continuous and you want to avoid the variance introduced by sampling.

Categorical Policies (Discrete Actions)

For an action space $A = \{1, 2, \ldots, K\}$, a categorical policy outputs logits $z(s) \in \mathbb{R}^K$ from a neural network, then applies softmax:

$$\pi(a = k \mid s) = \frac{\exp(z_k(s))}{\sum_{j=1}^{K} \exp(z_j(s))}$$

The log-probability is $\log \pi(a = k \mid s) = z_k(s) - \log \sum_j \exp(z_j(s))$, which is differentiable with respect to the network parameters. Actions are sampled from the resulting categorical distribution.

Temperature scaling. Dividing the logits by a temperature $\tau > 0$ before the softmax controls the entropy of the policy. As $\tau \to 0$, the policy becomes deterministic (greedy); as $\tau \to \infty$, it becomes uniform. This provides a knob for trading off exploration and exploitation.
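The categorical pieces above (temperature-scaled softmax, log-probability, and score function) fit in a few lines of NumPy. This is a minimal sketch; the function names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over logits z (tau > 0)."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                      # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_prob(z, k, tau=1.0):
    """log pi(a = k | s) = z_k / tau - logsumexp(z / tau)."""
    z = np.asarray(z, dtype=float) / tau
    m = z.max()
    return z[k] - (m + np.log(np.exp(z - m).sum()))

def score(z, k, tau=1.0):
    """Gradient of log pi(a = k | s) w.r.t. the logits: (one_hot(k) - pi) / tau."""
    p = softmax(z, tau)
    g = -p / tau
    g[k] += 1.0 / tau
    return g
```

Note that the score vector always sums to zero: increasing the probability of the sampled action necessarily decreases the others.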

Gaussian Policies (Continuous Actions)

For continuous action spaces $A \subseteq \mathbb{R}^d$, the standard parameterization is a diagonal Gaussian:

$$\pi(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \operatorname{diag}(\sigma_\theta(s)^2)\right)$$

where $\mu_\theta(s) \in \mathbb{R}^d$ and $\sigma_\theta(s) \in \mathbb{R}^d_{>0}$ are outputs of a neural network with parameters $\theta$. The log-probability is:

$$\log \pi(a \mid s) = -\frac{1}{2} \sum_{i=1}^{d} \left[ \frac{(a_i - \mu_i(s))^2}{\sigma_i(s)^2} + \log\!\left(2\pi \sigma_i(s)^2\right) \right]$$

State-dependent vs. fixed variance. Some implementations output only $\mu_\theta(s)$ and use a fixed or state-independent $\sigma$ (a separate learnable parameter not conditioned on $s$). State-dependent variance allows the policy to be confident in familiar states and exploratory in unfamiliar ones.

Squashed Gaussian. For bounded action spaces $[a_{\min}, a_{\max}]$, sample $u \sim \mathcal{N}(\mu, \sigma^2)$ and apply $a = \tanh(u)$; a fixed affine rescaling then maps $[-1, 1]$ to $[a_{\min}, a_{\max}]$. The log-probability requires a change-of-variables correction: $\log \pi(a \mid s) = \log \mathcal{N}(u \mid \mu, \sigma^2) - \sum_i \log(1 - \tanh^2(u_i))$. This is standard in SAC (Soft Actor-Critic).
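The change-of-variables correction can be sanity-checked numerically: after squashing, the density must still integrate to one over $(-1, 1)$. A minimal NumPy sketch of the squashed-Gaussian log-probability (the function name and the small `eps` clamp are illustrative conventions, as in common SAC implementations):

```python
import numpy as np

def squashed_gaussian_log_prob(a, mu, sigma, eps=1e-6):
    """log pi(a | s) for a = tanh(u), u ~ N(mu, sigma^2), diagonal/elementwise.

    Applies the tanh change-of-variables correction
    log pi(a) = log N(u | mu, sigma^2) - sum_i log(1 - tanh^2(u_i)).
    """
    a = np.clip(a, -1 + eps, 1 - eps)          # keep arctanh finite at the boundary
    u = np.arctanh(a)                           # invert the squashing
    log_normal = -0.5 * (((u - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))
    correction = np.log(1 - np.tanh(u) ** 2 + eps)
    return np.sum(log_normal - correction)
```

Integrating $\exp(\log \pi(a))$ on a fine grid over $(-1, 1)$ gives mass $\approx 1$, confirming the Jacobian term is right; omitting the correction would not.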

The Reparameterization Trick

Definition

Reparameterization Trick

Instead of sampling $a \sim \pi_\theta(\cdot \mid s)$ directly, write $a = g_\theta(s, \epsilon)$ where $\epsilon \sim p(\epsilon)$ is a fixed noise distribution independent of $\theta$. For a Gaussian policy: $a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This makes $a$ a deterministic, differentiable function of $\theta$ for a given $\epsilon$, enabling low-variance gradient estimation via backpropagation through the sampling operation.

The reparameterization trick is critical for continuous policies. Without it, gradients must use the score-function estimator (REINFORCE), which has high variance. With reparameterization, the gradient flows through $g_\theta$ directly, often reducing variance by orders of magnitude.
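For a Gaussian, $g_\theta$ is just an affine map of the noise, so its pathwise derivatives are trivial: $\partial a / \partial \mu = 1$ and $\partial a / \partial \sigma = \epsilon$. A minimal sketch, verified with finite differences (no autodiff framework assumed):

```python
import numpy as np

def sample_reparam(mu, sigma, eps):
    """Reparameterized Gaussian sample: a = g_theta(eps) = mu + sigma * eps."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
eps = rng.standard_normal()          # fixed noise draw, independent of parameters
mu, sigma, h = 0.3, 0.8, 1e-6

# For this fixed eps, a is an ordinary differentiable function of (mu, sigma);
# central finite differences recover da/dmu = 1 and da/dsigma = eps.
dmu = (sample_reparam(mu + h, sigma, eps) - sample_reparam(mu - h, sigma, eps)) / (2 * h)
dsig = (sample_reparam(mu, sigma + h, eps) - sample_reparam(mu, sigma - h, eps)) / (2 * h)
```

In a deep RL library the same effect comes from backpropagating through the sampling expression (e.g. an `rsample`-style op) rather than treating the sample as a constant.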

Policy Gradient and Representation

Theorem

Score Function Policy Gradient

Statement

For a stochastic policy $\pi_\theta(a \mid s)$, the gradient of the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right]$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$

where $Q^{\pi_\theta}(s, a)$ is the action-value function under $\pi_\theta$.

Intuition

The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ (the score function) points in the direction that increases the probability of action $a$. Weighting by $Q^{\pi_\theta}(s, a)$ means: increase the probability of actions that lead to high returns, decrease it for those that lead to low returns. The policy representation determines the shape of the score function and therefore of the gradient signal.

Proof Sketch

By the log-derivative trick, $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$. Write $J(\theta)$ as an expectation over trajectories and differentiate under the expectation (valid under standard regularity assumptions). The probability of a trajectory factorizes into a product of transition probabilities and policy probabilities; only the policy terms depend on $\theta$, giving the sum of score functions weighted by future returns.

Why It Matters

This theorem is the foundation of REINFORCE, PPO, A2C, and all policy gradient methods. The variance of the gradient estimate depends directly on the policy representation: categorical policies produce discrete score functions, Gaussian policies produce continuous ones, and the reparameterization trick provides an alternative lower-variance estimator for the Gaussian case.

Failure Mode

The score function estimator has high variance, especially for long horizons or large action spaces. Without baselines or variance reduction, the gradient signal is drowned in noise. For continuous actions, the reparameterization trick (when applicable) is strongly preferred over the score function estimator.
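The theorem can be checked numerically in the simplest possible setting, a one-step bandit, where the exact gradient is available in closed form. For a softmax policy over two arms with rewards $r_k$, $J(z) = \sum_k \pi_k r_k$ and $\partial J / \partial z_j = \pi_j (r_j - J)$. A Monte Carlo sketch (the setup is mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([0.5, -0.5])           # logits for a 2-armed bandit policy
r = np.array([1.0, 0.0])            # deterministic reward for each arm

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(z)
J = pi @ r                          # expected return of the policy
analytic = pi * (r - J)             # exact gradient: dJ/dz_j = pi_j (r_j - J)

# Score-function (REINFORCE) estimate: average of grad log pi(a) * r(a)
n = 200_000
actions = rng.choice(2, size=n, p=pi)
scores = np.eye(2)[actions] - pi    # score of each sampled action w.r.t. logits
estimate = (scores * r[actions, None]).mean(axis=0)
```

With enough samples the Monte Carlo estimate matches the closed-form gradient; with few samples it is dominated by noise, which is exactly the variance problem the Failure Mode describes.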

How Representation Affects Gradient Variance

Categorical policies: The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ changes sign depending on which action was sampled. With $K$ actions, most of the probability mass may sit on one action, making gradient estimates for rarely sampled actions noisy. Entropy regularization helps by keeping the policy from becoming too peaked.

Gaussian policies with REINFORCE: The score function is $(a - \mu)/\sigma^2$ for the mean and $((a - \mu)^2 - \sigma^2)/\sigma^3$ for the standard deviation. High-magnitude actions produce large score values, inflating gradient variance.

Gaussian policies with reparameterization: The gradient $\nabla_\theta Q(s, \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)$ is computed by backpropagation. Its variance depends on the smoothness of $Q$ rather than on the score function, and is typically much lower.
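The variance gap is easy to measure on a toy objective. Take $J(\mu) = \mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}[Q(a)]$ with $Q(a) = a^2$ (my choice for illustration), so the true gradient is $dJ/d\mu = 2\mu$. Both estimators below are unbiased, but for $\mu = 0$, $\sigma = 1$ the score-function estimator has variance $\mathbb{E}[a^6] = 15$ while the pathwise estimator has variance $4$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 100_000

# Score-function (REINFORCE) estimator: (d/dmu log pi(a)) * Q(a)
a = rng.normal(mu, sigma, n)
score_est = (a - mu) / sigma**2 * a**2

# Pathwise (reparameterized) estimator: d/dmu Q(mu + sigma * eps) = 2 (mu + sigma * eps)
eps = rng.standard_normal(n)
reparam_est = 2.0 * (mu + sigma * eps)

# Both sample means are near the true gradient (0 here), but the score-function
# estimator's variance is several times larger.
```

For smoother or flatter $Q$ the gap widens further, which is why reparameterization is preferred whenever the action is a differentiable function of the noise.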

Common Confusions

Watch Out

Softmax temperature is not the same as entropy regularization

Temperature scaling ($z/\tau$) changes the entropy of the policy at sampling time. Entropy regularization adds $\alpha H(\pi(\cdot \mid s))$ to the objective. They have similar effects (preventing premature convergence to deterministic policies) but operate differently: temperature is a hyperparameter applied to the logits, while entropy regularization adds a gradient signal that pushes the policy toward stochasticity.
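The temperature side of this distinction is purely a change of the sampling distribution: for fixed logits, entropy grows monotonically with $\tau$ toward the uniform maximum $\log K$, with no gradient signal involved. A small numerical check (helper name is mine):

```python
import numpy as np

def entropy_at_temperature(z, tau):
    """Entropy H(pi) of the policy pi = softmax(z / tau)."""
    y = z / tau
    p = np.exp(y - y.max())          # stable softmax
    p /= p.sum()
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.5])
# Entropy rises with tau and approaches log K = log 3 as tau -> infinity.
H = [entropy_at_temperature(z, t) for t in (0.1, 1.0, 10.0)]
```

Entropy regularization, by contrast, would add $\alpha \nabla_\theta H$ to the policy update so that the logits themselves are trained toward higher entropy.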

Watch Out

The reparameterization trick does not apply to categorical policies

Categorical distributions are not reparameterizable because the sample is discrete and the mapping from noise to action is not differentiable. Gumbel-Softmax provides a continuous relaxation that approximates reparameterization for discrete actions, but it is biased and requires a temperature parameter. For exact discrete policy gradients, the score function estimator is necessary.
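A minimal sketch of the Gumbel-Softmax relaxation makes the temperature dependence concrete: the relaxed sample always lies on the probability simplex, and for the same noise draw it sharpens toward a one-hot vector as $\tau \to 0$ (the function name and setup are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, g, tau):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau).

    Lies on the simplex; approaches one-hot as tau -> 0, and is
    differentiable in the logits for any fixed noise g and tau > 0.
    """
    y = (logits + g) / tau
    y = y - y.max()                  # numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.5])
g = -np.log(-np.log(rng.uniform(size=3)))   # one Gumbel(0, 1) noise draw
soft = gumbel_softmax(z, g, tau=1.0)         # smooth relaxed sample
hard = gumbel_softmax(z, g, tau=0.01)        # nearly one-hot, same noise
```

The bias mentioned above comes from using this relaxed sample in place of a true discrete sample; only in the $\tau \to 0$ limit does it recover the exact categorical draw, at the cost of exploding gradients.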

Key Takeaways

  • Categorical policies (softmax over logits) are standard for discrete action spaces
  • Diagonal Gaussian policies (learned mean and variance) are standard for continuous action spaces
  • Squashed Gaussians handle bounded action spaces via tanh transformation
  • The reparameterization trick reduces gradient variance for continuous policies by making sampling differentiable
  • Policy representation directly affects gradient variance, exploration behavior, and algorithm choice
  • Temperature and entropy regularization prevent premature convergence to deterministic policies

Exercises

ExerciseCore

Problem

A categorical policy over 3 actions outputs logits $z = (2.0, 1.0, 0.5)$. Compute the action probabilities and the score function $\nabla_z \log \pi(a = 1 \mid s)$.

ExerciseAdvanced

Problem

For a 1D Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a \mid \mu, \sigma^2)$ with $\mu = 0$ and $\sigma = 1$, compute the variance of the score-function estimator $\nabla_\mu \log \pi(a \mid s) \cdot a$ where $a \sim \pi$. Compare it to the variance of the reparameterization gradient $\nabla_\mu Q(\mu + \epsilon)$ where $Q(a) = a$ and $\epsilon \sim \mathcal{N}(0, 1)$.

References

Canonical:

  • Sutton and Barto, Reinforcement Learning: An Introduction (2018), Sections 13.1-13.4
  • Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning (1992)

Current:

  • Haarnoja et al., Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor, ICML (2018)
  • Schulman et al., Proximal Policy Optimization Algorithms, arXiv (2017)
  • Kingma and Welling, Auto-Encoding Variational Bayes, ICLR (2014), Section 2.4 (reparameterization trick)


Last reviewed: April 2026
