
Reinforcement Learning

Reward Design and Reward Misspecification

The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.


Why This Matters

The reward function is the specification language of RL. Everything the agent optimizes traces back to the reward signal. If the reward function accurately captures the designer's intent, the agent will (given sufficient capacity and exploration) learn the intended behavior. If the reward function is even slightly misspecified, the agent will find and exploit the gap.

This is not a hypothetical concern. OpenAI's agent in the boat-racing game CoastRunners learned to drive in circles collecting power-ups instead of finishing the race, because the in-game score (the reward proxy) rewarded power-ups more than race completion. DeepMind's agents have learned to exploit physics glitches in simulated environments. Language models fine-tuned with RLHF can learn to produce outputs that score well with the reward model while being unhelpful or dishonest.

Reward design connects classical RL theory to AI alignment. The same mathematical structure that allows reward shaping to speed up learning also shows why reward hacking is hard to prevent.

Prerequisites

This page assumes familiarity with MDPs and Bellman equations. You should understand optimal policies, value functions, and the relationship between reward and behavior.

Figure: designer intent → proxy reward → agent optimizes → unintended behavior ("shortcut found!"). Example: boat racing game. Reward = score; the agent goes in circles collecting power-ups instead of finishing the race. Example: content recommendation. Reward = engagement; the agent recommends outrage bait because it maximizes clicks. Goodhart's law: when the measure becomes the target, it ceases to be a good measure.

The Reward Hypothesis

Sutton's reward hypothesis states: all goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a scalar reward signal. This is a foundational assumption of RL, not a proven fact.

The hypothesis is powerful because it unifies diverse objectives under a single framework. But it places enormous weight on the reward function. A misspecified reward is not a bug in the algorithm; it is a bug in the problem definition. The algorithm will faithfully maximize whatever you give it.

Reward Shaping

Definition

Reward Shaping

Reward shaping adds a supplementary reward $F(s, a, s')$ to the environment reward:

$$r'(s, a, s') = r(s, a) + F(s, a, s')$$

The goal is to speed up learning by providing denser feedback (e.g., rewarding an agent for getting closer to a goal, not just for reaching it). The danger: an arbitrary shaping function $F$ can change the optimal policy.

Definition

Potential-Based Shaping Function

A potential-based shaping function has the form:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

where $\Phi: \mathcal{S} \to \mathbb{R}$ is a real-valued potential function on states and $\gamma$ is the discount factor. This specific form is the only class of shaping functions that preserves the optimal policy.
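To make the definition concrete, here is a minimal Python sketch on a hypothetical 5-state chain with the goal at state 4. The distance-based potential is an illustrative assumption, a common but not the only valid choice:

```python
def potential_shaping(phi, gamma):
    """Build F(s, a, s') = gamma * phi(s') - phi(s) from a potential phi."""
    def F(s, a, s_next):
        return gamma * phi(s_next) - phi(s)
    return F

# Hypothetical 5-state chain (states 0..4) with the goal at state 4.
# Negative distance to the goal is a common potential choice.
phi = lambda s: -abs(4 - s)
F = potential_shaping(phi, gamma=0.9)

# Moving toward the goal earns a shaping bonus, moving away a penalty.
assert F(1, "right", 2) > 0   # 0.9 * (-2) - (-3) = 1.2
assert F(2, "left", 1) < 0    # 0.9 * (-3) - (-2) = -0.7
```

Any bounded potential fits the theorem below: the optimal policy is preserved regardless of which $\Phi$ you pick, although (as discussed later) a poorly chosen potential can still slow learning.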

Main Theorems

Theorem

Potential-Based Shaping Preserves Optimal Policy

Statement

Let $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ be an MDP and let $M' = (\mathcal{S}, \mathcal{A}, P, R', \gamma)$ be the shaped MDP with $R'(s, a, s') = R(s, a) + \gamma \Phi(s') - \Phi(s)$ for some bounded function $\Phi: \mathcal{S} \to \mathbb{R}$. Then:

  1. The optimal policy of $M'$ is the same as the optimal policy of $M$
  2. The shaped value function satisfies $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ for every policy $\pi$
  3. The shaped Q-function satisfies $Q'^{\pi}(s, a) = Q^{\pi}(s, a) - \Phi(s)$ for every policy $\pi$

Moreover, if $F$ is any shaping function (not necessarily potential-based) that preserves the optimal policy for all MDPs with the same state-action space, then $F$ must be potential-based.

Intuition

The potential function $\Phi$ adds a "height" to each state. Moving from a low-potential state to a high-potential state gets a shaping bonus; moving the other way gets a penalty. Over any complete trajectory, the shaping rewards telescope:

$$\sum_{t=0}^{T} \gamma^t [\gamma \Phi(s_{t+1}) - \Phi(s_t)] = \gamma^{T+1} \Phi(s_{T+1}) - \Phi(s_0)$$

The $-\Phi(s_0)$ term is constant for all trajectories starting from $s_0$, and the $\gamma^{T+1} \Phi(s_{T+1})$ term vanishes as $T \to \infty$ (since $\gamma < 1$ and $\Phi$ is bounded). So the total shaping reward is (approximately) the same for all trajectories, and the ranking of policies is unchanged. The key: the potential-based form ensures telescoping. An arbitrary shaping function does not telescope and can create spurious incentives.
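The telescoping identity is easy to verify numerically; the potential values along the trajectory below are arbitrary:

```python
# Numeric check of the telescoping identity, with an arbitrary potential
# evaluated along an arbitrary 5-state trajectory s_0 .. s_4.
gamma = 0.9
phi = [3.0, -1.0, 2.5, 0.0, 7.0]   # Phi(s_t) for t = 0..4
T = 3                               # transitions t = 0..T use phi[0..T+1]

lhs = sum(gamma**t * (gamma * phi[t + 1] - phi[t]) for t in range(T + 1))
rhs = gamma**(T + 1) * phi[T + 1] - phi[0]
assert abs(lhs - rhs) < 1e-12      # telescoping: only the endpoints survive
```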

Proof Sketch

For any policy $\pi$ and starting state $s$:

$$V'^{\pi}(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \left[R(s_t, a_t) + \gamma\Phi(s_{t+1}) - \Phi(s_t)\right] \;\middle|\; s_0 = s\right]$$

Split the sum:

$$= V^{\pi}(s) + \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \left[\gamma\Phi(s_{t+1}) - \Phi(s_t)\right] \;\middle|\; s_0 = s\right]$$

The shaping sum telescopes: $\sum_{t=0}^{T} \gamma^t[\gamma\Phi(s_{t+1}) - \Phi(s_t)] = \gamma^{T+1}\Phi(s_{T+1}) - \Phi(s_0)$. As $T \to \infty$, the first term vanishes (bounded $\Phi$ and $\gamma < 1$), leaving $-\Phi(s_0)$. Therefore $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ for all $\pi$, so the policy ranking is unchanged.
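The conclusion $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ can be checked by policy evaluation on a toy MDP; the 3-state deterministic loop, rewards, and potential below are all made up for illustration:

```python
# Policy evaluation on a hypothetical 3-state deterministic loop, under the
# original reward and the shaped reward R'(s) = R(s) + gamma*Phi(s') - Phi(s).
gamma = 0.9
nxt = [1, 2, 0]                  # successor state under the fixed policy
R = [0.0, 1.0, -0.5]             # made-up rewards
Phi = [2.0, -1.0, 0.5]           # made-up bounded potential

def evaluate(reward):
    """Iterate the Bellman expectation backup to its fixed point."""
    V = [0.0] * 3
    for _ in range(2000):
        V = [reward[s] + gamma * V[nxt[s]] for s in range(3)]
    return V

shaped = [R[s] + gamma * Phi[nxt[s]] - Phi[s] for s in range(3)]
V = evaluate(R)
V_shaped = evaluate(shaped)

# Theorem part 2: V'^pi(s) = V^pi(s) - Phi(s) in every state.
assert all(abs(V_shaped[s] - (V[s] - Phi[s])) < 1e-9 for s in range(3))
```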

For the necessity direction (only potential-based shaping works universally): Ng, Harada, and Russell (1999) construct specific MDPs where any non-potential-based shaping changes the optimal policy.

Why It Matters

This theorem tells you exactly what reward modifications are "safe" (preserving the optimal policy) and which are dangerous. If you want to add a heuristic reward to speed up learning (e.g., reward closeness to goal), you must express it as a potential difference. Any other form risks changing what the agent actually optimizes.

The theorem also explains why naive reward shaping so often goes wrong. Adding a flat bonus for visiting certain states, or penalizing certain actions directly, is not potential-based and can create policies that collect shaping reward instead of solving the actual task.

Failure Mode

The theorem assumes the discount factor $\gamma < 1$. In the undiscounted case ($\gamma = 1$), the telescoping argument fails because $\gamma^{T+1}\Phi(s_{T+1})$ does not vanish. For episodic tasks with $\gamma = 1$, potential-based shaping requires $\Phi(s) = 0$ for all terminal states.
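A quick sketch of why terminal potentials matter when $\gamma = 1$: the shaping rewards along an episode sum exactly to $\Phi(s_T) - \Phi(s_0)$, so episodes ending in different terminal states collect different shaping totals. The states and potential values below are hypothetical:

```python
# With gamma = 1, shaping rewards along an episode telescope exactly to
# Phi(s_T) - Phi(s_0). Hypothetical potentials with two terminal states:
phi = {"start": 0.0, "mid": 1.0, "goal_A": 5.0, "goal_B": 0.0}

def shaping_total(path):
    # Undiscounted sum of F(s, s') = phi(s') - phi(s) along the path.
    return sum(phi[b] - phi[a] for a, b in zip(path, path[1:]))

# Episodes ending in different terminals collect different shaping totals,
# so the ranking of policies can change...
assert shaping_total(["start", "mid", "goal_A"]) == 5.0
assert shaping_total(["start", "mid", "goal_B"]) == 0.0
# ...which is why Phi must vanish on terminal states when gamma = 1.
```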

The theorem also says nothing about learning speed. Potential-based shaping preserves the optimal policy but can make learning faster or slower depending on the choice of $\Phi$. A poor potential (e.g., one that points away from the goal) preserves optimality but misleads exploration.

Proposition

Goodhart's Law for Reward Models

Statement

Let $\pi^*$ be the optimal policy under true reward $R$ and let $\hat{\pi}$ be the optimal policy under proxy reward $\hat{R}$. If $|\hat{R}(s,a) - R(s,a)| \leq \epsilon$ for all $(s,a)$, then:

$$V^{\pi^*}(s) - V^{\hat{\pi}}(s) \leq \frac{2\epsilon}{1 - \gamma}$$

where $V$ denotes the value function under the true reward $R$.

Intuition

The bound says that a proxy with per-step error $\epsilon$ can cause the agent to lose at most $2\epsilon/(1 - \gamma)$ in true value. This seems reassuring until you realize that $\epsilon$ is the worst-case error. In practice, the proxy may be accurate on typical state-action pairs but systematically wrong on the unusual state-action pairs that an optimizer specifically seeks out. An agent that optimizes the proxy hard enough will find the states where the proxy disagrees most with the true reward, and exploit them. This is the formal version of Goodhart's law: a measure that becomes a target ceases to be a good measure.

Proof Sketch

$V^{\hat{\pi}}(s)$ evaluated under the true reward $R$ is at least $V^{\hat{\pi}}_{\hat{R}}(s) - \epsilon/(1 - \gamma)$ (the value under the proxy minus the cumulative approximation error). Since $\hat{\pi}$ is optimal under $\hat{R}$, $V^{\hat{\pi}}_{\hat{R}}(s) \geq V^{\pi^*}_{\hat{R}}(s) \geq V^{\pi^*}(s) - \epsilon/(1 - \gamma)$. Combining gives $V^{\hat{\pi}}(s) \geq V^{\pi^*}(s) - 2\epsilon/(1-\gamma)$.
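The bound can also be checked empirically. The sketch below builds small random MDPs (sizes, seed, and reward ranges are arbitrary choices, and transitions are deterministic for simplicity), solves both the true and the proxy MDP by value iteration, and confirms the suboptimality gap never exceeds $2\epsilon/(1-\gamma)$:

```python
import random

# Empirical check of the 2*eps/(1-gamma) bound on small random MDPs.
random.seed(0)
gamma, nS, nA = 0.9, 5, 3

def value_iteration(R, T):
    V = [0.0] * nS
    for _ in range(1000):
        V = [max(R[s][a] + gamma * V[T[s][a]] for a in range(nA))
             for s in range(nS)]
    return V

def greedy_policy(R, T):
    V = value_iteration(R, T)
    return [max(range(nA), key=lambda a: R[s][a] + gamma * V[T[s][a]])
            for s in range(nS)]

def evaluate(pi, R, T):
    V = [0.0] * nS
    for _ in range(1000):
        V = [R[s][pi[s]] + gamma * V[T[s][pi[s]]] for s in range(nS)]
    return V

eps = 0.1
for _ in range(20):
    T = [[random.randrange(nS) for _ in range(nA)] for _ in range(nS)]
    R = [[random.uniform(-1, 1) for _ in range(nA)] for _ in range(nS)]
    R_hat = [[r + random.uniform(-eps, eps) for r in row] for row in R]  # proxy
    pi_star, pi_hat = greedy_policy(R, T), greedy_policy(R_hat, T)
    V_star, V_hat = evaluate(pi_star, R, T), evaluate(pi_hat, R, T)
    gap = max(V_star[s] - V_hat[s] for s in range(nS))
    assert -1e-9 <= gap <= 2 * eps / (1 - gamma) + 1e-9   # bound holds
```

In practice the observed gap is usually far below the worst-case bound; the bound bites when the proxy's error is adversarially arranged rather than random.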

Why It Matters

This provides a formal justification for the intuition that optimizing a proxy too hard is dangerous. The bound is proportional to $1/(1 - \gamma)$: the longer the effective horizon, the more damage a misspecified reward can do. This is directly relevant to RLHF, where the reward model is a learned proxy. Over-optimizing against the reward model (high KL divergence from the base policy) can produce outputs that score well on the proxy but are low quality by human judgment.

Specification Gaming

Specification gaming occurs when an agent achieves high reward through unintended behavior. Documented examples include:

  • CoastRunners (OpenAI): a boat racing agent learned to drive in circles collecting power-ups, catching fire, and crashing, because the score rewarded turbo boosts more than race completion.
  • Block stacking (OpenAI): a robot hand trained to stack blocks learned to flip the bottom block on top of the gripped block (exploiting that "stacking" was measured by the relative height of the blocks).
  • Lego grasping: a robot trained to grasp a Lego brick learned to slide it to the edge of the table where it could be pinched against the rim, rather than picking it up.
  • Evolution simulations: organisms evolved to be tall by exploiting physics engine bugs that allowed them to jitter at high frequency and gain height.

In each case, the agent found the shortest path to high reward, which was not the path the designer intended. The reward function was technically correct (the agent did maximize it) but semantically wrong (the designer wanted something different).

The Alignment Connection

Reward misspecification in RL is a microcosm of the alignment problem in AI:

Outer alignment: does the reward function $R$ capture the designer's true objective? This is the reward design problem. The potential-based shaping theorem shows the constraints on safe reward modification. Specification gaming shows what happens when the reward is even slightly off.

Inner alignment: does the learned agent actually optimize $R$, or has it learned a proxy objective during training that happens to correlate with $R$ on the training distribution? A neural network trained with RL may develop internal objectives that diverge from the reward signal in novel situations.

RLHF attempts to address outer alignment by learning the reward from human preferences rather than hand-specifying it. But the learned reward model is itself a proxy, subject to all the problems described above. The KL penalty in RLHF (constraining the policy to stay close to the base model) is a practical response to Goodhart's law: it limits how hard the agent can optimize the proxy.
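In the single-step ("bandit") case, the KL-regularized objective has a known closed-form maximizer, $\pi(a) \propto \pi_{\text{ref}}(a)\exp(\hat{R}(a)/\beta)$, which makes the role of the KL penalty easy to see. The reference probabilities and rewards below are made up, with action 2 standing in for a proxy-hacking output that the reward model overrates:

```python
import math

# Toy single-prompt ("bandit") version of the KL-regularized RLHF objective.
# Its maximizer is pi(a) proportional to pi_ref(a) * exp(R_hat(a) / beta).
pi_ref = [0.5, 0.4, 0.1]   # made-up reference policy
R_hat = [1.0, 1.2, 3.0]    # made-up proxy rewards; action 2 is overrated

def tilted_policy(beta):
    w = [p * math.exp(r / beta) for p, r in zip(pi_ref, R_hat)]
    z = sum(w)
    return [x / z for x in w]

# Small beta: the policy piles onto the highest proxy score (Goodhart regime).
assert tilted_policy(0.01)[2] > 0.999
# Large beta: the policy barely moves from the reference distribution.
assert all(abs(p - q) < 0.05 for p, q in zip(tilted_policy(100.0), pi_ref))
```

As $\beta \to 0$ the anchor disappears and the policy collapses onto whatever the proxy scores highest; as $\beta \to \infty$ it reduces to the reference policy and ignores the proxy entirely.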

Common Confusions

Watch Out

Reward shaping is not the same as reward engineering

Reward shaping (technically) means adding a supplementary reward to the environment reward, where the goal is to speed up learning without changing the optimal policy. Reward engineering is the broader task of designing the reward function from scratch. The potential-based shaping theorem applies only to shaping, not to the initial design.

Watch Out

Potential-based shaping preserves the optimal policy but not the learning dynamics

Two MDPs with the same optimal policy can have very different learning curves. A good potential function (e.g., $\Phi(s) = -d(s, \text{goal})$) provides dense reward near the goal and speeds up learning. A bad potential function (e.g., $\Phi(s) = d(s, \text{goal})$) provides anti-guidance and slows learning, even though the optimal policy is the same.

Watch Out

Goodhart's law is not about noise in the reward

Random noise in the reward signal averages out and has little effect on the optimal policy. Goodhart's law is about systematic bias: the proxy reward $\hat{R}$ disagrees with the true reward $R$ in a structured way that an optimizer can exploit. The danger is not that the proxy is noisy, but that it is wrong in a direction the optimizer can push on.
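The contrast can be sketched with a two-armed bandit (all values below are made up): zero-mean noise washes out under averaging, while a systematic bias survives any amount of sampling:

```python
import random

# Two-armed bandit with made-up true values; arm 0 is genuinely better.
random.seed(0)
true_R = [1.0, 0.8]

# Zero-mean noise: averaging enough samples recovers the correct ranking.
noisy_mean = [sum(true_R[a] + random.gauss(0, 0.5) for _ in range(10_000)) / 10_000
              for a in range(2)]
assert noisy_mean[0] > noisy_mean[1]

# Systematic bias: a proxy that consistently overrates arm 1 picks the worse
# arm no matter how many samples are averaged.
proxy_R = [true_R[0], true_R[1] + 0.5]
assert proxy_R[1] > proxy_R[0]
```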

Watch Out

Sparse reward is not always bad

Sparse rewards (e.g., +1 for reaching the goal, 0 otherwise) are hard for exploration but do not suffer from specification gaming. If you give +1 only for the exact intended outcome, the agent cannot hack the reward. The problem with sparse rewards is credit assignment and exploration, not misspecification. Dense rewards are easier to learn from but harder to specify correctly.

Key Takeaways

  • The reward function is the specification of the task; misspecification leads to unintended behavior
  • Potential-based shaping ($F = \gamma \Phi(s') - \Phi(s)$) is the only form of reward modification that universally preserves the optimal policy
  • Specification gaming is a predictable consequence of reward optimization, not a bug
  • Goodhart's law: optimizing a proxy hard enough will exploit the proxy-true reward gap
  • RLHF uses a learned reward model (proxy) with KL regularization to limit over-optimization
  • Sparse rewards avoid specification gaming but create hard exploration problems
  • The reward design problem is a concrete, mathematical version of the alignment problem

Exercises

ExerciseCore

Problem

A maze agent receives reward $+1$ for reaching the goal and $0$ otherwise. You want to add shaping to encourage the agent to move toward the goal. Propose a potential function $\Phi(s)$ and verify that the resulting shaping function $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ gives positive reward for moving closer to the goal.

ExerciseCore

Problem

Consider a non-potential-based shaping function $F(s, a, s') = +1$ for all transitions (a constant bonus). Show that this can change the optimal policy by constructing a specific MDP where it does.

ExerciseAdvanced

Problem

In RLHF, the objective is $\max_\pi \mathbb{E}_{s \sim d}[\hat{R}(s, \pi(s))] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ where $\hat{R}$ is the learned reward model and $\pi_{\text{ref}}$ is the reference policy. Explain why the KL term acts as a defense against Goodhart's law. What happens as $\beta \to 0$ and $\beta \to \infty$?

References

Canonical:

  • Ng, Harada, Russell, "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping" (ICML 1999)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 17 (Frontiers)

Specification Gaming:

  • Krakovna et al., "Specification Gaming: the Flip Side of AI Ingenuity" (DeepMind Blog, 2020)
  • Amodei et al., "Concrete Problems in AI Safety" (2016), Section 3

RLHF and Alignment:

  • Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023)
  • Ziegler et al., "Fine-Tuning Language Models from Human Preferences" (2019)
  • Goodhart, C.A.E., "Problems of Monetary Management: The UK Experience" (1975), the original statement

Next Topics

  • Reward hacking: deeper treatment of exploitation behaviors and mitigation strategies
  • RLHF deep dive: how reward models are trained and deployed in language model alignment

Last reviewed: April 2026
