
Reinforcement Learning

Reward Design and Reward Misspecification

The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.


Why This Matters

The reward function is the specification language of RL. Everything the agent optimizes traces back to the reward signal. If the reward function accurately captures the designer's intent, the agent will (given sufficient capacity and exploration) learn the intended behavior. If the reward function is even slightly misspecified, the agent will find and exploit the gap.

This is not a hypothetical concern. OpenAI's agent in the boat-racing game CoastRunners learned to drive in circles collecting power-ups instead of finishing the race, because the in-game score (the reward proxy) rewarded power-ups more than race completion. DeepMind's agents have learned to exploit physics glitches in simulated environments. Language models fine-tuned with RLHF can learn to produce outputs that score well with the reward model while being unhelpful or dishonest.

Reward design connects classical RL theory to AI alignment. The same mathematical structure that allows reward shaping to speed up learning also shows why reward hacking is hard to prevent.

Prerequisites

This page assumes familiarity with MDPs and Bellman equations. You should understand optimal policies, value functions, and the relationship between reward and behavior.

Figure: designer intent → proxy reward → agent optimizes → unintended behavior ("shortcut found!"). Example: boat racing game. Reward = score; the agent goes in circles collecting power-ups instead of finishing the race. Example: content recommendation. Reward = engagement; the agent recommends outrage bait because it maximizes clicks. Goodhart's law: when the measure becomes the target, it ceases to be a good measure.

The Reward Hypothesis

Sutton's reward hypothesis states: all goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a scalar reward signal. This is a foundational assumption of RL, not a proven fact.

The hypothesis is powerful because it unifies diverse objectives under a single framework. But it places enormous weight on the reward function. A misspecified reward is not a bug in the algorithm; it is a bug in the problem definition. The algorithm will faithfully maximize whatever you give it.

Reward Shaping

Definition

Reward Shaping

Reward shaping adds a supplementary reward $F(s, a, s')$ to the environment reward:

$$r'(s, a, s') = r(s, a) + F(s, a, s')$$

The goal is to speed up learning by providing denser feedback (e.g., rewarding an agent for getting closer to a goal, not just for reaching it). The danger: an arbitrary shaping function $F$ can change the optimal policy.

Definition

Potential-Based Shaping Function

A potential-based shaping function has the form:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

where $\Phi: \mathcal{S} \to \mathbb{R}$ is a real-valued potential function on states and $\gamma$ is the discount factor. This specific form is the only class of shaping functions that preserves the optimal policy.
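To make the definition concrete, here is a minimal Python sketch on a hypothetical 5-state chain with the goal at state 4. The distance-based potential is an illustrative assumption, a common but not the only valid choice:

```python
def potential_shaping(phi, gamma):
    """Build F(s, a, s') = gamma * phi(s') - phi(s) from a potential phi."""
    def F(s, a, s_next):
        return gamma * phi(s_next) - phi(s)
    return F

# Hypothetical 5-state chain (states 0..4) with the goal at state 4.
# Negative distance to the goal is a common potential choice.
phi = lambda s: -abs(4 - s)
F = potential_shaping(phi, gamma=0.9)

# Moving toward the goal earns a shaping bonus, moving away a penalty.
assert F(1, "right", 2) > 0   # 0.9 * (-2) - (-3) = 1.2
assert F(2, "left", 1) < 0    # 0.9 * (-3) - (-2) = -0.7
```

Any bounded potential fits the theorem below: the optimal policy is preserved regardless of which $\Phi$ you pick, although (as discussed later) a poorly chosen potential can still slow learning.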

Main Theorems

Theorem

Potential-Based Shaping Preserves Optimal Policy

Statement

Let $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ be an MDP and let $M' = (\mathcal{S}, \mathcal{A}, P, R', \gamma)$ be the shaped MDP with $R'(s, a, s') = R(s, a) + \gamma \Phi(s') - \Phi(s)$ for some bounded function $\Phi: \mathcal{S} \to \mathbb{R}$. Then:

  1. The optimal policy of $M'$ is the same as the optimal policy of $M$
  2. The shaped value function satisfies $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ for every policy $\pi$
  3. The shaped Q-function satisfies $Q'^{\pi}(s, a) = Q^{\pi}(s, a) - \Phi(s)$ for every policy $\pi$

Moreover, if $F$ is any shaping function (not necessarily potential-based) that preserves the optimal policy for all MDPs with the same state-action space, then $F$ must be potential-based.

Intuition

The potential function $\Phi$ adds a "height" to each state. Moving from a low-potential state to a high-potential state gets a shaping bonus; moving the other way gets a penalty. Over any complete trajectory, the shaping rewards telescope:

$$\sum_{t=0}^{T} \gamma^t [\gamma \Phi(s_{t+1}) - \Phi(s_t)] = \gamma^{T+1} \Phi(s_{T+1}) - \Phi(s_0)$$

The $-\Phi(s_0)$ term is constant for all trajectories starting from $s_0$, and the $\gamma^{T+1} \Phi(s_{T+1})$ term vanishes as $T \to \infty$ (since $\gamma < 1$ and $\Phi$ is bounded). So the total shaping reward is (approximately) the same for all trajectories, and the ranking of policies is unchanged. The key: the potential-based form ensures telescoping. An arbitrary shaping function does not telescope and can create spurious incentives.
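The telescoping identity is easy to verify numerically; the potential values along the trajectory below are arbitrary:

```python
# Numeric check of the telescoping identity, with an arbitrary potential
# evaluated along an arbitrary 5-state trajectory s_0 .. s_4.
gamma = 0.9
phi = [3.0, -1.0, 2.5, 0.0, 7.0]   # Phi(s_t) for t = 0..4
T = 3                               # transitions t = 0..T use phi[0..T+1]

lhs = sum(gamma**t * (gamma * phi[t + 1] - phi[t]) for t in range(T + 1))
rhs = gamma**(T + 1) * phi[T + 1] - phi[0]
assert abs(lhs - rhs) < 1e-12      # telescoping: only the endpoints survive
```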

Proof Sketch

For any policy $\pi$ and starting state $s$:

$$V'^{\pi}(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \left[R(s_t, a_t) + \gamma\Phi(s_{t+1}) - \Phi(s_t)\right] \;\middle|\; s_0 = s\right]$$

Split the sum:

$$= V^{\pi}(s) + \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \left[\gamma\Phi(s_{t+1}) - \Phi(s_t)\right] \;\middle|\; s_0 = s\right]$$

The shaping sum telescopes: $\sum_{t=0}^{T} \gamma^t[\gamma\Phi(s_{t+1}) - \Phi(s_t)] = \gamma^{T+1}\Phi(s_{T+1}) - \Phi(s_0)$. As $T \to \infty$, the first term vanishes (bounded $\Phi$ and $\gamma < 1$), leaving $-\Phi(s_0)$. Therefore $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ for all $\pi$, so the policy ranking is unchanged.
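The conclusion $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ can be checked by policy evaluation on a toy MDP; the 3-state deterministic loop, rewards, and potential below are all made up for illustration:

```python
# Policy evaluation on a hypothetical 3-state deterministic loop, under the
# original reward and the shaped reward R'(s) = R(s) + gamma*Phi(s') - Phi(s).
gamma = 0.9
nxt = [1, 2, 0]                  # successor state under the fixed policy
R = [0.0, 1.0, -0.5]             # made-up rewards
Phi = [2.0, -1.0, 0.5]           # made-up bounded potential

def evaluate(reward):
    """Iterate the Bellman expectation backup to its fixed point."""
    V = [0.0] * 3
    for _ in range(2000):
        V = [reward[s] + gamma * V[nxt[s]] for s in range(3)]
    return V

shaped = [R[s] + gamma * Phi[nxt[s]] - Phi[s] for s in range(3)]
V = evaluate(R)
V_shaped = evaluate(shaped)

# Theorem part 2: V'^pi(s) = V^pi(s) - Phi(s) in every state.
assert all(abs(V_shaped[s] - (V[s] - Phi[s])) < 1e-9 for s in range(3))
```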

For the necessity direction (only potential-based shaping works universally): Ng, Harada, and Russell (1999) construct specific MDPs where any non-potential-based shaping changes the optimal policy.

Why It Matters

This theorem tells you exactly what reward modifications are "safe" (preserving the optimal policy) and which are dangerous. If you want to add a heuristic reward to speed up learning (e.g., reward closeness to goal), you must express it as a potential difference. Any other form risks changing what the agent actually optimizes.

The theorem also explains why naive reward shaping so often goes wrong. Adding a flat bonus for visiting certain states, or penalizing certain actions directly, is not potential-based and can create policies that collect shaping reward instead of solving the actual task.

Failure Mode

The theorem assumes the discount factor $\gamma < 1$. In the undiscounted case ($\gamma = 1$), the telescoping argument fails because $\gamma^{T+1}\Phi(s_{T+1})$ does not vanish. For episodic tasks with $\gamma = 1$, potential-based shaping requires $\Phi(s) = 0$ for all terminal states.
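A quick sketch of why terminal potentials matter when $\gamma = 1$: the shaping rewards along an episode sum exactly to $\Phi(s_T) - \Phi(s_0)$, so episodes ending in different terminal states collect different shaping totals. The states and potential values below are hypothetical:

```python
# With gamma = 1, shaping rewards along an episode telescope exactly to
# Phi(s_T) - Phi(s_0). Hypothetical potentials with two terminal states:
phi = {"start": 0.0, "mid": 1.0, "goal_A": 5.0, "goal_B": 0.0}

def shaping_total(path):
    # Undiscounted sum of F(s, s') = phi(s') - phi(s) along the path.
    return sum(phi[b] - phi[a] for a, b in zip(path, path[1:]))

# Episodes ending in different terminals collect different shaping totals,
# so the ranking of policies can change...
assert shaping_total(["start", "mid", "goal_A"]) == 5.0
assert shaping_total(["start", "mid", "goal_B"]) == 0.0
# ...which is why Phi must vanish on terminal states when gamma = 1.
```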

The theorem also says nothing about learning speed. Potential-based shaping preserves the optimal policy but can make learning faster or slower depending on the choice of $\Phi$. A poor potential (e.g., one that points away from the goal) preserves optimality but misleads exploration.

Proposition

Goodhart's Law for Reward Models

Statement

Let $\pi^*$ be the optimal policy under true reward $R$ and let $\hat{\pi}$ be the optimal policy under proxy reward $\hat{R}$. If $|\hat{R}(s,a) - R(s,a)| \leq \epsilon$ for all $(s,a)$, then:

$$V^{\pi^*}(s) - V^{\hat{\pi}}(s) \leq \frac{2\epsilon}{1 - \gamma}$$

where $V$ denotes the value function under the true reward $R$.

Intuition

The bound says that a proxy with per-step error $\epsilon$ can cause the agent to lose at most $2\epsilon/(1 - \gamma)$ in true value. This seems reassuring until you realize that $\epsilon$ is the worst-case error. In practice, the proxy may be accurate on typical state-action pairs but systematically wrong on the unusual state-action pairs that an optimizer specifically seeks out. An agent that optimizes the proxy hard enough will find the states where the proxy disagrees most with the true reward, and exploit them. This is the formal version of Goodhart's law: a measure that becomes a target ceases to be a good measure.

Proof Sketch

$V^{\hat{\pi}}(s)$ evaluated under the true reward $R$ is at least $V^{\hat{\pi}}_{\hat{R}}(s) - \epsilon/(1 - \gamma)$ (the value under the proxy minus the cumulative approximation error). Since $\hat{\pi}$ is optimal under $\hat{R}$, $V^{\hat{\pi}}_{\hat{R}}(s) \geq V^{\pi^*}_{\hat{R}}(s) \geq V^{\pi^*}(s) - \epsilon/(1 - \gamma)$. Combining gives $V^{\hat{\pi}}(s) \geq V^{\pi^*}(s) - 2\epsilon/(1-\gamma)$.
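The bound can also be checked empirically. The sketch below builds small random MDPs (sizes, seed, and reward ranges are arbitrary choices, and transitions are deterministic for simplicity), solves both the true and the proxy MDP by value iteration, and confirms the suboptimality gap never exceeds $2\epsilon/(1-\gamma)$:

```python
import random

# Empirical check of the 2*eps/(1-gamma) bound on small random MDPs.
random.seed(0)
gamma, nS, nA = 0.9, 5, 3

def value_iteration(R, T):
    V = [0.0] * nS
    for _ in range(1000):
        V = [max(R[s][a] + gamma * V[T[s][a]] for a in range(nA))
             for s in range(nS)]
    return V

def greedy_policy(R, T):
    V = value_iteration(R, T)
    return [max(range(nA), key=lambda a: R[s][a] + gamma * V[T[s][a]])
            for s in range(nS)]

def evaluate(pi, R, T):
    V = [0.0] * nS
    for _ in range(1000):
        V = [R[s][pi[s]] + gamma * V[T[s][pi[s]]] for s in range(nS)]
    return V

eps = 0.1
for _ in range(20):
    T = [[random.randrange(nS) for _ in range(nA)] for _ in range(nS)]
    R = [[random.uniform(-1, 1) for _ in range(nA)] for _ in range(nS)]
    R_hat = [[r + random.uniform(-eps, eps) for r in row] for row in R]  # proxy
    pi_star, pi_hat = greedy_policy(R, T), greedy_policy(R_hat, T)
    V_star, V_hat = evaluate(pi_star, R, T), evaluate(pi_hat, R, T)
    gap = max(V_star[s] - V_hat[s] for s in range(nS))
    assert -1e-9 <= gap <= 2 * eps / (1 - gamma) + 1e-9   # bound holds
```

In practice the observed gap is usually far below the worst-case bound; the bound bites when the proxy's error is adversarially arranged rather than random.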

Why It Matters

This provides a formal justification for the intuition that optimizing a proxy too hard is dangerous. The bound is proportional to $1/(1 - \gamma)$: the longer the effective horizon, the more damage a misspecified reward can do. This is directly relevant to RLHF, where the reward model is a learned proxy. Over-optimizing against the reward model (high KL divergence from the base policy) can produce outputs that score well on the proxy but are low quality by human judgment.

Specification Gaming

Specification gaming occurs when an agent achieves high reward through unintended behavior. Documented examples include:

  • CoastRunners (OpenAI): a boat racing agent learned to drive in circles collecting power-ups, catching fire, and crashing, because the score rewarded turbo boosts more than race completion.
  • Block stacking (OpenAI): a robot hand trained to stack blocks learned to flip the bottom block on top of the gripped block (exploiting that "stacking" was measured by the relative height of the blocks).
  • Lego grasping: a robot trained to grasp a Lego brick learned to slide it to the edge of the table where it could be pinched against the rim, rather than picking it up.
  • Evolution simulations: organisms evolved to be tall by exploiting physics engine bugs that allowed them to jitter at high frequency and gain height.

In each case, the agent found the shortest path to high reward, which was not the path the designer intended. The reward function was technically correct (the agent did maximize it) but semantically wrong (the designer wanted something different).

The Alignment Connection

Reward misspecification in RL is a microcosm of the alignment problem in AI:

Outer alignment: does the reward function $R$ capture the designer's true objective? This is the reward design problem. The potential-based shaping theorem shows the constraints on safe reward modification. Specification gaming shows what happens when the reward is even slightly off.

Inner alignment: does the learned agent actually optimize $R$, or has it learned a proxy objective during training that happens to correlate with $R$ on the training distribution? A neural network trained with RL may develop internal objectives that diverge from the reward signal in novel situations.

RLHF attempts to address outer alignment by learning the reward from human preferences rather than hand-specifying it. But the learned reward model is itself a proxy, subject to all the problems described above. The KL penalty in RLHF (constraining the policy to stay close to the base model) is a practical response to Goodhart's law: it limits how hard the agent can optimize the proxy.
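In the single-step ("bandit") case, the KL-regularized objective has a known closed-form maximizer, $\pi(a) \propto \pi_{\text{ref}}(a)\exp(\hat{R}(a)/\beta)$, which makes the role of the KL penalty easy to see. The reference probabilities and rewards below are made up, with action 2 standing in for a proxy-hacking output that the reward model overrates:

```python
import math

# Toy single-prompt ("bandit") version of the KL-regularized RLHF objective.
# Its maximizer is pi(a) proportional to pi_ref(a) * exp(R_hat(a) / beta).
pi_ref = [0.5, 0.4, 0.1]   # made-up reference policy
R_hat = [1.0, 1.2, 3.0]    # made-up proxy rewards; action 2 is overrated

def tilted_policy(beta):
    w = [p * math.exp(r / beta) for p, r in zip(pi_ref, R_hat)]
    z = sum(w)
    return [x / z for x in w]

# Small beta: the policy piles onto the highest proxy score (Goodhart regime).
assert tilted_policy(0.01)[2] > 0.999
# Large beta: the policy barely moves from the reference distribution.
assert all(abs(p - q) < 0.05 for p, q in zip(tilted_policy(100.0), pi_ref))
```

As $\beta \to 0$ the anchor disappears and the policy collapses onto whatever the proxy scores highest; as $\beta \to \infty$ it reduces to the reference policy and ignores the proxy entirely.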

Common Confusions

Watch Out

Reward shaping is not the same as reward engineering

Reward shaping (technically) means adding a supplementary reward to the environment reward, where the goal is to speed up learning without changing the optimal policy. Reward engineering is the broader task of designing the reward function from scratch. The potential-based shaping theorem applies only to shaping, not to the initial design.

Watch Out

Potential-based shaping preserves the optimal policy but not the learning dynamics

Two MDPs with the same optimal policy can have very different learning curves. A good potential function (e.g., $\Phi(s) = -d(s, \text{goal})$) provides dense reward near the goal and speeds up learning. A bad potential function (e.g., $\Phi(s) = d(s, \text{goal})$) provides anti-guidance and slows learning, even though the optimal policy is the same.

Watch Out

Goodhart's law is not about noise in the reward

Random noise in the reward signal averages out and has little effect on the optimal policy. Goodhart's law is about systematic bias: the proxy reward $\hat{R}$ disagrees with the true reward $R$ in a structured way that an optimizer can exploit. The danger is not that the proxy is noisy, but that it is wrong in a direction the optimizer can push on.
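The contrast can be sketched with a two-armed bandit (all values below are made up): zero-mean noise washes out under averaging, while a systematic bias survives any amount of sampling:

```python
import random

# Two-armed bandit with made-up true values; arm 0 is genuinely better.
random.seed(0)
true_R = [1.0, 0.8]

# Zero-mean noise: averaging enough samples recovers the correct ranking.
noisy_mean = [sum(true_R[a] + random.gauss(0, 0.5) for _ in range(10_000)) / 10_000
              for a in range(2)]
assert noisy_mean[0] > noisy_mean[1]

# Systematic bias: a proxy that consistently overrates arm 1 picks the worse
# arm no matter how many samples are averaged.
proxy_R = [true_R[0], true_R[1] + 0.5]
assert proxy_R[1] > proxy_R[0]
```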

Watch Out

Sparse reward is not always bad

Sparse rewards (e.g., +1 for reaching the goal, 0 otherwise) are hard for exploration but do not suffer from specification gaming. If you give +1 only for the exact intended outcome, the agent cannot hack the reward. The problem with sparse rewards is credit assignment and exploration, not misspecification. Dense rewards are easier to learn from but harder to specify correctly.

Key Takeaways

  • The reward function is the specification of the task; misspecification leads to unintended behavior
  • Potential-based shaping ($F = \gamma \Phi(s') - \Phi(s)$) is the only form of reward modification that universally preserves the optimal policy
  • Specification gaming is a predictable consequence of reward optimization, not a bug
  • Goodhart's law: optimizing a proxy hard enough will exploit the proxy-true reward gap
  • RLHF uses a learned reward model (proxy) with KL regularization to limit over-optimization
  • Sparse rewards avoid specification gaming but create hard exploration problems
  • The reward design problem is a concrete, mathematical version of the alignment problem

Exercises

ExerciseCore

Problem

A maze agent receives reward $+1$ for reaching the goal and $0$ otherwise. You want to add shaping to encourage the agent to move toward the goal. Propose a potential function $\Phi(s)$ and verify that the resulting shaping function $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ gives positive reward for moving closer to the goal.

ExerciseCore

Problem

Consider a non-potential-based shaping function $F(s, a, s') = +1$ for all transitions (a constant bonus). Show that this can change the optimal policy by constructing a specific MDP where it does.

ExerciseAdvanced

Problem

In RLHF, the objective is $\max_\pi \mathbb{E}_{s \sim d}[\hat{R}(s, \pi(s))] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ where $\hat{R}$ is the learned reward model and $\pi_{\text{ref}}$ is the reference policy. Explain why the KL term acts as a defense against Goodhart's law. What happens as $\beta \to 0$ and $\beta \to \infty$?

References

Canonical:

  • Ng, Harada, Russell, "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping" (ICML 1999)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 17 (Frontiers)

Specification Gaming:

  • Krakovna et al., "Specification Gaming: the Flip Side of AI Ingenuity" (DeepMind Blog, 2020)
  • Amodei et al., "Concrete Problems in AI Safety" (2016), Section 3

RLHF and Alignment:

  • Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023)
  • Ziegler et al., "Fine-Tuning Language Models from Human Preferences" (2019)
  • Goodhart, C.A.E., "Problems of Monetary Management: The UK Experience" (1975), the original statement

Next Topics

  • Reward hacking: deeper treatment of exploitation behaviors and mitigation strategies
  • RLHF deep dive: how reward models are trained and deployed in language model alignment

Last reviewed: April 2026
