Reinforcement Learning
Reward Design and Reward Misspecification
The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.
Why This Matters
The reward function is the specification language of RL. Everything the agent optimizes traces back to the reward signal. If the reward function accurately captures the designer's intent, the agent will (given sufficient capacity and exploration) learn the intended behavior. If the reward function is even slightly misspecified, the agent will find and exploit the gap.
This is not a hypothetical concern. OpenAI's CoastRunners boat learned to drive in circles collecting power-ups instead of finishing the race, because the score (the reward proxy) rewarded power-ups more than race completion. DeepMind's agents learned to exploit physics glitches in simulated environments. Language models fine-tuned with RLHF can learn to produce outputs that score well with the reward model while being unhelpful or dishonest.
Reward design connects classical RL theory to AI alignment. The same mathematical structure that allows reward shaping to speed up learning also shows why reward hacking is hard to prevent.
Prerequisites
This page assumes familiarity with MDPs and Bellman equations. You should understand optimal policies, value functions, and the relationship between reward and behavior.
The Reward Hypothesis
Sutton's reward hypothesis states: all goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a scalar reward signal. This is a foundational assumption of RL, not a proven fact.
The hypothesis is powerful because it unifies diverse objectives under a single framework. But it places enormous weight on the reward function. A misspecified reward is not a bug in the algorithm; it is a bug in the problem definition. The algorithm will faithfully maximize whatever you give it.
Reward Shaping
Reward shaping adds a supplementary reward $F$ to the environment reward:

$$R'(s, a, s') = R(s, a, s') + F(s, a, s')$$
The goal is to speed up learning by providing denser feedback (e.g., rewarding an agent for getting closer to a goal, not just for reaching it). The danger: an arbitrary shaping function can change the optimal policy.
Potential-Based Shaping Function
A potential-based shaping function has the form:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

where $\Phi : S \to \mathbb{R}$ is a real-valued potential function on states and $\gamma$ is the discount factor. This specific form is the only class of shaping functions that preserves the optimal policy.
Main Theorems
Potential-Based Shaping Preserves Optimal Policy
Statement
Let $M = (S, A, P, R, \gamma)$ be an MDP and let $M' = (S, A, P, R + F, \gamma)$ be the shaped MDP with $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ for some bounded function $\Phi : S \to \mathbb{R}$. Then:
- The optimal policy of $M'$ is the same as the optimal policy of $M$
- The shaped value function satisfies $V'^{\pi}(s) = V^{\pi}(s) - \Phi(s)$ for every policy $\pi$
- The shaped Q-function satisfies $Q'^{\pi}(s, a) = Q^{\pi}(s, a) - \Phi(s)$ for every policy $\pi$
Moreover, if $F$ is any shaping function (not necessarily potential-based) that preserves the optimal policy for all MDPs with the same state-action space, then $F$ must be potential-based.
Intuition
The potential function adds a "height" to each state. Moving from a low-potential state to a high-potential state earns a shaping bonus; moving the other way incurs a penalty. Over any trajectory of length $T$ starting from $s_0$, the shaping rewards telescope:

$$\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1}) = \sum_{t=0}^{T-1} \gamma^t \left( \gamma \Phi(s_{t+1}) - \Phi(s_t) \right) = \gamma^T \Phi(s_T) - \Phi(s_0)$$

The term $-\Phi(s_0)$ is constant for all trajectories starting from $s_0$, and the term $\gamma^T \Phi(s_T)$ vanishes as $T \to \infty$ (since $\gamma < 1$ and $\Phi$ is bounded). So the total shaping reward is (approximately) the same for all trajectories, and the ranking of policies is unchanged. The key: the potential-based form ensures telescoping. An arbitrary shaping function does not telescope and can create spurious incentives.
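The telescoping identity is easy to check numerically. A minimal sketch, using an arbitrary made-up potential (any bounded $\Phi$ and any state path work):

```python
import random

GAMMA = 0.9
phi = lambda s: (1.7 * s - 3.2) ** 2          # arbitrary bounded potential (illustrative)

# An arbitrary state trajectory: the telescoping identity holds path by path.
random.seed(1)
traj = [random.randint(0, 5) for _ in range(200)]
T = len(traj) - 1

shaping_sum = sum(GAMMA**t * (GAMMA * phi(traj[t + 1]) - phi(traj[t]))
                  for t in range(T))
closed_form = GAMMA**T * phi(traj[T]) - phi(traj[0])

assert abs(shaping_sum - closed_form) < 1e-9   # telescoped total: gamma^T * phi(s_T) - phi(s_0)
```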
Proof Sketch
For any policy $\pi$ and starting state $s_0$:

$$V'^{\pi}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t, s_{t+1}) + F(s_t, a_t, s_{t+1}) \right) \right]$$

Split the sum:

$$V'^{\pi}(s_0) = V^{\pi}(s_0) + \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t F(s_t, a_t, s_{t+1}) \right]$$

The shaping sum telescopes: $\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1}) = \gamma^T \Phi(s_T) - \Phi(s_0)$. As $T \to \infty$, the first term vanishes ($\Phi$ bounded and $\gamma < 1$), leaving $-\Phi(s_0)$. Therefore $V'^{\pi}(s_0) = V^{\pi}(s_0) - \Phi(s_0)$ for all $\pi$, so the policy ranking is unchanged.
For the necessity direction (only potential-based shaping works universally): Ng, Harada, and Russell (1999) construct specific MDPs where any non-potential-based shaping changes the optimal policy.
Why It Matters
This theorem tells you exactly what reward modifications are "safe" (preserving the optimal policy) and which are dangerous. If you want to add a heuristic reward to speed up learning (e.g., reward closeness to goal), you must express it as a potential difference. Any other form risks changing what the agent actually optimizes.
The theorem also explains why naive reward shaping so often goes wrong. Adding a flat bonus for visiting certain states, or penalizing certain actions directly, is not potential-based and can create policies that collect shaping reward instead of solving the actual task.
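To see this concretely, here is a value-iteration sketch on a hypothetical 4-state chain (states 0-2 plus a terminal goal state 3, all rewards invented for illustration). A flat "near the goal" bonus, which is not potential-based, flips the optimal action at state 2 so the agent loops to re-collect the bonus instead of finishing; the potential-based version leaves the policy intact.

```python
GAMMA = 0.9
STATES = (0, 1, 2)          # state 3 is the terminal goal (absorbing, value 0)
ACTIONS = ("L", "R")

def step(s, a):
    """Deterministic chain dynamics; +1 base reward for entering the goal."""
    s2 = max(0, s - 1) if a == "L" else s + 1
    return s2, (1.0 if s2 == 3 else 0.0)

def optimal_policy(F, sweeps=500):
    """Value iteration on the shaped reward R + F, then the greedy policy."""
    V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
    def q(s, a):
        s2, r = step(s, a)
        return r + F(s, a, s2) + GAMMA * V[s2]
    for _ in range(sweeps):
        for s in STATES:
            V[s] = max(q(s, a) for a in ACTIONS)
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in STATES}

phi = lambda s: s - 3                                   # -(distance to goal); zero at the terminal state
unshaped  = lambda s, a, s2: 0.0
naive     = lambda s, a, s2: 0.5 if s2 == 2 else 0.0    # flat "near the goal" bonus: NOT potential-based
potential = lambda s, a, s2: GAMMA * phi(s2) - phi(s)   # potential-based shaping

print(optimal_policy(unshaped))    # {0: 'R', 1: 'R', 2: 'R'}
print(optimal_policy(naive))       # {0: 'R', 1: 'R', 2: 'L'} -- loops to re-collect the bonus
print(optimal_policy(potential))   # {0: 'R', 1: 'R', 2: 'R'} -- optimal policy preserved
```

Note that $\Phi(s) = s - 3$ is the negative distance to the goal, so it vanishes at the terminal state.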
Failure Mode
The theorem assumes the discount factor $\gamma < 1$. In the undiscounted case ($\gamma = 1$), the telescoping argument fails because $\gamma^T \Phi(s_T)$ does not vanish. For episodic tasks with $\gamma = 1$, potential-based shaping requires $\Phi(s) = 0$ for all terminal states $s$.
The theorem also says nothing about learning speed. Potential-based shaping preserves the optimal policy but can make learning faster or slower depending on the choice of $\Phi$. A poor potential (e.g., one that points away from the goal) preserves optimality but misleads exploration.
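A quick sketch of the undiscounted pitfall, with hypothetical potentials: when $\gamma = 1$, the total shaping over an episode collapses to $\Phi(s_T) - \Phi(s_0)$, so a terminal state with nonzero potential earns a spurious bonus just for being the episode's endpoint.

```python
# gamma = 1: the per-episode shaping total is phi(s_T) - phi(s_0).
phi = {"start": 0.0, "goal_A": 0.0, "goal_B": 5.0}   # goal_B violates phi(terminal) = 0

def episode_shaping(path):
    """Total undiscounted shaping reward collected along a state path."""
    return sum(phi[path[t + 1]] - phi[path[t]] for t in range(len(path) - 1))

assert episode_shaping(["start", "goal_A"]) == 0.0   # compliant terminal: shaping cancels
assert episode_shaping(["start", "goal_B"]) == 5.0   # spurious +5 just for ending at goal_B
```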
Goodhart's Law for Reward Models
Statement
Let $\pi^*$ be the optimal policy under the true reward $R$ and let $\hat{\pi}$ be the optimal policy under the proxy reward $\hat{R}$. If $|R(s, a) - \hat{R}(s, a)| \le \epsilon$ for all $(s, a)$, then:

$$V^{\pi^*}(s) - V^{\hat{\pi}}(s) \le \frac{2\epsilon}{1 - \gamma}$$

where $V$ denotes the value function under the true reward $R$.
Intuition
The bound says that a proxy with per-step error $\epsilon$ can cause the agent to lose at most $\frac{2\epsilon}{1-\gamma}$ in true value. This seems reassuring until you realize that $\epsilon$ is the worst-case error. In practice, the proxy may be accurate on typical state-action pairs but systematically wrong on the unusual state-action pairs that an optimizer specifically seeks out. An agent that optimizes the proxy hard enough will find the states where the proxy disagrees most with the true reward, and exploit them. This is the formal version of Goodhart's law: a measure that becomes a target ceases to be a good measure.
Proof Sketch
The policy $\hat{\pi}$ evaluated under the true reward satisfies $V^{\hat{\pi}}(s) \ge \hat{V}^{\hat{\pi}}(s) - \frac{\epsilon}{1-\gamma}$ (the value under the proxy minus the cumulative approximation error). Since $\hat{\pi}$ is optimal under $\hat{R}$, $\hat{V}^{\hat{\pi}}(s) \ge \hat{V}^{\pi^*}(s) \ge V^{\pi^*}(s) - \frac{\epsilon}{1-\gamma}$. Combining gives $V^{\hat{\pi}}(s) \ge V^{\pi^*}(s) - \frac{2\epsilon}{1-\gamma}$.
Why It Matters
This provides a formal justification for the intuition that optimizing a proxy too hard is dangerous. The bound is proportional to $\frac{1}{1-\gamma}$: the longer the effective horizon, the more damage a misspecified reward can do. This is directly relevant to RLHF, where the reward model is a learned proxy. Over-optimizing against the reward model (high KL divergence from the base policy) can produce outputs that score well on the proxy but are low quality by human judgment.
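A two-line sketch of how the bound scales with the effective horizon $\frac{1}{1-\gamma}$:

```python
def goodhart_bound(eps, gamma):
    """Worst-case true-value loss when optimizing an eps-accurate proxy reward."""
    return 2 * eps / (1 - gamma)

# Same per-step proxy error, 10x longer effective horizon -> 10x larger worst case.
print(goodhart_bound(0.01, 0.90))   # 2*0.01/(1-0.90) = 0.2  (effective horizon 10)
print(goodhart_bound(0.01, 0.99))   # 2*0.01/(1-0.99) = 2.0  (effective horizon 100)
```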
Specification Gaming
Specification gaming occurs when an agent achieves high reward through unintended behavior. Documented examples include:
- CoastRunners (OpenAI): a boat racing agent learned to drive in circles collecting power-ups, catching fire, and crashing, because the score rewarded turbo boosts more than race completion.
- Block stacking (OpenAI): a robot hand trained to stack blocks learned to flip the bottom block on top of the gripped block (exploiting that "stacking" was measured by the relative height of the blocks).
- Lego grasping: a robot trained to grasp a Lego brick learned to slide it to the edge of the table where it could be pinched against the rim, rather than picking it up.
- Evolution simulations: organisms evolved to be tall by exploiting physics engine bugs that allowed them to jitter at high frequency and gain height.
In each case, the agent found the shortest path to high reward, which was not the path the designer intended. The reward function was technically correct (the agent did maximize it) but semantically wrong (the designer wanted something different).
The Alignment Connection
Reward misspecification in RL is a microcosm of the alignment problem in AI:
Outer alignment: does the reward function capture the designer's true objective? This is the reward design problem. The potential-based shaping theorem shows the constraints on safe reward modification. Specification gaming shows what happens when the reward is even slightly off.
Inner alignment: does the learned agent actually optimize the reward $R$, or has it learned a proxy objective during training that happens to correlate with $R$ on the training distribution? A neural network trained with RL may develop internal objectives that diverge from the reward signal in novel situations.
RLHF attempts to address outer alignment by learning the reward from human preferences rather than hand-specifying it. But the learned reward model is itself a proxy, subject to all the problems described above. The KL penalty in RLHF (constraining the policy to stay close to the base model) is a practical response to Goodhart's law: it limits how hard the agent can optimize the proxy.
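The effect of the KL penalty can be sketched in closed form. For a fixed prompt, the maximizer of $\mathbb{E}_{y \sim \pi}[\hat{r}(y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ over distributions is $\pi(y) \propto \pi_{\text{ref}}(y)\, e^{\hat{r}(y)/\beta}$. A toy example with made-up outputs and proxy scores:

```python
import math

def kl_optimal_policy(ref, reward, beta):
    """Closed-form maximizer of E[r] - beta*KL(pi || ref): pi(y) ∝ ref(y) * exp(r(y)/beta)."""
    w = {y: p * math.exp(reward[y] / beta) for y, p in ref.items()}
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}

ref    = {"a": 0.5, "b": 0.4, "c": 0.1}   # hypothetical base-model distribution
reward = {"a": 0.0, "b": 0.5, "c": 2.0}   # proxy scores; "c" is the reward-hacked output

weak   = kl_optimal_policy(ref, reward, beta=10.0)   # strong penalty: stays close to ref
strong = kl_optimal_policy(ref, reward, beta=0.1)    # weak penalty: mass collapses onto "c"
print(weak, strong)
```

Small $\beta$ lets the policy pile probability onto whatever output the proxy scores highest, however unrepresentative; large $\beta$ pins it near the base model.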
Common Confusions
Reward shaping is not the same as reward engineering
Reward shaping (technically) means adding a supplementary reward to the environment reward, where the goal is to speed up learning without changing the optimal policy. Reward engineering is the broader task of designing the reward function from scratch. The potential-based shaping theorem applies only to shaping, not to the initial design.
Potential-based shaping preserves the optimal policy but not the learning dynamics
Two MDPs with the same optimal policy can have very different learning curves. A good potential function (e.g., $\Phi(s) = -d(s, \text{goal})$, the negative distance to the goal) provides dense reward near the goal and speeds up learning. A bad potential function (e.g., $\Phi(s) = +d(s, \text{goal})$) provides anti-guidance and slows learning, even though the optimal policy is the same.
Goodhart's law is not about noise in the reward
Random noise in the reward signal averages out and has little effect on the optimal policy. Goodhart's law is about systematic bias: the proxy reward disagrees with the true reward in a structured way that an optimizer can exploit. The danger is not that the proxy is noisy, but that it is wrong in a direction the optimizer can push on.
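A simulation sketch of the distinction (the two actions and all numbers are invented): zero-mean noise on the proxy averages out over samples, while a small systematic bias on one action flips the optimizer's choice.

```python
import random
random.seed(0)

true_reward = {"honest": 1.0, "exploit": 0.9}

# Noisy proxy: unbiased noise with many samples leaves the ranking intact.
noisy_estimate = {a: sum(true_reward[a] + random.gauss(0, 0.5) for _ in range(20000)) / 20000
                  for a in true_reward}
assert max(noisy_estimate, key=noisy_estimate.get) == "honest"

# Biased proxy: a +0.2 systematic error on "exploit" flips the optimum.
biased_proxy = {"honest": 1.0, "exploit": 0.9 + 0.2}
assert max(biased_proxy, key=biased_proxy.get) == "exploit"
```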
Sparse reward is not always bad
Sparse rewards (e.g., +1 for reaching the goal, 0 otherwise) are hard for exploration but do not suffer from specification gaming. If you give +1 only for the exact intended outcome, the agent cannot hack the reward. The problem with sparse rewards is credit assignment and exploration, not misspecification. Dense rewards are easier to learn from but harder to specify correctly.
Key Takeaways
- The reward function is the specification of the task; misspecification leads to unintended behavior
- Potential-based shaping ($F(s, a, s') = \gamma \Phi(s') - \Phi(s)$) is the only form of reward modification that universally preserves the optimal policy
- Specification gaming is a predictable consequence of reward optimization, not a bug
- Goodhart's law: optimizing a proxy hard enough will exploit the proxy-true reward gap
- RLHF uses a learned reward model (proxy) with KL regularization to limit over-optimization
- Sparse rewards avoid specification gaming but create hard exploration problems
- The reward design problem is a concrete, mathematical version of the alignment problem
Exercises
Problem
A maze agent receives reward $+1$ for reaching the goal and $0$ otherwise. You want to add shaping to encourage the agent to move toward the goal. Propose a potential function $\Phi$ and verify that the resulting shaping function $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ gives positive reward for moving closer to the goal.
Problem
Consider a non-potential-based shaping function $F(s, a, s') = c > 0$ for all transitions (a constant bonus). Show that this can change the optimal policy by constructing a specific MDP where it does.
Problem
In RLHF, the objective is $\max_\pi \; \mathbb{E}_{y \sim \pi}[\hat{r}(x, y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, where $\hat{r}$ is the learned reward model and $\pi_{\text{ref}}$ is the reference policy. Explain why the KL term acts as a defense against Goodhart's law. What happens as $\beta \to 0$? As $\beta \to \infty$?
References
Canonical:
- Ng, Harada, Russell, "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping" (ICML 1999)
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 17 (Frontiers)
Specification Gaming:
- Krakovna et al., "Specification Gaming: the Flip Side of AI Ingenuity" (DeepMind Blog, 2020)
- Amodei et al., "Concrete Problems in AI Safety" (2016), Section 3
RLHF and Alignment:
- Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023)
- Ziegler et al., "Fine-Tuning Language Models from Human Preferences" (2019)
- Goodhart, C.A.E., "Problems of Monetary Management: The UK Experience" (1975), the original statement
Next Topics
- Reward hacking: deeper treatment of exploitation behaviors and mitigation strategies
- RLHF deep dive: how reward models are trained and deployed in language model alignment
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bellman Equations (Layer 2)