Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

RL Theory

Reinforcement Learning Environments and Benchmarks

The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.

Core · Tier 3 · Current · ~40 min

Why This Matters

An RL algorithm is only as meaningful as the environment it is tested on. The choice of benchmark determines what properties of the algorithm are measured: sample efficiency, generalization, continuous control, long-horizon planning, or exploration. Many RL papers report strong results on benchmarks that do not transfer to real applications. Understanding the standard environments and their limitations is necessary for interpreting RL results critically.

Mental Model

An RL environment provides the MDP that the agent interacts with: states, actions, transitions, and rewards. The Gymnasium API standardizes this interface. Different environment suites test different capabilities. No single benchmark tests all the properties that matter for real-world RL.

The Gymnasium API

Definition

Gymnasium Environment Interface

The standard RL environment interface (successor to OpenAI Gym) defines:

  • reset() -> (observation, info): initialize the environment, return the initial observation
  • step(action) -> (observation, reward, terminated, truncated, info): execute an action, return the next observation, reward, and termination signals

The terminated flag indicates the episode ended due to the task (goal reached or failure). The truncated flag indicates the episode was cut off by a time limit. This distinction matters for correct value function bootstrapping.
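The bootstrapping consequence can be sketched in a few lines of plain Python. The `td_target` helper and the discount value are illustrative, not part of any library:

```python
def td_target(reward, next_value, terminated, truncated, gamma=0.99):
    """Compute the one-step TD target.

    terminated=True: the MDP actually ended, so there is no future value.
    truncated=True (with terminated=False): the episode was cut off by a
    time limit, so we must still bootstrap from the next state's value.
    The truncated flag is accepted to mirror the step() interface.
    """
    if terminated:
        return reward                        # no future: V(terminal) = 0
    return reward + gamma * next_value       # bootstrap, even when truncated

# Goal reached: no bootstrapping.
print(td_target(1.0, 5.0, terminated=True, truncated=False))   # 1.0
# Time limit hit mid-task: bootstrap from V(s').
print(td_target(1.0, 5.0, terminated=False, truncated=True))   # 5.95
```

Treating truncation as termination would systematically underestimate the value of states near the time limit.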

All standard RL libraries (Stable Baselines3, CleanRL, RLlib) use this interface. Writing a custom environment means implementing reset() and step().
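A minimal custom environment can be sketched without depending on the gymnasium package itself. `CountdownEnv` below is a hypothetical toy task; a real implementation would subclass `gymnasium.Env` and declare `observation_space` and `action_space`:

```python
class CountdownEnv:
    """Toy environment: start at N, action 1 decrements, reach 0 to win."""

    def __init__(self, start=5, max_steps=20):
        self.start, self.max_steps = start, max_steps

    def reset(self):
        self.state, self.t = self.start, 0
        return self.state, {}                       # (observation, info)

    def step(self, action):
        self.t += 1
        self.state -= action                        # action in {0, 1}
        terminated = self.state <= 0                # task ended (goal reached)
        truncated = self.t >= self.max_steps        # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}

env = CountdownEnv()
obs, info = env.reset()
done, total = False, 0.0
while not done:
    obs, r, terminated, truncated, info = env.step(1)
    total += r
    done = terminated or truncated
print(total)  # 1.0: five decrements reach 0 before the time limit
```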

Classic Control Environments

CartPole: Balance a pole on a cart by applying left/right forces. State: $(x, \dot{x}, \theta, \dot{\theta})$. Discrete actions. Solved when the agent maintains balance for 500 steps. Solvable by simple methods (even linear policies). Useful only as a sanity check.

MountainCar: Drive an underpowered car up a hill. The agent must learn to build momentum by rocking back and forth. Sparse reward (only at the goal). Tests exploration in a minimal setting. Most algorithms solve this quickly once they discover the rocking strategy.

LunarLander: Land a spacecraft on a pad by firing thrusters. Continuous state, discrete or continuous actions. Denser reward signal (shaped by distance to pad and velocity). A reasonable first test for deep RL algorithms.

These environments have low-dimensional state spaces (4-8 dimensions) and short episodes. They test whether an algorithm works at all, not whether it scales.

Atari Suite

The Arcade Learning Environment (ALE) provides ~60 Atari 2600 games. The agent receives raw pixel observations (210x160 RGB) and discrete actions (up to 18).
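The standard DQN-era preprocessing of these observations (grayscale, downsampling to 84x84, stacking 4 frames so velocity is observable) can be sketched with NumPy alone. The nearest-neighbour resize below is a simplification of the interpolation-based resize used in practice:

```python
import numpy as np

def preprocess(frame):
    """Convert one 210x160x3 Atari frame to 84x84 grayscale (DQN-style).

    Grayscale via luminance weights, then a naive nearest-neighbour
    resize (real pipelines typically use area interpolation).
    """
    gray = frame @ np.array([0.299, 0.587, 0.114])       # (210, 160)
    rows = np.arange(84) * 210 // 84                     # subsample rows
    cols = np.arange(84) * 160 // 84                     # subsample cols
    return gray[np.ix_(rows, cols)].astype(np.float32)   # (84, 84)

def stack_frames(frames):
    """Stack the last 4 frames so the agent can infer motion."""
    return np.stack(frames[-4:], axis=0)                 # (4, 84, 84)

obs = np.random.randint(0, 256, size=(210, 160, 3))      # fake raw frame
state = stack_frames([preprocess(obs)] * 4)
print(state.shape)  # (4, 84, 84)
```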

Atari was the benchmark that established deep RL. DQN (2015) achieved human-level performance on many games by combining Q-learning with convolutional networks and experience replay.

Properties tested: visual perception, long-horizon credit assignment, exploration (Montezuma's Revenge is notoriously hard due to sparse rewards).

Watch Out

Human-level on Atari does not mean human-like intelligence

DQN surpassed human scores on many Atari games by exploiting patterns humans do not (e.g., finding and abusing bugs in game physics). On games requiring planning and exploration (Montezuma's Revenge, Pitfall), RL agents performed far below human level for years. Aggregate scores across all 60 games hide this variance.

MuJoCo for Continuous Control

MuJoCo (Multi-Joint Dynamics with Contact) provides physics-simulated robots with continuous state and action spaces. Standard tasks:

  • HalfCheetah, Ant, Humanoid: locomotion. Learn to walk or run by controlling joint torques.
  • Reacher: move a 2-link arm to a target position.
  • Hopper: single-leg hopping.

State dimensions range from ~10 (Reacher) to ~400 (Humanoid). Action dimensions range from 2 to 17.

MuJoCo tests continuous control, high-dimensional action spaces, and contact dynamics. It is the standard benchmark for policy gradient and actor-critic methods.
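A continuous-control policy in this setting is typically a diagonal Gaussian over joint torques. A minimal NumPy sketch, with illustrative state/action dimensions and an untrained linear mean:

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, action_dim = 17, 6    # illustrative sizes, not a specific task
W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # policy mean weights
log_std = np.zeros(action_dim)   # learnable per-dimension noise in practice

def sample_action(state):
    """Diagonal-Gaussian policy: mean from a linear map, then squash
    with tanh so actions stay in the typical [-1, 1] torque range."""
    mean = W @ state
    action = mean + np.exp(log_std) * rng.normal(size=action_dim)
    return np.tanh(action)

a = sample_action(rng.normal(size=state_dim))
print(a.shape)                    # (6,)
print(np.all(np.abs(a) <= 1.0))  # True: squashed into [-1, 1]
```

Actor-critic methods such as PPO or SAC replace the linear map with a neural network and learn `log_std` alongside it.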

ProcGen for Generalization

Definition

ProcGen

A suite of 16 procedurally generated game environments. Each environment generates a new level layout on every episode. The agent is evaluated on levels it has never seen during training, measuring generalization rather than memorization.

ProcGen was designed to address a critical weakness of Atari benchmarks: Atari levels are fixed, so an agent can memorize the optimal sequence of actions rather than learning a general policy. ProcGen forces the agent to learn features that transfer across level variations.
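The evaluation protocol can be illustrated with a toy stand-in for procedural generation: the "level" here is just a seed-derived number, and the "agent" is a deliberately memorizing lookup table:

```python
import numpy as np

def make_level(seed):
    """Toy 'procedural generation': the level is a target number
    deterministically derived from the seed."""
    return np.random.default_rng(seed).integers(0, 10)

train_seeds = range(0, 200)       # levels seen during training
test_seeds = range(200, 250)      # held-out levels, never trained on

# A 'memorizing' agent: a lookup table over training levels only.
memorized = {s: make_level(s) for s in train_seeds}
def agent(seed):
    return memorized.get(seed, 0)  # no general policy: guesses 0 elsewhere

train_acc = np.mean([agent(s) == make_level(s) for s in train_seeds])
test_acc = np.mean([agent(s) == make_level(s) for s in test_seeds])
print(train_acc)  # 1.0: perfect on seen levels
print(test_acc)   # low: memorization does not transfer
```

A fixed-level benchmark like Atari only ever measures the first number; ProcGen's train/test seed split measures the second.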

Proposition

Benchmark Specificity

Statement

For any RL algorithm $A$ and any benchmark suite $\mathcal{E}$, there exists an alternative algorithm $A'$ that performs worse on $\mathcal{E}$ but better on a different suite $\mathcal{E}'$ of equal size. Formally, no algorithm uniformly dominates all others across all possible MDPs. Performance on a benchmark suite is a measure of the alignment between the algorithm's inductive biases and the specific properties of the environments in that suite.

Intuition

Every algorithm makes implicit assumptions (about reward structure, dynamics, observation modality). A benchmark suite tests whether those assumptions hold. An algorithm that excels on visual Atari games (where CNNs help) may fail on continuous control (where state-based MLPs suffice). Benchmark performance measures fit, not general capability.

Proof Sketch

This follows from the No Free Lunch theorem for optimization (Wolpert 1996; Wolpert and Macready 1997). Across all possible reward functions, the average performance of any algorithm equals that of random search. This averaging assumes a uniform (or permutation-invariant) prior over reward functions; under a structured prior the theorem imposes no constraint. Restricting to a specific benchmark selects a non-uniform subset of MDPs, giving some algorithms an advantage over others.
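A toy illustration of the proposition (not a proof): two fixed-policy "algorithms" on two one-step bandit "suites" that reverse which arm pays out. All names here are hypothetical:

```python
def run(algorithm, reward_vector):
    """One pull, one reward: the simplest possible 'benchmark'."""
    return reward_vector[algorithm()]

alg_a = lambda: 0               # always pulls arm 0
alg_b = lambda: 1               # always pulls arm 1

suite_e = [1.0, 0.0]            # suite E: arm 0 pays out
suite_e_prime = [0.0, 1.0]      # suite E': arm 1 pays out

print(run(alg_a, suite_e), run(alg_a, suite_e_prime))  # 1.0 0.0
print(run(alg_b, suite_e), run(alg_b, suite_e_prime))  # 0.0 1.0
# Averaged over both suites, both algorithms score 0.5: neither dominates.
```

Each algorithm's "win" on one suite is exactly offset by its loss on the other; only the choice of suite decides which looks better.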

Why It Matters

This proposition is a corrective against overclaiming. When a paper reports state-of-the-art results on MuJoCo locomotion, that tells you something about continuous control with dense rewards and fixed dynamics, not about RL in general.

Failure Mode

The proposition is vacuous if the "alternative suite" is adversarially constructed. In practice, the point is more nuanced: algorithms that exploit specific benchmark properties (e.g., reward shaping, fixed environments, deterministic dynamics) often fail on problems without those properties.

The Sim-to-Real Gap

Training in simulation and deploying in the real world exposes systematic differences:

  1. Physics mismatch: Simulated friction, contact, and dynamics do not perfectly match reality. A policy that walks well in MuJoCo may fall immediately on a real robot.
  2. Observation mismatch: Rendered images differ from camera images (lighting, texture, noise).
  3. Action mismatch: Simulated actuators respond instantly; real motors have delays and backlash.

Domain randomization partially addresses this: randomize simulation parameters (friction, mass, visual appearance) during training so the policy is robust to the specific values encountered in reality. This works when the real-world parameters fall within the randomization range.
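The training-loop side of domain randomization can be sketched as resampling physical parameters every episode. The parameter names, ranges, and the `make_env` call are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    """Randomize physical parameters each training episode so the policy
    cannot overfit one simulator configuration (ranges are illustrative)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "mass_scale": rng.uniform(0.8, 1.2),
        "motor_delay_steps": int(rng.integers(0, 3)),
    }

for episode in range(3):
    params = sample_sim_params()
    # env = make_env(**params)   # hypothetical: rebuild the simulator
    # ... run one training episode in env ...
    print(params)
```

The hope is that a policy robust across the sampled range is also robust to the one real-world configuration, provided that configuration lies inside the range.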

System identification fits simulation parameters to match real-world data. This gives a more accurate simulator but is brittle to distribution shift.
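System identification can be sketched as fitting one simulator parameter to logged trajectories. The velocity-decay model and grid search below are illustrative stand-ins for a real simulator and optimizer:

```python
import numpy as np

def rollout(friction, v0=1.0, steps=50, dt=0.05):
    """Toy dynamics model: velocity decays as v' = v - friction * v * dt."""
    vs = [v0]
    for _ in range(steps):
        vs.append(vs[-1] * (1 - friction * dt))
    return np.array(vs)

true_friction = 0.7                 # unknown in practice
real_data = rollout(true_friction)  # stand-in for logged robot data

# System identification by grid search: pick the simulator parameter
# that best reproduces the observed trajectory.
candidates = np.linspace(0.1, 1.5, 141)
errors = [np.mean((rollout(f) - real_data) ** 2) for f in candidates]
best = candidates[int(np.argmin(errors))]
print(best)  # recovers ~0.7 on in-distribution data
```

The brittleness mentioned above shows up when the real system drifts (wear, temperature, payload changes): the fitted parameter is only valid for the data it was fit on.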

Neither approach fully closes the gap. Real-world RL remains expensive and sample-inefficient, which is why simulation-based benchmarks dominate the literature.

Why RL Benchmark Wins Are Often Misleading

Watch Out

Hyperparameter tuning on the test environment is overfitting

Many RL papers tune hyperparameters (learning rate, network architecture, reward scaling) extensively on the same environments they report results on. This is analogous to tuning on the test set in supervised learning. The resulting numbers overestimate the algorithm's generality.

Watch Out

Aggregate scores hide per-environment variance

Reporting median or mean human-normalized scores across Atari games compresses 60 diverse results into one number. An algorithm that scores 10,000% on easy games and 0% on hard games looks good in aggregate. Per-environment breakdowns reveal the actual strengths and weaknesses.
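The compression effect is easy to demonstrate with hypothetical human-normalized scores:

```python
import numpy as np

# Hypothetical human-normalized scores (%) for one algorithm on 6 games.
scores = {"Breakout": 10000, "Pong": 1800, "Enduro": 900,
          "Seaquest": 120, "MontezumaRevenge": 0, "Pitfall": 0}

vals = np.array(list(scores.values()), dtype=float)
print(np.mean(vals))    # mean is dominated by the easy-game outliers
print(np.median(vals))  # median hides the two total failures
print({g: s for g, s in scores.items() if s < 100})  # the real story
```

Both aggregates look "superhuman" (well above 100%) while the agent scores literally zero on the two hard-exploration games.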

Watch Out

Sample efficiency is benchmark-dependent

An algorithm that is sample-efficient on CartPole (simple dynamics, small state space) may be extremely sample-inefficient on a real robotics task. Sample efficiency numbers from one domain do not transfer to another. The constant factors and scaling behavior differ.

Key Takeaways

  • The Gymnasium API standardizes the environment interface with reset() and step()
  • Classic control (CartPole, MountainCar) is for sanity checks, not serious evaluation
  • Atari tests visual RL but allows memorization of fixed levels
  • MuJoCo tests continuous control with dense rewards
  • ProcGen tests generalization via procedural generation
  • No single benchmark tests all properties that matter for real-world RL
  • The sim-to-real gap means simulation results overestimate real-world performance
  • Aggregate benchmark scores hide per-environment variance and tuning-on-test effects

Exercises

ExerciseCore

Problem

A Gymnasium environment returns terminated=False, truncated=True at step 1000. Should the value function bootstrap from the next state, or should the return be computed as if the episode ended? Why does this distinction matter?

ExerciseAdvanced

Problem

An RL algorithm achieves a median human-normalized score of 200% on the Atari-57 benchmark. Explain why this number alone is insufficient to claim the algorithm is "superhuman" at playing Atari games. What additional information would you need?

References

Canonical:

  • Brockman et al., "OpenAI Gym" (arXiv 2016)
  • Bellemare et al., "The Arcade Learning Environment" (JAIR 2013)

Current:

  • Todorov, Erez, Tassa, "MuJoCo: A Physics Engine for Model-Based Control" (IROS 2012)
  • Cobbe et al., "Leveraging Procedural Generation to Benchmark Reinforcement Learning" (ProcGen, ICML 2020)

No Free Lunch:

  • Wolpert, "The Lack of A Priori Distinctions Between Learning Algorithms" (Neural Computation, 1996)
  • Wolpert and Macready, "No Free Lunch Theorems for Optimization" (IEEE Transactions on Evolutionary Computation, 1997)

Last reviewed: April 2026
