Applied ML
Reward Systems and Reinforcement Learning Neuroscience
Dopamine as TD error, distributional RL as a model of dopamine variability, actor-critic mapped to cortex-striatum, and the model-based versus model-free dichotomy in human choice.
Why This Matters
Reinforcement learning is one of the few places where machine learning and neuroscience genuinely converge on the same equations. The Schultz-Dayan-Montague paper (1997, Science 275(5306)) showed that midbrain dopamine neurons encode a temporal-difference error signal almost exactly matching the TD error that Sutton and Barto had proposed on engineering grounds. Two decades later, Dabney et al. (2020, Nature 577(7792)) showed that the heterogeneity of dopamine responses fits distributional RL better than scalar TD. The math the ML community uses to train agents keeps showing up in mouse and primate recordings.
For ML readers this matters in two directions. It is empirical evidence that TD-style algorithms are computationally natural enough that biology rediscovered them. It also gives an outside-view check on RL design choices: if a proposed architecture has no plausible biological substrate, that is not a fatal flaw, but if a biologically observed mechanism (distributional codes, model-based arbitration) consistently outperforms scalar alternatives in animals, it is a hint about which inductive biases pay off.
Core Ideas
Dopamine as TD error. Schultz and colleagues recorded ventral tegmental area neurons in monkeys during classical conditioning. Before learning, dopamine cells fired on reward delivery. After learning, they fired on the predictive cue and went silent on expected reward; an omitted expected reward caused a phasic dip. This three-way pattern (positive on unexpected reward, zero on expected reward, negative on omitted reward) is the signature of a reward prediction error, exactly the TD error δ = r + γV(s′) − V(s) from TD learning.
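The migration of the error signal from reward time to cue onset falls out of the TD update itself. A minimal sketch (the trial structure and parameters are ours, not from the paper):

```python
import numpy as np

# Hedged sketch: TD(0) on a 6-step trial. State 0 is the pre-cue baseline,
# the cue arrives on the transition into state 1, and reward 1.0 arrives on
# the final transition. Because cue timing is unpredictable, the baseline
# value is held at zero, mimicking an uninformative inter-trial interval.
T, alpha, gamma = 6, 0.1, 1.0
V = np.zeros(T + 1)                     # V[T] is the terminal state, fixed at 0
reward = np.zeros(T)
reward[T - 1] = 1.0

def trial(V, learn=True):
    """One trial; returns delta_t = r_t + gamma * V[t+1] - V[t] for each t."""
    deltas = np.zeros(T)
    for t in range(T):
        deltas[t] = reward[t] + gamma * V[t + 1] - V[t]
        if learn and t > 0:             # baseline state 0 is never updated
            V[t] += alpha * deltas[t]
    return deltas

before = trial(V.copy(), learn=False)   # naive values: error spikes at reward time
for _ in range(2000):
    trial(V)
after = trial(V.copy(), learn=False)    # trained: error has moved to cue onset
```

Before training the only nonzero error is at reward delivery; after training it sits entirely on the baseline-to-cue transition, and a fully predicted reward produces no error, matching the recordings.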
Distributional RL in the brain. Dabney et al. (2020) noted that dopamine cells differ in their reversal points: some respond positively to small rewards, others only to large ones. Interpreted as a population of value predictors with different optimism levels, the population encodes a distribution over returns rather than its mean. This is the same architecture as C51 and QR-DQN. The behavioral signature (asymmetric updating from positive versus negative surprises) shows up in both mice and the trained agents.
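The asymmetric-update reading can be sketched in a few lines (the toy reward distribution and all parameter names are ours):

```python
import numpy as np

# Hedged sketch of the expectile-style code: each "cell" i keeps a value
# estimate v[i] with its own asymmetry tau[i], scaling positive prediction
# errors by tau and negative ones by (1 - tau). Trained on reward samples,
# the population fans out across the return distribution rather than
# collapsing onto its mean.
rng = np.random.default_rng(0)
taus = np.linspace(0.1, 0.9, 9)         # optimism levels across the population
v = np.zeros_like(taus)
alpha = 0.02

for _ in range(20000):
    r = float(rng.choice([0.0, 1.0]))   # bimodal reward: 0 or 1, equally likely
    delta = r - v                       # per-cell prediction error
    v += alpha * np.where(delta > 0, taus, 1.0 - taus) * delta
```

For this 0/1 reward each v[i] settles near its own tau[i] (the tau-expectile), so pessimistic cells sit near 0, optimistic ones near 1, and the tau = 0.5 cell recovers the mean; reading reversal points across the population reconstructs the distribution.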
Actor-critic mapping. A common functional map: dorsolateral striatum implements the actor (action selection), ventral striatum and orbitofrontal cortex implement the critic (state value), midbrain dopamine carries the TD error that updates both. The cortex-basal-ganglia-thalamus loop closes this in vivo. The mapping is a simplification: parallel loops handle different action sets and timescales, and many circuits do not fit cleanly. As a first-order picture it is useful.
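A tabular version of the mapping fits in a few lines; the anatomical labels in the comments are the functional reading above, and the task (a two-armed bandit) is our toy choice:

```python
import numpy as np

# Hedged sketch: the critic V plays the ventral-striatum role, the actor's
# preferences H the dorsolateral-striatum role, and the single TD error
# delta stands in for the phasic dopamine signal that updates both.
rng = np.random.default_rng(1)
p_reward = np.array([0.2, 0.8])         # arm 1 pays off more often
H = np.zeros(2)                         # actor: action preferences
V = 0.0                                 # critic: value of the (single) state
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

for _ in range(3000):
    pi = softmax(H)
    a = int(rng.choice(2, p=pi))
    r = float(rng.random() < p_reward[a])
    delta = r - V                       # shared "dopamine" prediction error
    V += alpha_critic * delta           # critic update
    H += alpha_actor * delta * ((np.arange(2) == a) - pi)   # actor update
```

One broadcast error signal trains two different structures, which is the point of the mapping: dopamine does not need to know whether its target is learning values or policies.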
Model-based versus model-free in humans. Daw, Niv, and Dayan (2005, Nat. Neurosci. 8(12)) proposed that the brain runs both a model-based system (forward simulation through a learned transition model, slow but flexible) and a model-free system (cached values updated by TD, fast but inflexible), with arbitration weighted by relative uncertainty. The two-step task (Daw et al. 2011, Neuron 69) operationalized this and showed that human choices typically reflect a mixture, with the model-based weight rising under reflection and falling under stress, time pressure, or in some psychiatric conditions.
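The mixture can be sketched directly (the transition matrix, weight, and variable names are illustrative, not fit to data):

```python
import numpy as np

# Hedged sketch of the hybrid-choice idea (not the full two-step fitting
# procedure): a model-free cache Q_mf, a model-based value from one-step
# lookahead through a learned transition model, and a mixture weight w
# (fit per subject in the actual studies).
T_hat = np.array([[0.7, 0.3],           # learned P(second-stage state | action)
                  [0.3, 0.7]])
R_hat = np.array([1.0, 0.0])            # learned second-stage reward per state
Q_mf = np.array([0.7, 0.3])             # cached values from past TD updates
w = 0.6                                 # model-based weight

def net_values(R):
    Q_mb = T_hat @ R                    # forward simulation through the model
    return w * Q_mb + (1 - w) * Q_mf

before = net_values(R_hat)              # both systems agree: prefer action 0
after = net_values(np.array([0.0, 1.0]))   # reward flips: the model-based term
                                           # adapts at once, the cache lags
```

When the second-stage rewards flip, the model-based term shifts the net preference immediately even though the model-free cache is stale, which is exactly the behavioral signature the two-step task isolates.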
Common Confusions
Dopamine encodes pleasure or wanting. Older accounts treated dopamine as the brain's reward signal in a hedonic sense. The TD-error account is sharper: dopamine encodes the change in expected future reward, not reward itself. A fully predicted reward delivers no dopamine response even though it remains pleasant.
Model-based RL is just better than model-free. In animals and humans, neither system dominates. Model-free is computationally cheap at decision time and adequate on stationary tasks; model-based costs more per decision but wins when the environment changes or when an action's downstream consequences no longer match the cached value. Arbitration based on uncertainty is the empirically supported pattern, and it carries over to ML hybrids like Dyna.
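Dyna makes the hybrid concrete: real experience trains both a cached value function and a transition model, and planning replays the model to refresh the cache. A minimal Dyna-Q sketch on a toy chain (the environment and all parameters are ours):

```python
import numpy as np

# Hedged Dyna-Q sketch: each real step does a model-free Q update and
# records the transition; planning then replays stored transitions, so
# cheap simulated experience refreshes the cached values.
rng = np.random.default_rng(2)
n_states, n_actions = 5, 2              # chain; action 1 moves right, reward at the end
Q = np.zeros((n_states, n_actions))
model = {}                              # (s, a) -> (r, s_next), deterministic model
alpha, gamma, n_plan = 0.1, 0.9, 20

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_states - 1 else 0.0), s2

s = 0
for _ in range(2000):
    a = int(rng.integers(n_actions))    # random exploration; Q-learning is off-policy
    r, s2 = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # direct, "model-free" update
    model[(s, a)] = (r, s2)                                  # model learning
    for _ in range(n_plan):                                  # planning: replay the model
        (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == n_states - 1 else s2 # reset after reaching the reward
```

After training, Q prefers the rightward action in every non-terminal state; setting n_plan to 0 recovers plain Q-learning from the same code.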
References
- Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306).
- Dabney, W., et al. (2020). A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792).
- Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12).
- Daw, N. D., et al. (2011). Model-based influences on humans' choices and striatal prediction error signals. Neuron, 69(6).
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Temporal Difference Learning (Layer 2)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Value Iteration and Policy Iteration (Layer 2)
- Policy Gradient Theorem (Layer 3)