Dopamine & Reward

Not pleasure itself, but its surprise — the gap between what you got and what you expected.

The brief

Dopamine has become, in popular conversation, a folk synonym for pleasure. The folk usage gets the neuroscience exactly wrong. Dopamine does not encode pleasure; it encodes reward prediction error: the discrepancy between the reward you got and the reward you expected. When reality beats expectation, dopamine neurons fire a burst; when reality matches expectation, they stay at their baseline firing rate; when reality falls short, firing drops below baseline. The breakthrough came from Wolfram Schultz's experiments in the 1990s, which recorded from dopamine neurons while monkeys learned to associate a tone with a juice reward. The pattern Schultz found closely matched the temporal-difference learning rule that Richard Sutton and Andrew Barto had developed in artificial intelligence in the 1980s, derived independently from purely computational considerations. Reinforcement learning runs on essentially the same equation as the brain's reward system.
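The prediction error described above is usually written as the temporal-difference (TD) error. The notation below is the conventional one from the reinforcement-learning literature, not something defined in this text: r is the reward, V is the current value estimate for a state s, γ is a discount factor between 0 and 1, and α is a learning rate.

```latex
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```

A positive δ corresponds to "better than expected" (a burst), δ near zero to "as expected" (baseline firing), and a negative δ to "worse than expected" (a dip below baseline).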

Dopamine is a neurotransmitter synthesized from tyrosine via L-DOPA. The human brain has only about 400,000 dopamine neurons, concentrated in two midbrain nuclei: the ventral tegmental area (VTA) and the substantia nigra pars compacta. The mesolimbic pathway (VTA → nucleus accumbens) and the mesocortical pathway (VTA → prefrontal cortex) form the reward and motivation circuit; the nigrostriatal pathway (substantia nigra → dorsal striatum) is the motor learning circuit, whose degeneration produces Parkinson's.

The Schultz finding (1997): dopamine neurons fire phasically when an unpredicted reward is received. Through learning, the response shifts to the earliest reliable predictor (a tone, a light, the sight of food), and the neurons no longer fire at the reward itself. If the predicted reward is omitted, firing pauses at the time the reward should have arrived. The signal is therefore not pleasure but temporal-difference reward prediction error. Sutton and Barto's TD learning uses exactly this δ signal to update value estimates; the cortex-basal-ganglia circuit appears to implement an analogous algorithm in vivo.

Liking vs. wanting (Kent Berridge, 1996): dopamine drives wanting (motivation, the pulling-toward-things) but not liking (the hedonic enjoyment, which depends on opioid systems). Dopamine-depleted animals will starve in front of food they still find pleasurable. Drugs of abuse converge on dopamine: cocaine blocks reuptake, amphetamines block reuptake and also force dopamine release, opioids disinhibit VTA neurons, and alcohol increases their firing. The result is a supraphysiological signal that trains the brain to value the drug above all else. The wanting-liking dissociation explains the clinical paradox: addicts report no longer enjoying the drug but being unable to stop seeking it.

Parkinson's: the gradual death of nigrostriatal dopamine neurons produces tremor, rigidity, and bradykinesia; L-DOPA, the dopamine precursor, is the standard treatment.
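The shift of the dopamine response from reward to cue can be reproduced with a few lines of tabular TD(0) learning. What follows is a minimal sketch, not Schultz's analysis or any specific published model. It assumes a trial is a short chain of time steps after cue onset, that the reward arrives at the end of the chain, and that the pre-cue baseline has a value fixed at 0 because the cue arrives unpredictably. The function names and parameter values (`run_trial`, `train`, `alpha=0.1`, `gamma=1.0`) are illustrative choices.

```python
def run_trial(values, reward, alpha=0.1, gamma=1.0):
    """Run one cue -> delay -> outcome trial, updating `values` in place.

    values[t] is the learned value of the t-th time step after cue onset.
    Returns (delta_at_cue, delta_at_outcome).
    """
    n = len(values)
    # Cue onset: transition from the pre-cue baseline (value assumed to be 0,
    # since the cue is unpredictable) into the first cue state, with no reward.
    delta_cue = gamma * values[0] - 0.0
    delta = 0.0
    for t in range(n):
        next_value = values[t + 1] if t + 1 < n else 0.0  # trial ends after the last step
        r = reward if t == n - 1 else 0.0                 # reward only at the final step
        delta = r + gamma * next_value - values[t]        # TD error
        values[t] += alpha * delta                        # value update
    return delta_cue, delta


def train(n_trials=500, n_steps=5, reward=1.0):
    """Train on repeated rewarded trials; return the values and per-trial TD errors."""
    values = [0.0] * n_steps
    history = [run_trial(values, reward) for _ in range(n_trials)]
    return values, history


values, history = train()
first_cue, first_reward = history[0]
last_cue, last_reward = history[-1]
# One extra trial in which the predicted reward is withheld.
_, omitted = run_trial(values, reward=0.0)

print(f"first trial: cue={first_cue:.2f} reward={first_reward:.2f}")
print(f"last trial:  cue={last_cue:.2f} reward={last_reward:.2f}")
print(f"omission:    outcome={omitted:.2f}")
```

On the first trial the error sits entirely at the reward. After training it sits at the cue, the fully predicted reward produces almost no error, and withholding the reward produces a negative error at the moment it was due, which mirrors the three firing patterns described above.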

Why now

Reinforcement learning in AI runs an algorithm closely related to the one the brain appears to run. AlphaGo (2016) and AlphaStar (2019) were trained with reinforcement learning, and many game-playing systems use temporal-difference variants. Reinforcement learning from human feedback (RLHF) is a central technique for aligning Claude, ChatGPT, and Gemini with user preferences: a reward model is trained on human evaluations, and the language model is the policy optimized against it. This convergence, with neuroscience and AI implementing the same principle, is one of the most striking results of the past three decades. Behavioral addiction (gambling, social media, gaming) is increasingly understood as exploitation of dopaminergic learning: variable-ratio reinforcement schedules are particularly addictive because the next reward can never be fully predicted, so prediction error never settles to zero. Ketamine and psychedelics can produce rapid antidepressant effects, plausibly by resetting dysfunctional reward learning.
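The claim about unpredictable schedules can be illustrated with a toy learner. This sketch makes strong simplifying assumptions: each action is rewarded independently with a fixed probability (a random-ratio stand-in for a variable-ratio schedule), and the learner tracks a single value with a simple delta rule. It is an illustration of the arithmetic, not a model of addiction; `mean_late_error` and its parameters are hypothetical names chosen here.

```python
import random


def mean_late_error(p_reward, n_trials=2000, alpha=0.1, tail=500, seed=0):
    """Average absolute prediction error over the last `tail` trials.

    Each trial pays 1.0 with probability `p_reward`, otherwise 0.0.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    value = 0.0
    errors = []
    for _ in range(n_trials):
        reward = 1.0 if rng.random() < p_reward else 0.0
        delta = reward - value      # prediction error for this trial
        value += alpha * delta      # move the estimate toward the outcome
        errors.append(abs(delta))
    return sum(errors[-tail:]) / tail


for p in (1.0, 0.9, 0.5):
    print(f"p_reward={p}: mean |delta| late in learning = {mean_late_error(p):.3f}")
```

When the reward is certain, the error shrinks to nearly zero once the value is learned. When the reward is a coin flip, the error stays large indefinitely, because no value estimate can predict the next outcome.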