Predictive Coding and Autoencoders in the Brain
Hierarchical predictive coding (Rao-Ballard) and the free-energy principle as biological analogs of amortized variational inference and approximate backprop.
Why This Matters
Cortex is metabolically expensive and bandwidth-limited. Predictive coding proposes that, rather than transmitting raw activations upward, each cortical area sends only the residual its higher-area model failed to predict. Top-down projections carry predictions, bottom-up projections carry prediction errors, and learning reduces error. This single architectural commitment yields a sharp algorithmic story: perception is approximate inference, and learning is gradient descent on a free-energy bound.
For ML readers the link is direct. The Rao-Ballard hierarchy is structurally a stacked autoencoder with feedback connections. The free-energy principle is the variational ELBO with biological branding. Whittington and Bogacz showed that, under specific assumptions, predictive-coding updates approximate backprop arbitrarily well using only local Hebbian-like rules. Whether the cortex actually implements any of this is contested, but the math is shared with the generative models we already train.
Core Ideas
Rao and Ballard (1999, Nat. Neurosci. 2(1)). Each layer $l$ maintains a state $\mathbf{r}_l$ and predicts the layer below via a generative weight matrix $W_l$: $\hat{\mathbf{r}}_{l-1} = W_l f(\mathbf{r}_l)$. The prediction error is $\mathbf{e}_{l-1} = \mathbf{r}_{l-1} - \hat{\mathbf{r}}_{l-1}$. State updates and weights minimize a sum of squared errors weighted by precision (inverse variance) at each level:

$$E = \sum_l \frac{1}{\sigma_l^2} \big\| \mathbf{r}_{l-1} - W_l f(\mathbf{r}_l) \big\|^2$$
Their model trained on natural-image patches reproduced extra-classical receptive field effects (end-stopping, contextual modulation) without putting them in by hand.
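A minimal sketch of the inference-then-learning loop implied by these equations, for a single two-level model and one input patch. The dimensions, step sizes, and iteration counts below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: a 16-dim "image patch" explained by 8 latent causes.
d_x, d_r = 16, 8
W = rng.normal(scale=0.1, size=(d_x, d_r))   # generative weights: prediction = W @ r
x = rng.normal(size=d_x)                      # one input patch (stand-in for data)

sigma_x2, sigma_r2 = 1.0, 1.0                 # variances (inverse precisions) per level
eta_r, eta_W = 0.05, 0.01                     # step sizes for states and weights

# Inference: iteratively settle the latent state for this patch.
r = np.zeros(d_r)
for _ in range(100):
    e = x - W @ r                             # bottom-up prediction error
    # Descend the precision-weighted squared error, plus a Gaussian prior on r
    r += eta_r * (W.T @ e / sigma_x2 - r / sigma_r2)

# Learning: Hebbian-like update, prediction error times pre-synaptic activity.
e = x - W @ r
W += eta_W * np.outer(e, r) / sigma_x2

print("residual error norm:", np.linalg.norm(e))
```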
Free-energy principle (Friston, 2010, Nat. Rev. Neurosci. 11(2)). Reframes predictive coding as variational inference on a generative model $p(x, z)$. The brain maintains a recognition density $q(z)$ over hidden causes and minimizes the variational free energy $F = \mathbb{E}_{q(z)}[\log q(z) - \log p(x, z)]$, which is an upper bound on surprise $-\log p(x)$ and equivalent to the negative ELBO. Action also minimizes free energy, giving a unified account of perception, learning, and behavior. The unification is conceptually elegant; many specific predictions are difficult to falsify.
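Written out, this is the standard variational identity (observed data $x$, hidden causes $z$, generative model $p(x, z)$, recognition density $q(z)$), not anything unique to the biological framing:

```latex
F[q] \;=\; \mathbb{E}_{q(z)}\bigl[\log q(z) - \log p(x, z)\bigr]
     \;=\; \mathrm{KL}\bigl[q(z)\,\|\,p(z \mid x)\bigr] \;-\; \log p(x)
     \;\ge\; -\log p(x),
\qquad
F[q] \;=\; -\,\mathrm{ELBO}(q).
```

Minimizing $F$ therefore both tightens the bound on surprise $-\log p(x)$ and pulls $q$ toward the true posterior.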
Connection to amortized variational inference. A VAE encoder amortizes the cost of inferring $q(z \mid x)$ across data points. The Rao-Ballard hierarchy plays the same role with a fixed iterative inference procedure (a few steps of state updates per stimulus) rather than a learned encoder. Both schemes optimize the same free-energy objective; they differ in how inference is implemented.
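A schematic contrast of the two routes to the same objective, using a toy linear-Gaussian decoder; the encoder weights here are a random placeholder for what a VAE would actually learn:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 16, 8
W = rng.normal(scale=0.1, size=(d_x, d_z))     # shared generative (decoder) weights
x = rng.normal(size=d_x)                        # one observation

def free_energy_grad(z, x, W):
    # Gradient in z of the per-example objective ||x - W z||^2 / 2 + ||z||^2 / 2
    return -W.T @ (x - W @ z) + z

# Amortized inference (VAE-style): one learned forward pass maps x to the
# approximate posterior mean; a random linear "encoder" stands in for it here.
E = rng.normal(scale=0.1, size=(d_z, d_x))
z_amortized = E @ x

# Iterative inference (Rao-Ballard-style): settle z by gradient descent per stimulus.
z_iterative = np.zeros(d_z)
for _ in range(50):
    z_iterative -= 0.05 * free_energy_grad(z_iterative, x, W)

print("amortized z :", np.round(z_amortized, 2))
print("iterative z :", np.round(z_iterative, 2))
```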
Whittington and Bogacz (2017, Neural Comput. 29(5)). Construct a predictive-coding network in which top-down predictions and bottom-up errors evolve to a fixed point, then synaptic updates use only the locally available error and activity at each synapse. Under linear or near-linear regimes, the learned weight changes converge to backprop weight changes. This makes predictive coding the most concrete biologically plausible approximation to backprop currently on the table.
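A toy numerical check of that correspondence under simplifying assumptions (linear activations, unit variances, small weights, a target near the current output); this is a sketch of the idea, not the paper's exact construction. It relaxes the hidden state to its fixed point and compares the resulting local updates with the backprop gradients for one example:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out = 4, 6, 3

W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
W2 = rng.normal(scale=0.1, size=(d_out, d_hid))
x0 = rng.normal(size=d_in)                     # input (clamped)

y_ff = W2 @ (W1 @ x0)                          # feedforward output
t = y_ff + 0.01 * rng.normal(size=d_out)       # target near the current output

# Backprop update directions for the loss 0.5 * ||t - y||^2
delta2 = t - y_ff
delta1 = W2.T @ delta2
bp_dW2 = np.outer(delta2, W1 @ x0)
bp_dW1 = np.outer(delta1, x0)

# Predictive coding: clamp the output node to t, relax the hidden node x1
x1 = W1 @ x0                                   # start at the feedforward value
for _ in range(500):
    e1 = x1 - W1 @ x0                          # error node at the hidden layer
    e2 = t - W2 @ x1                           # error node at the clamped output
    x1 += 0.1 * (W2.T @ e2 - e1)               # descend the energy w.r.t. x1

# Local updates: each uses only the error and activity present at that synapse
pc_dW2 = np.outer(t - W2 @ x1, x1)
pc_dW1 = np.outer(x1 - W1 @ x0, x0)

# Both ratios should be well below 1: the local updates track the backprop gradients.
print("relative diff, W2 update:", np.linalg.norm(pc_dW2 - bp_dW2) / np.linalg.norm(bp_dW2))
print("relative diff, W1 update:", np.linalg.norm(pc_dW1 - bp_dW1) / np.linalg.norm(bp_dW1))
```

The point of the comparison is locality: pc_dW1 and pc_dW2 are computed only from the error and activity available at each connection, yet they land close to the gradients backprop would compute globally.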
Common Confusions
Free energy in the brain is the same quantity as physical free energy. It is not. Friston's free energy is the variational free energy from Bayesian statistics, which has the same algebraic form as the Helmholtz free energy from statistical mechanics but tracks belief states, not particle ensembles. Treat the name as a useful analogy, not a thermodynamic identity.
Predictive coding has been proven to be how cortex works. It has not. Some predictions match neurophysiology (precision-weighted error responses, expectation-modulated activity), but several core claims (separate error and prediction populations, hierarchical organization of generative models) lack clean experimental confirmation. The framework is a candidate, not a settled account.
References
- Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1).
- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2).
- Whittington, J. C. R., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5).
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Variational Autoencoders (Layer 3)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- KL Divergence (Layer 1)
- Information Theory Foundations (Layer 0B)