Beta. Content is under active construction and has not been peer-reviewed.

Foundations

Markov Chains and Steady State

Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.


Why This Matters

Markov chains are the probabilistic backbone of MCMC sampling, reinforcement learning, and web search. Every Metropolis-Hastings sampler is a Markov chain designed to converge to a target distribution. Every MDP under a fixed policy reduces to a Markov chain. PageRank computes the stationary distribution of a Markov chain on the web graph.

Mental Model

A system has states. At each time step, it jumps to a new state with probabilities that depend only on the current state, not on how it got there. This memorylessness is the Markov property. The central question: if you run the chain long enough, does the fraction of time spent in each state converge to a fixed distribution?

Formal Setup and Notation

Definition

Markov chain

A sequence of random variables $X_0, X_1, X_2, \ldots$ taking values in a state space $\mathcal{S}$ is a Markov chain if for all $n$ and all states $s_0, \ldots, s_{n+1}$:

$$P(X_{n+1} = s_{n+1} \mid X_n = s_n, \ldots, X_0 = s_0) = P(X_{n+1} = s_{n+1} \mid X_n = s_n)$$

The future is conditionally independent of the past given the present.

Definition

Transition matrix

For a finite state space $\mathcal{S} = \{1, \ldots, N\}$, the transition matrix $P \in \mathbb{R}^{N \times N}$ has entries:

$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$

Each row sums to 1: $\sum_j P_{ij} = 1$. Such a matrix is called stochastic. The $n$-step transition probabilities are given by $P^n$.
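These properties are easy to check numerically. A minimal sketch (the 3-state matrix below is illustrative, not from the text): rows sum to 1, and powers of $P$ give the $n$-step transition probabilities.

```python
import numpy as np

# Illustrative 3-state transition matrix; rows index the current state
# and must each sum to 1 (row-stochastic).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)

# Entry (i, j) of P^n is the n-step probability P(X_n = j | X_0 = i).
P5 = np.linalg.matrix_power(P, 5)
assert np.allclose(P5.sum(axis=1), 1.0)  # powers of a stochastic matrix stay stochastic
print(P5[0])  # distribution after 5 steps, starting from state 0
```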

Definition

Stationary distribution

A distribution $\pi$ over $\mathcal{S}$ is stationary for $P$ if:

$$\pi = \pi P$$

That is, $\pi$ is a left eigenvector of $P$ with eigenvalue 1. If the chain starts in distribution $\pi$, it stays in $\pi$ forever.
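In practice $\pi$ can be computed exactly this way: a left eigenvector of $P$ for eigenvalue 1 is a right eigenvector of $P^\top$. A minimal sketch with an illustrative matrix:

```python
import numpy as np

P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# pi = pi P says pi is a left eigenvector of P with eigenvalue 1,
# i.e. a right eigenvector of P.T.
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))   # locate the eigenvalue 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                  # normalize to a probability distribution

assert np.allclose(pi @ P, pi)      # stationarity: pi = pi P
print(pi)
```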

Definition

Irreducibility

A Markov chain is irreducible if every state can be reached from every other state in a finite number of steps. Formally, for all $i, j$, there exists $n > 0$ such that $P^n_{ij} > 0$.

Definition

Aperiodicity

State $i$ has period $d_i = \gcd\{n \geq 1 : P^n_{ii} > 0\}$. A chain is aperiodic if every state has period 1. A self-loop ($P_{ii} > 0$ for some state) guarantees aperiodicity for irreducible chains.

Definition

Detailed balance (reversibility)

A distribution $\pi$ satisfies detailed balance with respect to $P$ if:

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j$$

Detailed balance implies stationarity (sum both sides over $i$). MCMC methods construct chains satisfying detailed balance with respect to a target distribution.
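A classic reversible example is the random walk on an undirected weighted graph, where $\pi_i$ proportional to the weighted degree $d_i$ satisfies detailed balance, since $\pi_i P_{ij} = w_{ij} / \sum_k d_k$ is symmetric in $(i, j)$. A numerical sketch with illustrative weights:

```python
import numpy as np

# Random walk on an undirected weighted graph: from state i, move to j
# with probability P_ij = w_ij / d_i, where d_i = sum_j w_ij.
W = np.array([               # illustrative symmetric weights
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
])
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()             # pi_i proportional to the weighted degree

flow = pi[:, None] * P       # flow[i, j] = pi_i * P_ij
assert np.allclose(flow, flow.T)   # detailed balance
assert np.allclose(pi @ P, pi)     # hence stationarity
```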

Main Theorems

Theorem

Ergodic Theorem for Markov Chains

Statement

If a Markov chain on a finite state space is irreducible and aperiodic, then it has a unique stationary distribution $\pi$ and for any initial state $i$:

$$\lim_{n \to \infty} P^n_{ij} = \pi_j \quad \text{for all } j$$

Moreover, the time average converges: for any function $f$ on $\mathcal{S}$,

$$\frac{1}{T} \sum_{t=0}^{T-1} f(X_t) \xrightarrow{a.s.} \sum_j \pi_j f(j)$$

Intuition

Run the chain long enough and the distribution over states converges to π\pi regardless of where you started. Time averages converge to ensemble averages. This is why MCMC works: you sample a trajectory and average over it to approximate expectations under the target distribution.
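The time-average statement can be checked by simulating a single long trajectory and comparing empirical state frequencies with the exact stationary distribution (illustrative chain; $\pi$ computed from the left eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Simulate one long trajectory and record the fraction of time in each state.
T = 200_000
x = 0
counts = np.zeros(3)
for _ in range(T):
    counts[x] += 1
    x = rng.choice(3, p=P[x])   # one Markov step from the current state
empirical = counts / T

# Exact stationary distribution for comparison.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

print(empirical, pi)   # the two should agree closely
```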

Proof Sketch

The Perron-Frobenius theorem guarantees that a positive stochastic matrix (which $P^n$ becomes for large enough $n$ under irreducibility and aperiodicity) has a unique left eigenvector with eigenvalue 1, and all other eigenvalues have modulus strictly less than 1. The powers $P^n$ therefore converge to the matrix with all rows equal to $\pi$.
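The convergence of $P^n$ to a rank-one matrix whose every row is $\pi$ can be seen directly by taking powers; a sketch with an illustrative chain:

```python
import numpy as np

P = np.array([               # illustrative irreducible, aperiodic chain
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# For a primitive stochastic matrix, P^n converges to the rank-one
# matrix whose rows all equal the stationary distribution pi.
Pn = np.linalg.matrix_power(P, 60)
assert np.allclose(Pn, Pn[0])      # all rows are (nearly) identical

pi = Pn[0]
assert np.allclose(pi @ P, pi)     # the common row is stationary
print(pi)
```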

Why It Matters

This theorem is the theoretical foundation of MCMC. It guarantees that Metropolis-Hastings and Gibbs sampling converge to the correct distribution. It also justifies PageRank: the stationary distribution of a random walk on the web graph is the importance vector.

Failure Mode

If the chain is reducible (disconnected components), there are multiple stationary distributions. If it is periodic (e.g., a bipartite graph with period 2), the chain oscillates and never converges to a single distribution, though the time-average still converges. The ergodic theorem says nothing about how fast convergence happens. That requires mixing time analysis.
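The periodic failure mode is easy to exhibit with the deterministic two-state flip chain: it has a stationary distribution, but its $n$-step distributions oscillate forever while the time average still converges. A sketch:

```python
import numpy as np

# Deterministic flip between two states: period 2.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

pi = np.array([0.5, 0.5])
assert np.allclose(pi @ P, pi)     # pi is stationary...

# ...but the chain never converges to it: P^n alternates between
# the identity (n even) and P itself (n odd).
assert np.allclose(np.linalg.matrix_power(P, 2), np.eye(2))
assert np.allclose(np.linalg.matrix_power(P, 3), P)

# The time average of the state distributions still converges to pi.
dist = np.array([1.0, 0.0])        # start deterministically in state 0
total = np.zeros(2)
for _ in range(1000):
    total += dist
    dist = dist @ P
print(total / 1000)                # exactly [0.5, 0.5] over an even horizon
```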

Theorem

Perron-Frobenius for Stochastic Matrices

Statement

For an irreducible, aperiodic stochastic matrix $P$: $P$ has eigenvalue 1 with algebraic multiplicity 1, and all other eigenvalues $\lambda$ satisfy $|\lambda| < 1$. The left eigenvector corresponding to eigenvalue 1, normalized to sum to 1, is the unique stationary distribution $\pi$.

Intuition

The largest eigenvalue controls long-term behavior. Since it is 1, the chain neither grows nor shrinks in total probability. The gap $1 - |\lambda_2|$ between the largest and second-largest eigenvalue magnitude controls the rate of convergence.

Proof Sketch

Apply the Perron-Frobenius theorem for nonnegative matrices. Irreducibility of the chain implies that $P$ is irreducible as a nonnegative matrix. Aperiodicity plus irreducibility implies primitivity, which gives $|\lambda_2| < 1$.

Why It Matters

The spectral gap $1 - |\lambda_2|$ is the single most important quantity for Markov chain convergence speed. A large spectral gap means fast mixing. This connects Markov chain theory to spectral graph theory.

Failure Mode

For countably infinite state spaces, eigenvalue 1 may be in the continuous spectrum and the chain can be null-recurrent (stationary distribution does not exist even though the chain is irreducible). This does not happen for finite chains.

Mixing Time

Definition

Mixing time

The mixing time is:

$$t_{\text{mix}}(\epsilon) = \min\{t : \max_i \| P^t(i, \cdot) - \pi \|_{\text{TV}} \leq \epsilon\}$$

where $\| \cdot \|_{\text{TV}}$ is the total variation distance. Conventionally, $t_{\text{mix}} = t_{\text{mix}}(1/4)$.

The spectral gap bounds mixing time:

$$\frac{1}{1 - |\lambda_2|} \cdot \log\frac{1}{2\epsilon} \leq t_{\text{mix}}(\epsilon) \leq \frac{1}{1 - |\lambda_2|} \cdot \log\frac{N}{\epsilon}$$

where $N$ is the number of states. Small spectral gap implies slow mixing. Random walks on expander graphs have large spectral gaps and mix in $O(\log N)$ steps.
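For a small chain, $t_{\text{mix}}(1/4)$ can be found by brute force: compute the worst-case total variation distance at each step and stop when it drops to $1/4$. A sketch with an illustrative matrix:

```python
import numpy as np

P = np.array([               # illustrative irreducible, aperiodic chain
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Stationary distribution via the left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

def worst_case_tv(t):
    """max_i || P^t(i, .) - pi ||_TV, where TV = half the L1 distance."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()

# t_mix(1/4): first t at which the worst-case TV distance is <= 1/4.
t_mix = 1
while worst_case_tv(t_mix) > 0.25:
    t_mix += 1

gap = 1.0 - sorted(np.abs(vals))[-2]   # spectral gap 1 - |lambda_2|
print(t_mix, gap)
```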

Canonical Examples

Example

Two-state chain

States $\{0, 1\}$ with transition matrix:

$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

The stationary distribution is $\pi = (\beta/(\alpha+\beta), \; \alpha/(\alpha+\beta))$. The second eigenvalue is $1 - \alpha - \beta$, so the spectral gap is $\alpha + \beta$. When $\alpha$ and $\beta$ are both small, the chain mixes slowly because it stays in each state for a long time.
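These closed-form results can be verified numerically for particular flip probabilities (the values below are illustrative):

```python
import numpy as np

alpha, beta = 0.1, 0.3           # illustrative flip probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Stationary distribution (beta, alpha) / (alpha + beta).
pi = np.array([beta, alpha]) / (alpha + beta)
assert np.allclose(pi @ P, pi)

# Eigenvalues are 1 and 1 - alpha - beta, so the gap is alpha + beta.
vals = np.sort(np.real(np.linalg.eigvals(P)))
assert np.isclose(vals[0], 1 - alpha - beta)
assert np.isclose(vals[1], 1.0)
```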

Common Confusions

Watch Out

Stationary does not mean converges to

A chain can have a stationary distribution without converging to it. A periodic chain has a stationary distribution but oscillates forever. Convergence requires both irreducibility and aperiodicity.

Watch Out

Detailed balance is sufficient, not necessary

Many chains have stationary distributions without satisfying detailed balance. Detailed balance is a convenient sufficient condition used in MCMC design, but non-reversible chains (which violate detailed balance) can mix faster.

Watch Out

Markov property is about conditional independence, not memorylessness of states

A Markov chain can revisit states and have complex long-run behavior. The Markov property only says that the transition probability depends on the current state alone. The chain absolutely "remembers" where it has been in the sense that past states influence the current state; it just does not use that memory for the next transition.

Summary

  • Markov property: the future depends only on the present state
  • Transition matrix $P$ is row-stochastic; $\pi = \pi P$ defines the stationary distribution
  • Irreducibility + aperiodicity = unique stationary distribution and convergence (ergodic theorem)
  • Spectral gap $1 - |\lambda_2|$ controls mixing speed
  • Detailed balance is the design principle behind MCMC samplers
  • An MDP under a fixed policy reduces to a Markov chain on the state space

Exercises

ExerciseCore

Problem

Compute the stationary distribution of the transition matrix:

$$P = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$$

ExerciseAdvanced

Problem

A random walk on a cycle graph $C_n$ (move left or right with probability $1/2$ each) is periodic with period 2 when $n$ is even. How would you modify the chain to make it aperiodic without changing the stationary distribution?

References

Canonical:

  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2009), Chapters 1-4, 12
  • Norris, Markov Chains (1997), Chapters 1-3

Current:

  • Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues (2020), Part I

  • Munkres, Topology (2000), Chapter 1 (set theory review)

Next Topics

From Markov chains, the natural continuations are:

  • Metropolis-Hastings: designing Markov chains with a target stationary distribution
  • Markov decision processes: adding actions and rewards to Markov chains
  • PageRank algorithm: computing the stationary distribution of the web graph

Last reviewed: April 2026
