
LLM Construction

Chain-of-Thought and Reasoning

Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.

Advanced · Tier 2 · Frontier · ~50 min

Why This Matters

Standard prompting asks a model to produce an answer directly. For simple factual questions, this works. For multi-step reasoning (arithmetic, logic, planning), direct prompting often fails. Chain-of-thought (CoT) prompting, which builds on prompt engineering and in-context learning, inserts intermediate reasoning steps between the question and the answer. This simple change dramatically improves performance on tasks requiring composition, arithmetic, or logical deduction.

CoT is also the foundation of test-time compute scaling: instead of making the model larger, you let it "think longer" at inference time. This idea has become central to modern LLM deployment.

Chain-of-Thought Prompting

Definition

Chain-of-Thought Prompting

A prompting strategy where the model is instructed (or given few-shot examples) to produce intermediate reasoning steps before the final answer. The prompt includes examples of the form: question, step-by-step reasoning, answer.

Wei et al. (2022) showed that providing few-shot exemplars with reasoning chains improved accuracy on GSM8K (math word problems) from ~18% to ~57% for PaLM 540B. (The zero-shot variant, which simply appends "Let's think step by step" to the prompt, is due to Kojima et al., 2022.)
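A few-shot CoT prompt in this style can be sketched as follows. The exemplar is the well-known tennis-ball example; treat the exact wording, and the helper name `build_cot_prompt`, as illustrative rather than taken from the paper:

```python
# A minimal few-shot chain-of-thought prompt in the style of Wei et al. (2022).
# The exemplar wording below is illustrative, not quoted verbatim from the paper.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

prompt = build_cot_prompt("A farm has 3 pens with 4 sheep each. How many sheep?")
```

Because the prompt ends at "A:", the model's continuation is conditioned on the exemplar's reasoning format, so it tends to emit its own chain of steps before the final answer.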

Why it works (mechanistic view). Each intermediate token conditions the next generation step. Without CoT, the model must compute the answer in a single forward pass, which means the "reasoning" is limited to the depth of the network. With CoT, the model can perform serial computation across multiple token-generation steps. Each step is a forward pass, so k reasoning tokens give the model k additional "layers" of serial computation.

Proposition

CoT as Problem Decomposition

Statement

If a task T can be decomposed into k subtasks T_1, \ldots, T_k where each T_i is solvable by the model in a single forward pass, then CoT prompting that elicits solutions to T_1, \ldots, T_k sequentially enables the model to solve T by conditioning each step on the outputs of previous steps. Without CoT, the model must solve the composed task in a single forward pass, which may exceed its computational capacity.

Intuition

A transformer has fixed depth. Each forward pass computes a function of bounded complexity. Multi-step problems exceed this complexity. CoT converts a deep computation into a sequence of shallow computations, where each step's output is written to the context window and read by subsequent steps. The context window acts as an external memory or scratchpad.
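The scratchpad picture can be sketched with ordinary function composition. The lambdas below are toy stand-ins for reasoning steps the model can perform in one pass:

```python
# Illustrative sketch: a "deep" composed task solved as a sequence of shallow
# steps. Each step reads the previous intermediate result from a scratchpad
# (the context window) instead of computing the whole composition at once.
def solve_with_scratchpad(subtasks, x):
    """Apply each single-step-solvable subtask in order, recording intermediates."""
    scratchpad = [x]
    for step in subtasks:
        scratchpad.append(step(scratchpad[-1]))  # condition on the last result
    return scratchpad[-1], scratchpad

# Toy subtasks standing in for individual reasoning steps.
steps = [lambda v: v + 3, lambda v: v * 2, lambda v: v - 1]
answer, trace = solve_with_scratchpad(steps, 5)  # ((5 + 3) * 2) - 1 = 15
```

Each entry appended to `trace` plays the role of a reasoning token: later steps never recompute earlier work, they just read it back from the scratchpad.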

Proof Sketch

Feng et al. (2023) formalized this: constant-depth transformers cannot solve problems requiring \omega(1) sequential computation steps (e.g., composition of k functions). However, with k intermediate generation steps, each step is a constant-depth computation, and the full chain composes k such steps. This is analogous to the difference between \text{TC}^0 (constant-depth circuits) and \text{P} (polynomial-time computation).

Why It Matters

This explains why CoT helps specifically on compositional tasks (multi-step math, multi-hop reasoning) but provides little benefit on tasks that are already within single-step capacity (sentiment classification, factual recall). The benefit scales with the number of reasoning steps required.

Failure Mode

CoT only helps if the model can execute each individual step correctly. If a single step requires knowledge or reasoning beyond the model's capacity, the chain fails. Errors also compound: an error in step i propagates to all subsequent steps. For k steps with per-step error probability p, the probability of a fully correct chain is approximately (1-p)^k.
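The compounding effect is easy to check numerically (a trivial sketch, assuming independent per-step errors):

```python
# Probability that a k-step chain is fully correct when each step independently
# succeeds with probability 1 - p (p = per-step error probability).
def chain_success_prob(p_error: float, k: int) -> float:
    return (1.0 - p_error) ** k

# Even a 5% per-step error rate degrades quickly as chains get longer.
print(round(chain_success_prob(0.05, 10), 3))  # 0.599
print(round(chain_success_prob(0.05, 30), 3))  # 0.215
```

A 95%-reliable step sounds strong, yet a 30-step chain built from such steps is fully correct only about one time in five.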

Self-Consistency

Proposition

Self-Consistency Improves Over Single CoT

Statement

Sample m independent chain-of-thought reasoning paths from the model using temperature \tau > 0. Extract the final answer from each chain and take the majority vote. If each chain has probability p > 1/2 of reaching the correct answer and chains are approximately independent, the majority vote accuracy approaches 1 as m \to \infty at a rate given by Hoeffding's inequality (a standard concentration inequality):

P(\text{majority incorrect}) \leq \exp(-2m(p - 1/2)^2)

Intuition

Different reasoning paths may make different errors. If errors are uncorrelated, the majority vote cancels them. This is an ensemble method applied to reasoning chains rather than models.

Proof Sketch

Each chain produces a Bernoulli random variable: correct with probability p, incorrect with probability 1-p. The majority vote is correct if more than m/2 chains are correct. By Hoeffding's inequality applied to the sum of these Bernoulli variables, the probability of the majority being wrong decays exponentially in m.
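The argument can be sketched numerically by comparing the exact majority-vote failure probability (a binomial tail) against the Hoeffding bound, under the idealized independence assumption:

```python
import math

# Exact probability that the majority of m independent chains is wrong, when
# each chain is correct with probability p, versus the Hoeffding upper bound.
def majority_wrong_exact(p: float, m: int) -> float:
    """P(at most floor(m/2) of m Bernoulli(p) chains are correct); ties count as failures."""
    return sum(math.comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m // 2 + 1))

def majority_wrong_hoeffding(p: float, m: int) -> float:
    return math.exp(-2 * m * (p - 0.5) ** 2)

p, m = 0.6, 41
print(majority_wrong_exact(p, m))      # well below the single-chain error of 0.4
print(majority_wrong_hoeffding(p, m))  # a looser bound, still decaying in m
```

The Hoeffding bound is loose for small m, but both quantities shrink exponentially as m grows, which is the content of the proposition.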

Why It Matters

Wang et al. (2023) showed self-consistency (SC) improves GSM8K accuracy from ~57% (single CoT) to ~74% with just 40 samples. The cost is linear in m (you run m forward passes), but no retraining is needed. SC is the simplest form of test-time compute scaling.

Failure Mode

The assumption p > 1/2 is critical. If the model is more likely to reach a wrong answer than the right one, majority vote amplifies the error. Also, chains are not truly independent (they share the same model weights and prompt), so the effective m is smaller than the nominal sample count. Correlated errors reduce the benefit of voting.

Tree of Thought

Tree of thought (ToT) generalizes CoT from a single linear chain to a tree of reasoning paths. At each step, the model generates multiple candidate continuations, evaluates them (using the model itself as an evaluator), and explores the most promising branches.

Definition

Tree of Thought

A reasoning framework where the model explores multiple reasoning paths organized as a tree. At each node, the model generates b candidate next steps. A value function (often the model itself prompted to evaluate) scores each candidate. Search proceeds via breadth-first search, depth-first search, or beam search over the tree.

ToT explicitly trades more inference compute for better answers. For tasks requiring planning or backtracking (e.g., Game of 24, crossword puzzles), ToT significantly outperforms linear CoT because linear CoT commits to a single path and cannot recover from early mistakes.
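The beam-search variant of ToT can be sketched as follows. In Yao et al. (2023) both the generator and the value function are LLM calls; here they are placeholder callables (`propose`, `score`) so the search logic itself is runnable:

```python
# Skeleton of tree-of-thought beam search. `propose` and `score` stand in for
# LLM calls (thought generation and self-evaluation) in the actual method.
def tot_beam_search(root, propose, score, depth, branch, beam):
    """Keep the `beam` highest-scoring partial reasoning paths at each depth."""
    frontier = [root]
    for _ in range(depth):
        candidates = [
            state
            for node in frontier
            for state in propose(node)[:branch]  # b candidate next steps
        ]
        # The value function prunes to the most promising branches, which is
        # what lets the search recover from locally bad continuations.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)

# Toy instance: states are numbers; a "thought" adds 1 or 3; the value
# function rewards being close to a target of 10.
best = tot_beam_search(
    root=0,
    propose=lambda s: [s + 1, s + 3],
    score=lambda s: -abs(10 - s),
    depth=4, branch=2, beam=3,
)  # finds 0 -> 3 -> 6 -> 9 -> 10
```

A single greedy chain that starts with +1 cannot backtrack; keeping a beam of three partial paths lets the search discard that branch once its score falls behind.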

Connection to Test-Time Compute

CoT, self-consistency, and ToT are all instances of a broader principle: inference-time scaling. Instead of making the model bigger (which increases both training and inference cost), you spend more compute at inference time.

The key insight: for many tasks, doubling inference compute (via more CoT steps, more samples, or tree search) provides larger gains than doubling model size. This is because reasoning tasks often require serial depth, which the model lacks, rather than more parameters.

OpenAI's o1 and o3 models (2024-2025) formalized this into a training paradigm: train the model to use chain-of-thought reasoning internally and scale the number of reasoning tokens at inference time. The result is a model that improves on hard problems by "thinking longer," with performance scaling as a function of inference compute.

Common Confusions

Watch Out

CoT is not just showing work

The intermediate steps are not cosmetic. They provide genuine computational benefit by allowing the model to store and condition on intermediate results. A model forced to produce the answer token immediately after the question cannot access this serial computation pathway.

Watch Out

CoT does not make the model smarter

CoT does not add new knowledge or capabilities. It allows the model to use existing capabilities more effectively on multi-step problems. If the model cannot perform a single reasoning step correctly, CoT will not fix it.

Watch Out

Self-consistency requires the task to have a unique answer

Majority voting assumes answers can be compared for equality. For open-ended generation (summarization, creative writing), there is no single correct answer to vote on. Self-consistency is specific to tasks with verifiable discrete answers.

Exercises

ExerciseCore

Problem

A model answers a math problem correctly with probability p = 0.6 using a single CoT chain. If you sample m = 15 independent chains and take the majority vote, what is the upper bound on the probability of an incorrect majority answer (using Hoeffding's bound)?

ExerciseAdvanced

Problem

Explain why a constant-depth transformer cannot compute the composition of k arbitrary functions f_k \circ \cdots \circ f_1(x) for unbounded k, but a transformer with CoT (producing k intermediate tokens) can. What computational complexity class corresponds to each case?

References

Canonical:

  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (NeurIPS 2022)
  • Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023)

Current:

  • Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (NeurIPS 2023)
  • Feng et al., "Towards Revealing the Mystery behind Chain of Thought" (NeurIPS 2023)

Last reviewed: April 2026
