LLM Construction
Chain-of-Thought and Reasoning
Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.
Why This Matters
Standard prompting asks a model to produce an answer directly. For simple factual questions, this works. For multi-step reasoning (arithmetic, logic, planning), direct prompting often fails. Chain-of-thought (CoT) prompting, which builds on prompt engineering and in-context learning techniques, inserts intermediate reasoning steps between the question and the answer. This simple change dramatically improves performance on tasks requiring composition, arithmetic, or logical deduction.
CoT is also the foundation of test-time compute scaling: instead of making the model larger, you let it "think longer" at inference time. This idea has become central to modern LLM deployment.
Chain-of-Thought Prompting
Chain-of-Thought Prompting
A prompting strategy where the model is instructed (or given few-shot examples) to produce intermediate reasoning steps before the final answer. The prompt includes examples of the form: question, step-by-step reasoning, answer.
Wei et al. (2022) showed that providing few-shot exemplars with reasoning chains improved accuracy on GSM8K (math word problems) from ~18% to ~57% for PaLM 540B; Kojima et al. (2022) showed that even the zero-shot prompt "Let's think step by step" elicits similar reasoning without exemplars.
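As a concrete illustration, a few-shot CoT prompt is just structured text. The sketch below assembles exemplars in the (question, step-by-step reasoning, answer) format described above; the model call itself is omitted, and the helper name `build_cot_prompt` is ours, not from any library:

```python
# Sketch of a few-shot chain-of-thought prompt (model call omitted).
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
                    "How many balls does he have now?",
        "reasoning": "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question: str) -> str:
    """Format exemplars as (question, step-by-step reasoning, answer) triples."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}.")
    # Zero-shot variant: drop the exemplars and keep only this trigger phrase.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_cot_prompt("A train travels 60 km in 1.5 hours. What is its average speed?")
```

The trailing "Let's think step by step." cues the model to emit reasoning tokens before committing to an answer.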
Why it works (mechanistic view). Each intermediate token conditions the next generation step. Without CoT, the model must compute the answer in a single forward pass, which means the "reasoning" is limited to the depth of the network. With CoT, the model can perform serial computation across multiple token-generation steps. Each step is a forward pass, so reasoning tokens give the model additional "layers" of serial computation.
CoT as Problem Decomposition
Statement
If a task can be decomposed into subtasks $s_1, \dots, s_k$, each solvable by the model in a single forward pass, then CoT prompting that elicits solutions to $s_1, \dots, s_k$ sequentially enables the model to solve the full task by conditioning each step on the outputs of previous steps. Without CoT, the model must solve the composed task in a single forward pass, which may exceed its computational capacity.
Intuition
A transformer has fixed depth. Each forward pass computes a function of bounded complexity. Multi-step problems can exceed this bound. CoT converts a deep computation into a sequence of shallow computations, where each step's output is written to the context window and read by subsequent steps. The context window acts as an external memory or scratchpad.
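The scratchpad idea can be mimicked with a toy "model" that is only allowed to apply one function per call; writing each intermediate result to a scratchpad lets a sequence of shallow calls compute a deep composition. This is a hypothetical illustration of the decomposition, not a transformer:

```python
# Toy illustration: each "forward pass" can apply only ONE function,
# but a scratchpad of intermediate results composes many of them.
def shallow_step(fn, value):
    """Stands in for one bounded-depth forward pass."""
    return fn(value)

def run_with_scratchpad(fns, x):
    scratchpad = [x]  # plays the role of the context window
    for fn in fns:
        # Each step reads the latest intermediate result and appends a new one.
        scratchpad.append(shallow_step(fn, scratchpad[-1]))
    return scratchpad

# Compose f(x) = ((x + 3) * 2) - 5 as three shallow steps.
steps = [lambda v: v + 3, lambda v: v * 2, lambda v: v - 5]
trace = run_with_scratchpad(steps, 10)
# trace == [10, 13, 26, 21]; the final entry is the composed result.
```

No single call computes the full composition; depth comes entirely from the sequence of steps, mirroring how reasoning tokens add serial computation.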
Proof Sketch
Feng et al. (2023) formalized this: constant-depth transformers cannot solve problems requiring $n$ sequential computation steps (e.g., composition of $n$ functions). However, with intermediate generation steps, each step is a constant-depth computation, and the full chain composes $n$ such steps. This is analogous to the difference between $\mathsf{TC}^0$ (constant-depth circuits) and $\mathsf{P}$ (polynomial-time computation).
Why It Matters
This explains why CoT helps specifically on compositional tasks (multi-step math, multi-hop reasoning) but provides little benefit on tasks that are already within single-step capacity (sentiment classification, factual recall). The benefit scales with the number of reasoning steps required.
Failure Mode
CoT only helps if the model can execute each individual step correctly. If a single step requires knowledge or reasoning beyond the model's capacity, the chain fails. Errors also compound: an error in step $i$ propagates to all subsequent steps. For a chain of $k$ steps with per-step error probability $\epsilon$, the probability of a fully correct chain is approximately $(1 - \epsilon)^k$.
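Plugging in illustrative numbers shows how quickly per-step errors compound ($\epsilon$ and $k$ below are made-up values, not measurements):

```python
# With per-step error probability eps, a k-step chain is fully correct
# with probability (1 - eps)**k, assuming independent step errors.
eps, k = 0.05, 10
p_chain_correct = (1 - eps) ** k
print(round(p_chain_correct, 3))  # ~0.599: a 5% per-step error rate
                                  # already roughly halves 10-step reliability
```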
Self-Consistency
Self-Consistency Improves Over Single CoT
Statement
Sample $k$ independent chain-of-thought reasoning paths from the model using temperature $T > 0$. Extract the final answer from each chain and take the majority vote. If each chain has probability $p > 1/2$ of reaching the correct answer and chains are approximately independent, the majority-vote accuracy approaches 1 as $k \to \infty$ at a rate given by Hoeffding's inequality (a standard concentration inequality):

$$\Pr[\text{majority vote incorrect}] \le \exp\!\left(-2k\left(p - \tfrac{1}{2}\right)^2\right)$$
Intuition
Different reasoning paths may make different errors. If errors are uncorrelated, the majority vote cancels them. This is an ensemble method applied to reasoning chains rather than models.
Proof Sketch
Each chain produces a Bernoulli random variable: correct with probability $p$, incorrect with probability $1 - p$. The majority vote is correct if more than $k/2$ chains are correct. By Hoeffding's inequality applied to the sum of these Bernoulli variables, the probability of the majority being wrong decays exponentially in $k$.
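A quick Monte Carlo check (with a made-up per-chain accuracy $p$) compares the empirical majority-vote error against the Hoeffding bound $\exp(-2k(p - 1/2)^2)$:

```python
import math
import random

def majority_vote_error(p, k, trials=20000, seed=0):
    """Empirical probability that a majority of k independent chains is wrong."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        correct_chains = sum(rng.random() < p for _ in range(k))
        if correct_chains <= k // 2:  # no strict majority (k odd)
            wrong += 1
    return wrong / trials

p, k = 0.6, 41  # illustrative values: 60% per-chain accuracy, 41 chains
empirical = majority_vote_error(p, k)
hoeffding = math.exp(-2 * k * (p - 0.5) ** 2)
# Empirical error lands around 0.1, far below the single-chain error of 0.4
# and under the (loose) Hoeffding value of ~0.44.
```

Voting turns a 60%-accurate chain into a ~90%-accurate ensemble here, at the cost of 41× the inference compute.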
Why It Matters
Wang et al. (2023) showed self-consistency (SC) improves GSM8K accuracy from ~57% (single CoT) to ~74% with just 40 samples. The cost is linear in $k$ (you run $k$ forward passes), but no retraining is needed. SC is the simplest form of test-time compute scaling.
Failure Mode
The assumption $p > 1/2$ is critical. If the model is more likely to reach a wrong answer than the right one, majority vote amplifies the error. Also, chains are not truly independent (they share the same model weights and prompt), so the effective $k$ is smaller than the nominal sample count. Correlated errors reduce the benefit of voting.
Tree of Thought
Tree of thought (ToT) generalizes CoT from a single linear chain to a tree of reasoning paths. At each step, the model generates multiple candidate continuations, evaluates them (using the model itself as an evaluator), and explores the most promising branches.
Tree of Thought
A reasoning framework where the model explores multiple reasoning paths organized as a tree. At each node, the model generates candidate next steps. A value function (often the model itself prompted to evaluate) scores each candidate. Search proceeds via breadth-first search, depth-first search, or beam search over the tree.
ToT explicitly trades more inference compute for better answers. For tasks requiring planning or backtracking (e.g., Game of 24, crossword puzzles), ToT significantly outperforms linear CoT because linear CoT commits to a single path and cannot recover from early mistakes.
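The ToT loop reduces to generate, score, prune. The minimal beam-search sketch below uses a toy numeric puzzle (reach a target from a start value with +1 and ×2 moves) in place of the model's propose and evaluate calls; in real ToT, both `propose` and `score` would be LLM calls:

```python
# Minimal tree-of-thought-style beam search on a toy puzzle:
# reach `target` from `start` using the moves +1 and *2.
def propose(state):
    value, path = state
    return [(value + 1, path + ["+1"]), (value * 2, path + ["*2"])]

def score(state, target):
    # Heuristic value function: closer to the target is better.
    return -abs(state[0] - target)

def tot_beam_search(start, target, beam_width=3, max_depth=10):
    beam = [(start, [])]
    for _ in range(max_depth):
        candidates = [c for s in beam for c in propose(s)]
        for value, path in candidates:
            if value == target:
                return path
        # Keep only the most promising branches; pruning is what lets ToT
        # abandon unpromising paths instead of committing like linear CoT.
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        beam = candidates[:beam_width]
    return None

path = tot_beam_search(start=3, target=14)  # a move sequence such as *2, +1, *2
```

Swapping the sort-and-truncate for a stack or queue turns this into depth-first or breadth-first search over the same tree.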
Connection to Test-Time Compute
CoT, self-consistency, and ToT are all instances of a broader principle: inference-time scaling. Instead of making the model bigger (which increases both training and inference cost), you spend more compute at inference time.
The key insight: for many tasks, doubling inference compute (via more CoT steps, more samples, or tree search) provides larger gains than doubling model size. This is because reasoning tasks often require serial depth, which a fixed-depth forward pass lacks, rather than more parameters.
OpenAI's o1 and o3 models (2024-2025) formalized this into a training paradigm: train the model to use chain-of-thought reasoning internally and scale the number of reasoning tokens at inference time. The result is a model that improves on hard problems by "thinking longer," with performance scaling as a function of inference compute.
Common Confusions
CoT is not just showing work
The intermediate steps are not cosmetic. They provide genuine computational benefit by allowing the model to store and condition on intermediate results. A model forced to produce the answer token immediately after the question cannot access this serial computation pathway.
CoT does not make the model smarter
CoT does not add new knowledge or capabilities. It allows the model to use existing capabilities more effectively on multi-step problems. If the model cannot perform a single reasoning step correctly, CoT will not fix it.
Self-consistency requires the task to have a unique answer
Majority voting assumes answers can be compared for equality. For open-ended generation (summarization, creative writing), there is no single correct answer to vote on. Self-consistency is specific to tasks with verifiable discrete answers.
Exercises
Problem
A model answers a math problem correctly with probability $p > 1/2$ using a single CoT chain. If you sample $k$ independent chains and take the majority vote, what is the upper bound on the probability of an incorrect majority answer (using Hoeffding's bound)?
Problem
Explain why a constant-depth transformer cannot compute the composition $f_n \circ f_{n-1} \circ \dots \circ f_1$ of arbitrary functions for unbounded $n$, but a transformer with CoT (producing $n$ intermediate tokens) can. What computational complexity class corresponds to each case?
References
Canonical:
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (NeurIPS 2022)
- Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023)
Current:
- Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (NeurIPS 2023)
- Feng et al., "Towards Revealing the Mystery behind Chain of Thought" (NeurIPS 2023)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Prompt Engineering and In-Context Learning (Layer 5)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
Builds on This
- Tool-Augmented Reasoning (Layer 5)