LLM Construction
Chain-of-Thought and Reasoning
Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.
Why This Matters
Standard prompting asks a model to produce an answer directly. For simple factual questions, this works. For multi-step reasoning (arithmetic, logic, planning), direct prompting often fails. Chain-of-thought (CoT) prompting, which builds on prompt engineering and in-context learning techniques, inserts intermediate reasoning steps between the question and the answer. This simple change dramatically improves performance on tasks requiring composition, arithmetic, or logical deduction.
CoT is also the foundation of test-time compute scaling: instead of making the model larger, you let it "think longer" at inference time. This idea has become central to modern LLM deployment.
Chain-of-Thought Prompting
Chain-of-Thought Prompting
A prompting strategy where the model is instructed (or given few-shot examples) to produce intermediate reasoning steps before the final answer. The prompt includes examples of the form: question, step-by-step reasoning, answer.
Wei et al. (2022) showed that providing few-shot exemplars with reasoning chains improved accuracy on GSM8K (math word problems) from ~18% to ~57% for PaLM 540B; Kojima et al. (2022) showed that even the zero-shot prompt "Let's think step by step" elicits similar reasoning without exemplars.
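As a concrete illustration, a few-shot CoT prompt is just structured text. The sketch below assembles exemplars in the (question, step-by-step reasoning, answer) format described above; the model call itself is omitted, and the helper name `build_cot_prompt` is ours, not from any library:

```python
# Sketch of a few-shot chain-of-thought prompt (model call omitted).
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
                    "How many balls does he have now?",
        "reasoning": "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question: str) -> str:
    """Format exemplars as (question, step-by-step reasoning, answer) triples."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}.")
    # Zero-shot variant: drop the exemplars and keep only this trigger phrase.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_cot_prompt("A train travels 60 km in 1.5 hours. What is its average speed?")
```

The trailing "Let's think step by step." cues the model to emit reasoning tokens before committing to an answer.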
Why it works (mechanistic view). Each intermediate token conditions the next generation step. Without CoT, the model must compute the answer in a single forward pass, which means the "reasoning" is limited to the depth of the network. With CoT, the model can perform serial computation across multiple token-generation steps. Each step is a forward pass, so reasoning tokens give the model additional "layers" of serial computation.
CoT as Problem Decomposition
Statement
If a task can be decomposed into subtasks $s_1, \dots, s_k$, each solvable by the model in a single forward pass, then CoT prompting that elicits solutions to $s_1, \dots, s_k$ sequentially enables the model to solve the full task by conditioning each step on the outputs of previous steps. Without CoT, the model must solve the composed task in a single forward pass, which may exceed its computational capacity.
Intuition
A transformer has fixed depth. Each forward pass computes a function of bounded complexity. Multi-step problems can exceed this bound. CoT converts a deep computation into a sequence of shallow computations, where each step's output is written to the context window and read by subsequent steps. The context window acts as an external memory or scratchpad.
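The scratchpad idea can be mimicked with a toy "model" that is only allowed to apply one function per call; writing each intermediate result to a scratchpad lets a sequence of shallow calls compute a deep composition. This is a hypothetical illustration of the decomposition, not a transformer:

```python
# Toy illustration: each "forward pass" can apply only ONE function,
# but a scratchpad of intermediate results composes many of them.
def shallow_step(fn, value):
    """Stands in for one bounded-depth forward pass."""
    return fn(value)

def run_with_scratchpad(fns, x):
    scratchpad = [x]  # plays the role of the context window
    for fn in fns:
        # Each step reads the latest intermediate result and appends a new one.
        scratchpad.append(shallow_step(fn, scratchpad[-1]))
    return scratchpad

# Compose f(x) = ((x + 3) * 2) - 5 as three shallow steps.
steps = [lambda v: v + 3, lambda v: v * 2, lambda v: v - 5]
trace = run_with_scratchpad(steps, 10)
# trace == [10, 13, 26, 21]; the final entry is the composed result.
```

No single call computes the full composition; depth comes entirely from the sequence of steps, mirroring how reasoning tokens add serial computation.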
Proof Sketch
Feng et al. (2023) formalized this: constant-depth transformers cannot solve problems requiring $n$ sequential computation steps (e.g., composition of $n$ functions). However, with intermediate generation steps, each step is a constant-depth computation, and the full chain composes $n$ such steps. This is analogous to the difference between $\mathsf{TC}^0$ (constant-depth circuits) and $\mathsf{P}$ (polynomial-time computation).
Why It Matters
This explains why CoT helps specifically on compositional tasks (multi-step math, multi-hop reasoning) but provides little benefit on tasks that are already within single-step capacity (sentiment classification, factual recall). The benefit scales with the number of reasoning steps required.
Failure Mode
CoT only helps if the model can execute each individual step correctly. If a single step requires knowledge or reasoning beyond the model's capacity, the chain fails. Errors also compound: an error in step $i$ propagates to all subsequent steps. For a chain of $k$ steps with per-step error probability $\epsilon$, the probability of a fully correct chain is approximately $(1 - \epsilon)^k$.
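Plugging in illustrative numbers shows how quickly per-step errors compound ($\epsilon$ and $k$ below are made-up values, not measurements):

```python
# With per-step error probability eps, a k-step chain is fully correct
# with probability (1 - eps)**k, assuming independent step errors.
eps, k = 0.05, 10
p_chain_correct = (1 - eps) ** k
print(round(p_chain_correct, 3))  # ~0.599: a 5% per-step error rate
                                  # already roughly halves 10-step reliability
```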
Self-Consistency
Self-Consistency Improves Over Single CoT
Statement
Sample $k$ independent chain-of-thought reasoning paths from the model using temperature $T > 0$. Extract the final answer from each chain and take the majority vote. If each chain has probability $p > 1/2$ of reaching the correct answer and chains are approximately independent, the majority-vote accuracy approaches 1 as $k \to \infty$ at a rate given by Hoeffding's inequality (a standard concentration inequality):

$$\Pr[\text{majority vote incorrect}] \le \exp\!\left(-2k\left(p - \tfrac{1}{2}\right)^2\right)$$
Intuition
Different reasoning paths may make different errors. If errors are uncorrelated, the majority vote cancels them. This is an ensemble method applied to reasoning chains rather than models.
Proof Sketch
Each chain produces a Bernoulli random variable: correct with probability $p$, incorrect with probability $1 - p$. The majority vote is correct if more than $k/2$ chains are correct. By Hoeffding's inequality applied to the sum of these Bernoulli variables, the probability of the majority being wrong decays exponentially in $k$.
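A quick Monte Carlo check (with a made-up per-chain accuracy $p$) compares the empirical majority-vote error against the Hoeffding bound $\exp(-2k(p - 1/2)^2)$:

```python
import math
import random

def majority_vote_error(p, k, trials=20000, seed=0):
    """Empirical probability that a majority of k independent chains is wrong."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        correct_chains = sum(rng.random() < p for _ in range(k))
        if correct_chains <= k // 2:  # no strict majority (k odd)
            wrong += 1
    return wrong / trials

p, k = 0.6, 41  # illustrative values: 60% per-chain accuracy, 41 chains
empirical = majority_vote_error(p, k)
hoeffding = math.exp(-2 * k * (p - 0.5) ** 2)
# Empirical error lands around 0.1, far below the single-chain error of 0.4
# and under the (loose) Hoeffding value of ~0.44.
```

Voting turns a 60%-accurate chain into a ~90%-accurate ensemble here, at the cost of 41× the inference compute.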
Why It Matters
Wang et al. (2023) showed self-consistency (SC) improves GSM8K accuracy from ~57% (single CoT) to ~74% with just 40 samples. The cost is linear in $k$ (you run $k$ forward passes), but no retraining is needed. SC is the simplest form of test-time compute scaling.
Failure Mode
The assumption $p > 1/2$ is critical. If the model is more likely to reach a wrong answer than the right one, majority vote amplifies the error. Also, chains are not truly independent (they share the same model weights and prompt), so the effective $k$ is smaller than the nominal sample count. Correlated errors reduce the benefit of voting.
Tree of Thought
Tree of thought (ToT) generalizes CoT from a single linear chain to a tree of reasoning paths. At each step, the model generates multiple candidate continuations, evaluates them (using the model itself as an evaluator), and explores the most promising branches.
Tree of Thought
A reasoning framework where the model explores multiple reasoning paths organized as a tree. At each node, the model generates candidate next steps. A value function (often the model itself prompted to evaluate) scores each candidate. Search proceeds via breadth-first search, depth-first search, or beam search over the tree.
ToT explicitly trades more inference compute for better answers. For tasks requiring planning or backtracking (e.g., Game of 24, crossword puzzles), ToT significantly outperforms linear CoT because linear CoT commits to a single path and cannot recover from early mistakes.
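The ToT loop reduces to generate, score, prune. The minimal beam-search sketch below uses a toy numeric puzzle (reach a target from a start value with +1 and ×2 moves) in place of the model's propose and evaluate calls; in real ToT, both `propose` and `score` would be LLM calls:

```python
# Minimal tree-of-thought-style beam search on a toy puzzle:
# reach `target` from `start` using the moves +1 and *2.
def propose(state):
    value, path = state
    return [(value + 1, path + ["+1"]), (value * 2, path + ["*2"])]

def score(state, target):
    # Heuristic value function: closer to the target is better.
    return -abs(state[0] - target)

def tot_beam_search(start, target, beam_width=3, max_depth=10):
    beam = [(start, [])]
    for _ in range(max_depth):
        candidates = [c for s in beam for c in propose(s)]
        for value, path in candidates:
            if value == target:
                return path
        # Keep only the most promising branches; pruning is what lets ToT
        # abandon unpromising paths instead of committing like linear CoT.
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        beam = candidates[:beam_width]
    return None

path = tot_beam_search(start=3, target=14)  # a move sequence such as *2, +1, *2
```

Swapping the sort-and-truncate for a stack or queue turns this into depth-first or breadth-first search over the same tree.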
Connection to Test-Time Compute
CoT, self-consistency, and ToT are all instances of a broader principle: inference-time scaling. Instead of making the model bigger (which increases both training and inference cost), you spend more compute at inference time.
The key insight: for many tasks, doubling inference compute (via more CoT steps, more samples, or tree search) provides larger gains than doubling model size. This is because reasoning tasks often require serial depth, which a fixed-depth forward pass lacks, rather than more parameters.
OpenAI's o1 and o3 models (2024-2025) formalized this into a training paradigm: train the model to use chain-of-thought reasoning internally and scale the number of reasoning tokens at inference time. The result is a model that improves on hard problems by "thinking longer," with performance scaling as a function of inference compute.
Common Confusions
CoT is not just showing work
The intermediate steps are not cosmetic. They provide genuine computational benefit by allowing the model to store and condition on intermediate results. A model forced to produce the answer token immediately after the question cannot access this serial computation pathway.
CoT does not make the model smarter
CoT does not add new knowledge or capabilities. It allows the model to use existing capabilities more effectively on multi-step problems. If the model cannot perform a single reasoning step correctly, CoT will not fix it.
Self-consistency requires the task to have a unique answer
Majority voting assumes answers can be compared for equality. For open-ended generation (summarization, creative writing), there is no single correct answer to vote on. Self-consistency is specific to tasks with verifiable discrete answers.
Exercises
Problem
A model answers a math problem correctly with probability $p > 1/2$ using a single CoT chain. If you sample $k$ independent chains and take the majority vote, what is the upper bound on the probability of an incorrect majority answer (using Hoeffding's bound)?
Problem
Explain why a constant-depth transformer cannot compute the composition $f_n \circ f_{n-1} \circ \dots \circ f_1$ of arbitrary functions for unbounded $n$, but a transformer with CoT (producing $n$ intermediate tokens) can. What computational complexity class corresponds to each case?
References
Canonical:
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (NeurIPS 2022)
- Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023)
Current:
- Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (NeurIPS 2023)
- Feng et al., "Towards Revealing the Mystery behind Chain of Thought" (NeurIPS 2023)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Prompt Engineering and In-Context Learning (Layer 5)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
Builds on This
- Tool-Augmented Reasoning (Layer 5)