
LLM Construction

Tool-Augmented Reasoning

LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.


Why This Matters

LLMs are unreliable at arithmetic, symbolic manipulation, and factual recall. A 70B parameter model asked to multiply two 5-digit numbers will frequently produce the wrong answer. But a model that calls a calculator gets it right every time. Tool-augmented reasoning is the practical strategy for converting unreliable verbal computation into reliable verified computation.

The core observation: LLMs are good at deciding what to compute but bad at doing the computation. Tools handle the computation. The model handles the planning.

Mental Model

Think of the LLM as a planner that decomposes a problem into steps, some of which require external execution. When the model encounters a subproblem it cannot solve reliably (arithmetic, database lookup, code execution), it emits a tool call, receives the result, and continues reasoning with that result as given. The model's job shifts from being an oracle to being a dispatcher.

Formal Setup

Let $M$ denote a language model and $T = \{t_1, \ldots, t_k\}$ a set of available tools. Each tool $t_i$ takes a text input and returns a text output. A tool-augmented generation produces a sequence of interleaved reasoning tokens and tool calls:

$$r_1,\ \texttt{[CALL } t_i(a_1)\texttt{]} \to o_1,\ r_2,\ \texttt{[CALL } t_j(a_2)\texttt{]} \to o_2,\ \ldots,\ r_n$$

where $r_i$ are reasoning segments, $a_i$ are tool arguments, and $o_i$ are tool outputs inserted back into the context.

Definition

Tool-Augmented Language Model

A tool-augmented LM is a pair $(M, T)$ where $M$ is a language model trained or prompted to emit structured tool calls, and $T$ is a set of external functions the model can invoke during generation. The model generates tokens autoregressively but may pause generation, call a tool, receive output, and resume generation conditioned on the tool output.
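The pause-call-resume loop in this definition can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `fake_model` stands in for $M$ and is scripted, the `[CALL tool(arg)]` marker syntax and the `run_with_tools` dispatcher are invented for the example, and the calculator is a toy.

```python
import re

# Toy tool set T: a single calculator. The eval is restricted for the sketch;
# a real system would use a proper sandbox.
TOOLS = {
    "calc": lambda arg: str(eval(arg, {"__builtins__": {}})),
}

CALL_PATTERN = re.compile(r"\[CALL (\w+)\((.*?)\)\]")

def run_with_tools(model_step, prompt, max_rounds=5):
    """Alternate model generation with tool execution until no call remains."""
    context = prompt
    for _ in range(max_rounds):
        segment = model_step(context)      # model generates until done or a call
        context += segment
        match = CALL_PATTERN.search(segment)
        if match is None:                  # no tool call: generation finished
            return context
        name, arg = match.groups()
        output = TOOLS[name](arg)          # execute the tool t_i(a_i)
        context += f" -> {output}\n"       # insert o_i back into the context

# Scripted stand-in for the model: first asks the calculator, then answers.
def fake_model(context):
    if "->" not in context:
        return "I need the product. [CALL calc(123*456)]"
    result = context.rsplit("-> ", 1)[1].strip()
    return f"The answer is {result}."

print(run_with_tools(fake_model, "What is 123*456? "))
```

The important structural point is that the model never computes `123*456` itself; it only decides that a call is needed and what the argument is, matching the dispatcher framing above.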

Toolformer: Self-Supervised Tool Learning

Toolformer (Schick et al., 2023) trains a model to decide when and how to call tools, using self-supervision rather than human demonstrations.

The procedure:

  1. Start with a pretrained LM and a set of APIs (calculator, search, calendar, etc.)
  2. For each training example, sample candidate tool calls at each position
  3. Execute each candidate call and check if inserting the result reduces loss on subsequent tokens
  4. Keep the calls that reduce loss; discard the rest
  5. Fine-tune the model on the augmented dataset

The key insight: the model learns to call tools exactly when doing so improves its own next-token prediction. No human annotation of "when to use a tool" is needed.

Definition

Toolformer Filtering Criterion

A candidate tool call $c$ at position $i$ is kept if:

$$L_i(e^-) - L_i(e^+) \geq \tau$$

where $L_i(e^+)$ is the loss on tokens following position $i$ when the tool result is included, $L_i(e^-)$ is the loss without the tool call, and $\tau$ is a threshold. The call is useful if it reduces prediction loss by at least $\tau$.
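The criterion itself is a one-line comparison once the two losses are in hand. In the sketch below the loss values are supplied directly; a real implementation would score the continuation under the LM with and without the inserted tool result.

```python
# Sketch of the Toolformer filtering step: a candidate call at position i is
# kept iff inserting the tool result lowers the loss on the following tokens
# by at least tau. The loss numbers here are illustrative stand-ins.

def keep_call(loss_without, loss_with, tau=0.1):
    """Toolformer criterion: L_i(e^-) - L_i(e^+) >= tau."""
    return (loss_without - loss_with) >= tau

# A calculator result that makes "= 56088" easy to predict: loss drops, keep.
print(keep_call(loss_without=2.7, loss_with=0.4))   # -> True
# A useless call whose result barely changes the loss: filtered out.
print(keep_call(loss_without=1.0, loss_with=0.96))  # -> False
```

Sweeping `tau` trades off precision against recall of tool calls in the augmented dataset: a larger threshold keeps only calls with a clear predictive payoff.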

ReAct: Interleaved Reasoning and Acting

ReAct (Yao et al., 2023) structures tool-augmented generation as an alternation of thought steps and action steps.

The format:

  • Thought: the model reasons about what it knows and what it needs
  • Action: the model calls a tool (search, lookup, calculate)
  • Observation: the tool returns a result
  • Repeat until the model emits a final answer

ReAct is a prompting strategy, not a training method. It works with any instruction-following model. The explicit thought steps serve two purposes: they give the model space to plan, and they make the reasoning chain interpretable to humans.
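The thought-action-observation loop can be made concrete with a scripted trace. Everything here is a stand-in: the `lookup` tool, its tiny knowledge base, and the pre-scripted steps replace what would be live LLM calls emitting a Thought followed by either an Action or a final answer.

```python
def lookup(query):
    """Stand-in retrieval tool with a hard-coded two-entry knowledge base."""
    kb = {
        "capital of France": "Paris",
        "population of Paris": "about 2.1 million",
    }
    return kb.get(query, "no result")

def react_loop(steps, max_steps=10):
    """Run scripted (thought, action) pairs; return the full ReAct trace."""
    trace = []
    for thought, action in steps[:max_steps]:
        trace.append(f"Thought: {thought}")
        if action is None:                          # final answer: stop acting
            break
        trace.append(f"Action: lookup[{action}]")
        trace.append(f"Observation: {lookup(action)}")  # result fed back in

    return trace

steps = [
    ("I need the capital of France first.", "capital of France"),
    ("Now I need its population.", "population of Paris"),
    ("I have both facts. Answer: Paris, about 2.1 million people.", None),
]
for line in react_loop(steps):
    print(line)
```

Note how each Observation enters the context before the next Thought: the second query ("population of Paris") could only be formulated after the first retrieval, which is exactly the multi-hop behavior ReAct is designed to support.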

Proposition

ReAct Improves Task Accuracy Over Pure Reasoning or Pure Acting

Statement

On the HotpotQA multi-hop question answering benchmark, ReAct (interleaved thought and action) achieves higher exact-match accuracy than either chain-of-thought alone (reasoning without tool access) or act-only (tool calls without explicit reasoning). Yao et al. (2023) report ReAct outperforms CoT by 6+ points on tasks requiring factual grounding and outperforms act-only by 5+ points on tasks requiring multi-step planning.

Intuition

Chain-of-thought can reason but hallucinates facts. Act-only retrieves facts but does not plan multi-step reasoning. ReAct combines both: think about what you need, retrieve it, then reason with the retrieved facts.

Why It Matters

ReAct established the design pattern now used in nearly all LLM agent frameworks. The thought-action-observation loop is the backbone of LangChain, AutoGPT, and similar systems. The empirical finding that explicit reasoning improves tool-use accuracy has been replicated across many benchmarks.

Failure Mode

ReAct fails when the model's reasoning leads it to call the wrong tool or formulate a bad query. If the retrieval tool returns irrelevant results, the model may incorporate those results into its reasoning, producing confident but wrong answers. ReAct also increases token cost substantially because of the verbose thought-action format.

Code-as-Thought

A specific and powerful form of tool augmentation: instead of reasoning verbally about a computation, generate code, execute it, and use the output.

For mathematical problems, code execution is strictly more reliable than verbal chain-of-thought. A model asked "What is the 50th Fibonacci number?" can either attempt mental arithmetic (error-prone) or write a 3-line Python program (exact).
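On the Fibonacci example, the code-as-thought path looks like this. The "generated" program is a fixed string here for illustration; a real system would execute model output inside a sandbox rather than a bare `exec`.

```python
# Code-as-thought: instead of attempting F(50) by verbal arithmetic, the
# model emits a short program and the runtime computes the value exactly.

generated = """
a, b = 0, 1
for _ in range(50):
    a, b = b, a + b
answer = a
"""

namespace = {}
exec(generated, namespace)     # execute the model's program (sandbox in practice)
print(namespace["answer"])     # -> 12586269025, the 50th Fibonacci number
```

The model's burden drops from performing 50 exact additions to writing a three-line loop it has seen countless times in pretraining.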

Proposition

Code Execution Eliminates Computation Errors on Verifiable Tasks

Statement

Let $p_{\text{verbal}}$ be the probability that a model produces the correct answer via verbal reasoning, and let $p_{\text{code}}$ be the probability that it generates correct code. On tasks where correctness is verifiable (arithmetic, symbolic manipulation, data processing):

$$P(\text{correct with tool}) = p_{\text{code}} + (1 - p_{\text{code}}) \cdot p_{\text{retry}}$$

where $p_{\text{retry}}$ accounts for the model retrying after execution errors. For arithmetic tasks, empirically $p_{\text{code}} \gg p_{\text{verbal}}$: Chen et al. (2023) report that GPT-4 with code interpreter achieves 97% accuracy on MATH problems where verbal CoT achieves only 52%.
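Plugging illustrative numbers into the formula shows how much a single debug-and-retry round is worth (the probabilities below are made up for the sketch, not taken from any paper):

```python
# Numeric sketch of the proposition's formula: success probability with one
# debug-and-retry round after a failed execution.

def p_correct_with_tool(p_code, p_retry):
    """P(correct) = p_code + (1 - p_code) * p_retry."""
    return p_code + (1 - p_code) * p_retry

# Correct code 90% of the time, recovery from a failed run 60% of the time:
# overall accuracy rises to 96%.
print(round(p_correct_with_tool(0.9, 0.6), 4))   # -> 0.96
```

The formula makes the failure structure visible: runtime errors are recoverable (they surface and trigger a retry), which is why silent logical bugs, not crashes, dominate the residual error.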

Intuition

Verbal reasoning requires the model to be a calculator, a symbolic algebra system, and a programmer all at once, using only next-token prediction. Code execution offloads the computation to a correct-by-construction runtime. The model only needs to specify the computation, not perform it.

Why It Matters

This result explains why code interpreter produces the largest measured accuracy gain of any tool for LLM reasoning. The 45-percentage-point improvement on MATH from adding code execution is larger than most architectural improvements. For any task involving computation, code-as-thought should be the default strategy.

Failure Mode

Code execution fails when the model writes incorrect code that still runs without error (logical bugs, not syntax errors). It also fails on tasks that are not easily expressible as code: common sense reasoning, ethical judgments, creative writing. The model must also handle the case where code produces runtime errors, which requires the ability to debug and retry.

Why Tool Use Improves Reliability on Verifiable Tasks

The key property is verifiability. For tasks where correctness can be checked (math: does the answer satisfy the equation? code: does it pass tests? search: does the source confirm the claim?), tool use converts an unreliable stochastic process (LLM generation) into a reliable deterministic one (tool execution).

This does not help for tasks without clear verification criteria. Generating a persuasive essay cannot be verified by a tool. Multiplying two matrices can.
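A minimal example of the math case ("does the answer satisfy the equation?"): a checker that a tool-using system could run over a candidate answer before accepting it. The function and its interface are invented for this sketch.

```python
# Verifiability sketch: a candidate root of a polynomial can be checked
# mechanically, turning "is this answer right?" into a deterministic test.

def verify_root(candidate, coeffs, tol=1e-9):
    """Return True iff `candidate` satisfies the polynomial equation whose
    coefficients are given highest-degree first."""
    value = 0.0
    for c in coeffs:
        value = value * candidate + c   # Horner's rule evaluation
    return abs(value) < tol

# x^2 - 5x + 6 = 0 has roots 2 and 3; 4 is not a root.
print(verify_root(2.0, [1, -5, 6]))   # -> True
print(verify_root(4.0, [1, -5, 6]))   # -> False
```

No analogous check exists for the persuasive-essay case, which is the boundary the section draws.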

Common Confusions

Watch Out

Tool augmentation is not fine-tuning

Toolformer requires fine-tuning, but most tool-augmented systems (ReAct, function calling, code interpreter) work via prompting or API design. The model does not need to be retrained to use tools; it needs to be instructed on the tool format and given examples.

Watch Out

More tools do not always help

Adding irrelevant tools increases the probability that the model calls the wrong one. Tool selection is itself a reasoning task that can fail. Systems with 3-5 well-chosen tools typically outperform systems with 50 poorly documented ones.

Watch Out

Tool use does not eliminate hallucination

A model can still hallucinate in the reasoning steps between tool calls. It can also misinterpret tool outputs. Tool use reduces hallucination on verifiable subproblems but does not address hallucination in planning, interpretation, or synthesis steps.

Summary

  • LLMs are good at deciding what to compute, bad at doing the computation
  • Toolformer: self-supervised learning of when to call tools, filtering by loss reduction
  • ReAct: thought-action-observation loop for interleaved reasoning and tool use
  • Code-as-thought: generate and execute code instead of verbal reasoning
  • Tool use dramatically improves accuracy on verifiable tasks (52% to 97% on MATH)
  • The improvement comes from verifiability: tools provide deterministic computation
  • More tools is not always better; tool selection is itself an error-prone step

Exercises

ExerciseCore

Problem

A model achieves 60% accuracy on arithmetic word problems using verbal chain-of-thought. With code interpreter, it generates correct Python code 80% of the time. When the code has a bug, the model successfully debugs and retries with 50% probability. What is the overall accuracy with the code interpreter?

ExerciseAdvanced

Problem

Design a Toolformer-style filtering experiment. You have a pretrained LM, a calculator API, and a dataset of math word problems. Describe the steps to determine which positions in the training text benefit from calculator calls. What loss function do you use for filtering, and what is the threshold $\tau$?

References

Canonical:

  • Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)
  • Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2023)

Current:

  • Chen et al., "Program of Thoughts Prompting" (2023)
  • Gao et al., "PAL: Program-Aided Language Models" (2023)
  • Parisi et al., "TALM: Tool Augmented Language Models" (2022)

Last reviewed: April 2026
