LLM Construction
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.
Why This Matters
LLMs are unreliable at arithmetic, symbolic manipulation, and factual recall. A 70B-parameter model asked to multiply two 5-digit numbers will frequently produce the wrong answer, but a model that calls a calculator gets the arithmetic right every time. Tool-augmented reasoning is the practical strategy for converting unreliable verbal computation into reliable, verified computation.
The core observation: LLMs are good at deciding what to compute but bad at doing the computation. Tools handle the computation. The model handles the planning.
Mental Model
Think of the LLM as a planner that decomposes a problem into steps, some of which require external execution. When the model encounters a subproblem it cannot solve reliably (arithmetic, database lookup, code execution), it emits a tool call, receives the result, and continues reasoning with that result as given. The model's job shifts from being an oracle to being a dispatcher.
Formal Setup
Let $M$ denote a language model and $\mathcal{T} = \{t_1, \ldots, t_K\}$ a set of available tools. Each tool $t_k$ takes a text input and returns a text output. A tool-augmented generation produces a sequence of interleaved reasoning tokens and tool calls:

$$y = (s_1, a_1, o_1, \; s_2, a_2, o_2, \; \ldots, \; s_n)$$

where $s_j$ are reasoning segments, $a_j$ are tool arguments, and $o_j = t_{k_j}(a_j)$ are tool outputs inserted back into the context.
Tool-Augmented Language Model
A tool-augmented LM is a pair $(M, \mathcal{T})$ where $M$ is a language model trained or prompted to emit structured tool calls, and $\mathcal{T}$ is a set of external functions the model can invoke during generation. The model generates tokens autoregressively but may pause generation, call a tool $t_k \in \mathcal{T}$, receive its output, and resume generation conditioned on that output.
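The pause-call-resume loop can be sketched as follows. Here `model_step` is a hypothetical, scripted stand-in for an LM decoding step that returns either plain text or a structured tool call; all names are illustrative, not from any specific framework.

```python
# Minimal sketch of a tool-augmented generation loop.
# `model_step` is a scripted stand-in for an LM; in a real system it
# would be a decoding call that can emit a structured tool request.

def calculator(expression: str) -> str:
    # A trusted tool: exact arithmetic instead of verbal computation.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def model_step(context: str):
    # Scripted behavior standing in for real model output:
    # request a multiplication, then read off the observation.
    if "Observation:" not in context:
        return {"tool": "calculator", "args": "13577 * 24681"}
    result = context.rsplit("Observation: ", 1)[1]
    return {"text": f"The product is {result}."}

def generate(prompt: str) -> str:
    context = prompt
    while True:
        step = model_step(context)
        if "tool" in step:  # pause: the model emitted a tool call a_j
            output = TOOLS[step["tool"]](step["args"])
            context += f"\nObservation: {output}"  # resume with o_j in context
        else:
            return step["text"]

print(generate("What is 13577 * 24681?"))  # The product is 335093937.
```

The loop structure (generate, intercept call, execute, append output, continue) is the same regardless of whether tool calls are learned (Toolformer) or prompted (ReAct, function calling).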
Toolformer: Self-Supervised Tool Learning
Toolformer (Schick et al., 2023) trains a model to decide when and how to call tools, using self-supervision rather than human demonstrations.
The procedure:
- Start with a pretrained LM and a set of APIs (calculator, search, calendar, etc.)
- For each training example, sample candidate tool calls at each position
- Execute each candidate call and check if inserting the result reduces loss on subsequent tokens
- Keep the calls that reduce loss; discard the rest
- Fine-tune the model on the augmented dataset
The key insight: the model learns to call tools exactly when doing so improves its own next-token prediction. No human annotation of "when to use a tool" is needed.
Toolformer Filtering Criterion
A candidate tool call $c_i$ at position $i$ is kept if:

$$L_i^{-} - L_i^{+} \ge \tau$$

where $L_i^{+}$ is the loss on the tokens following position $i$ when the tool call and its result are included, $L_i^{-}$ is the loss without the tool call, and $\tau > 0$ is a threshold. The call is useful if it reduces prediction loss by at least $\tau$.
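The filtering rule itself is a one-line comparison. In a real implementation the two losses would be computed by scoring the continuation under the LM with and without the inserted tool result; in this sketch they are plain numbers, and the candidate positions are made up for illustration.

```python
def keep_call(loss_without: float, loss_with: float, tau: float = 0.5) -> bool:
    """Toolformer-style filter: keep a candidate call only if inserting
    the tool result reduces the loss on subsequent tokens by at least tau."""
    return (loss_without - loss_with) >= tau

# Toy candidates: (position, L_minus, L_plus) with illustrative losses.
candidates = [(4, 3.2, 2.1), (9, 2.8, 2.7), (15, 4.0, 3.4)]
kept = [pos for pos, l_minus, l_plus in candidates if keep_call(l_minus, l_plus)]
print(kept)  # [4, 15] -- only the calls that reduced loss by >= tau survive
```

The fine-tuning set then consists of the original text with the surviving calls spliced in at their positions.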
ReAct: Interleaved Reasoning and Acting
ReAct (Yao et al., 2023) structures tool-augmented generation as an alternation of thought steps and action steps.
The format:
- Thought: the model reasons about what it knows and what it needs
- Action: the model calls a tool (search, lookup, calculate)
- Observation: the tool returns a result
- Repeat until the model emits a final answer
ReAct is a prompting strategy, not a training method. It works with any instruction-following model. The explicit thought steps serve two purposes: they give the model space to plan, and they make the reasoning chain interpretable to humans.
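A minimal ReAct-style loop, with the model replaced by a scripted policy and the search tool by a small in-memory lookup; both are illustrative assumptions, not real APIs, but the thought-action-observation structure matches the format above.

```python
# Scripted ReAct trace: Thought -> Action -> Observation -> ... -> Answer.
KB = {  # stand-in for a search/retrieval tool
    "Colorado orogeny": "The Colorado orogeny extends into the High Plains.",
    "High Plains": "The High Plains rise to around 1,800 m in elevation.",
}

def search(query: str) -> str:
    return KB.get(query, "No results found.")

def react_policy(observations):
    # Scripted stand-in for the model: pick the next step from what is known.
    if not observations:
        return ("Thought: I need the area the orogeny extends into.",
                ("search", "Colorado orogeny"))
    if len(observations) == 1:
        return ("Thought: Now I need the elevation of the High Plains.",
                ("search", "High Plains"))
    return ("Thought: I have enough to answer.", ("finish", "around 1,800 m"))

def run_react(max_steps=5):
    observations = []
    for _ in range(max_steps):
        thought, (action, arg) = react_policy(observations)
        print(thought)
        if action == "finish":
            return arg
        obs = search(arg)
        print(f"Action: search[{arg}]\nObservation: {obs}")
        observations.append(obs)

print("Answer:", run_react())
```

In a prompted system, `react_policy` is the LM itself, conditioned on few-shot examples of the Thought/Action/Observation format; the loop and the `max_steps` cap are unchanged.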
ReAct Improves Task Accuracy Over Pure Reasoning or Pure Acting
Statement
On the HotpotQA multi-hop question answering benchmark, ReAct (interleaved thought and action) achieves higher exact-match accuracy than either chain-of-thought alone (reasoning without tool access) or act-only (tool calls without explicit reasoning). Yao et al. (2023) report ReAct outperforms CoT by 6+ points on tasks requiring factual grounding and outperforms act-only by 5+ points on tasks requiring multi-step planning.
Intuition
Chain-of-thought can reason but hallucinates facts. Act-only retrieves facts but does not plan multi-step reasoning. ReAct combines both: think about what you need, retrieve it, then reason with the retrieved facts.
Why It Matters
ReAct established the design pattern now used in nearly all LLM agent frameworks. The thought-action-observation loop is the backbone of LangChain, AutoGPT, and similar systems. The empirical finding that explicit reasoning improves tool-use accuracy has been replicated across many benchmarks.
Failure Mode
ReAct fails when the model's reasoning leads it to call the wrong tool or formulate a bad query. If the retrieval tool returns irrelevant results, the model may incorporate those results into its reasoning, producing confident but wrong answers. ReAct also increases token cost substantially because of the verbose thought-action format.
Code-as-Thought
A specific and powerful form of tool augmentation: instead of reasoning verbally about a computation, generate code, execute it, and use the output.
For mathematical problems, code execution is strictly more reliable than verbal chain-of-thought. A model asked "What is the 50th Fibonacci number?" can either attempt mental arithmetic (error-prone) or write a 3-line Python program (exact).
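The Fibonacci case makes the point concrete: an executed program is exact, while verbal token-by-token arithmetic over 11-digit intermediate values is highly error-prone. A program the model might emit:

```python
def fib(n: int) -> int:
    """Iterative Fibonacci with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(50))  # 12586269025
```

The model only has to get the three-line specification right; the runtime does the fifty additions without error.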
Code Execution Eliminates Computation Errors on Verifiable Tasks
Statement
Let $p_{\text{verbal}}$ be the probability that a model produces the correct answer via verbal reasoning, and let $p_{\text{code}}$ be the probability that it generates correct code on the first attempt. On tasks where correctness is verifiable (arithmetic, symbolic manipulation, data processing):

$$p_{\text{tool}} = p_{\text{code}} + (1 - p_{\text{code}})\,\varepsilon \;\gg\; p_{\text{verbal}}$$

where $\varepsilon$ accounts for the model retrying after execution errors. For arithmetic tasks, empirically $p_{\text{code}} \gg p_{\text{verbal}}$: Chen et al. (2023) report that GPT-4 with code interpreter achieves 97% accuracy on MATH problems where verbal CoT achieves only 52%.
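Under the simplifying assumption that a failed execution is caught by the runtime and the model retries once with success probability eps, overall accuracy composes as follows (the numbers here are illustrative, not from any paper):

```python
def tool_accuracy(p_code: float, eps: float) -> float:
    """Probability of a correct final answer: correct code on the first try,
    or buggy code caught at runtime and fixed on the retry (probability eps)."""
    return p_code + (1 - p_code) * eps

# Illustrative numbers: 75% correct first try, 40% of caught failures fixed.
print(tool_accuracy(0.75, 0.40))  # 0.85
```

Note this only credits retries for failures the runtime can catch; logical bugs that run cleanly are not recovered, which is the failure mode discussed below.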
Intuition
Verbal reasoning requires the model to be a calculator, a symbolic algebra system, and a programmer all at once, using only next-token prediction. Code execution offloads the computation to a correct-by-construction runtime. The model only needs to specify the computation, not perform it.
Why It Matters
This result explains why code interpreter produces the largest measured accuracy gain of any tool for LLM reasoning. The 45-percentage-point improvement on MATH from adding code execution is larger than most architectural improvements. For any task involving computation, code-as-thought should be the default strategy.
Failure Mode
Code execution fails when the model writes incorrect code that still runs without error (logical bugs, not syntax errors). It also fails on tasks that are not easily expressible as code: common sense reasoning, ethical judgments, creative writing. The model must also handle the case where code produces runtime errors, which requires the ability to debug and retry.
Why Tool Use Improves Reliability on Verifiable Tasks
The key property is verifiability. For tasks where correctness can be checked (math: does the answer satisfy the equation? code: does it pass tests? search: does the source confirm the claim?), tool use converts an unreliable stochastic process (LLM generation) into a reliable deterministic one (tool execution).
This does not help for tasks without clear verification criteria. Generating a persuasive essay cannot be verified by a tool. Multiplying two matrices can.
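The distinction is mechanical: a matrix product can be recomputed and compared entry by entry, with no judgment involved. A sketch of such a tool-side verifier (pure Python, no dependencies; the function names are illustrative):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def verify(A, B, claimed):
    # Deterministic check: recompute and compare. No stochastic step remains.
    return matmul(A, B) == claimed

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(verify(A, B, [[19, 22], [43, 50]]))  # True: claim confirmed
print(verify(A, B, [[19, 22], [43, 51]]))  # False: hallucinated entry caught
```

No analogous `verify` exists for "is this essay persuasive", which is exactly why tool use helps on the former class of tasks and not the latter.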
Common Confusions
Tool augmentation is not fine-tuning
Toolformer requires fine-tuning, but most tool-augmented systems (ReAct, function calling, code interpreter) work via prompting or API design. The model does not need to be retrained to use tools; it needs to be instructed on the tool format and given examples.
More tools do not always help
Adding irrelevant tools increases the probability that the model calls the wrong one. Tool selection is itself a reasoning task that can fail. Systems with 3-5 well-chosen tools typically outperform systems with 50 poorly documented ones.
Tool use does not eliminate hallucination
A model can still hallucinate in the reasoning steps between tool calls. It can also misinterpret tool outputs. Tool use reduces hallucination on verifiable subproblems but does not address hallucination in planning, interpretation, or synthesis steps.
Summary
- LLMs are good at deciding what to compute, bad at doing the computation
- Toolformer: self-supervised learning of when to call tools, filtering by loss reduction
- ReAct: thought-action-observation loop for interleaved reasoning and tool use
- Code-as-thought: generate and execute code instead of verbal reasoning
- Tool use dramatically improves accuracy on verifiable tasks (52% to 97% on MATH)
- The improvement comes from verifiability: tools provide deterministic computation
- More tools is not always better; tool selection is itself an error-prone step
Exercises
Problem
A model achieves 60% accuracy on arithmetic word problems using verbal chain-of-thought. With code interpreter, it generates correct Python code 80% of the time. When the code has a bug, the model successfully debugs and retries with 50% probability. What is the overall accuracy with the code interpreter?
Problem
Design a Toolformer-style filtering experiment. You have a pretrained LM, a calculator API, and a dataset of math word problems. Describe the steps to determine which positions in the training text benefit from calculator calls. What loss function do you use for filtering, and how do you choose the threshold $\tau$?
References
Canonical:
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2023)
Current:
- Chen et al., "Program of Thoughts Prompting" (2023)
- Gao et al., "PAL: Program-Aided Language Models" (2023)
- Parisi et al., "TALM: Tool Augmented Language Models" (2022)
Next Topics
The natural next steps from tool-augmented reasoning:
- Agent protocols (MCP, A2A): standardized interfaces for tool communication
- Structured output and constrained generation: ensuring tool calls are well-formed
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Agentic RL and Tool Use (Layer 5)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Policy Gradient Theorem (Layer 3)
- Chain-of-Thought and Reasoning (Layer 5)
- Prompt Engineering and In-Context Learning (Layer 5)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)