Agentic RL and Tool Use
The shift from passive sequence generation to autonomous multi-turn decision making. LLMs as RL policies, tool use as actions, ReAct, AgentRL, and why agentic RL differs from chat RLHF.
Why This Matters
Standard RLHF trains a model to produce a single good response to a single prompt. Agentic RL trains a model to solve multi-step problems by taking actions in an environment: running code, searching the web, calling APIs, navigating UIs, and deciding when to stop.
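This is a fundamental shift. A chat model is a single-turn function from a prompt $x$ to a response $y$: $y = f_\theta(x)$. An agent is a multi-turn policy: $a_t \sim \pi_\theta(\cdot \mid o_{1:t}, a_{1:t-1})$, where $o_{1:t}$ are the observations so far and $a_{1:t-1}$ are the agent's earlier actions, running for many steps until a task is complete.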
The RL challenges are harder: horizons are longer, rewards are sparser, actions have real consequences (a wrong API call may be impossible to undo), and the state space includes the external world. Understanding the formal RL framework for agents explains why building reliable agents is much harder than building good chatbots.
Mental Model
Consider the difference between:
- Chat model: "Write me a Python function to sort a list." One response. Done.
- Agent: "Find the top 5 trending ML papers this week and summarize them." Search the web. Read multiple pages. Filter results. Synthesize. Decide when to stop. 10-50 actions over multiple turns.
The agent must decide what to do next at every step, handle failures (a search returns no results, an API errors out), and manage a growing context of past actions and observations. This is a sequential decision problem, exactly the setting RL was designed for.
The Agent as an MDP
LLM Agent as a Markov Decision Process
Statement
An LLM agent can be formulated as a partially observable MDP (POMDP):
- State $s_t \in \mathcal{S}$: The full environment state (file system contents, web page state, conversation history, tool outputs). Typically not fully observable.
- Observation $o_t \in \mathcal{O}$: What the agent sees. The text representation of the current state (tool output, error message, retrieved content).
- Action $a_t \in \mathcal{A}$: The agent's next output: a tool call (code execution, web search, API request), a text response, or a decision to terminate.
- Transition $P(s_{t+1} \mid s_t, a_t)$: The environment dynamics (code execution results, web page responses). Stochastic and partially known.
- Reward $r_t = R(s_t, a_t)$: Typically sparse: a final reward at task completion (did the agent solve the problem?) with zero intermediate reward.
The agent's policy $\pi_\theta(a_t \mid o_{1:t}, a_{1:t-1})$ is the LLM itself: given the history of observations and its own past actions, it generates the next action as a text string.
The objective is to maximize the expected cumulative reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$$
where $T$ is the (variable) episode length and $\gamma$ is the discount factor.
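The formulation can be made concrete with a toy rollout. The sketch below is illustrative only: `ToyEnv`, `counting_policy`, and `rollout` are hypothetical names, the "LLM" is replaced by a scripted function, and the reward earned by the k-th action is discounted by gamma to the power k-1 (one common convention, assumed here).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical environment: the hidden state is a secret digit to guess."""
    secret: int = 3
    max_steps: int = 5
    steps: int = field(default=0, init=False)

    def reset(self) -> str:
        self.steps = 0
        return "guess a number between 0 and 4"

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Return (observation, reward, done). Reward is sparse: 1.0 only on success."""
        self.steps += 1
        if action == str(self.secret):
            return "correct", 1.0, True
        return "wrong", 0.0, self.steps >= self.max_steps


def rollout(env: ToyEnv, policy: Callable[[List[str]], str],
            gamma: float) -> Tuple[List[str], float]:
    """Run one episode; the policy conditions on the full observation/action history."""
    history = [env.reset()]
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(history)
        obs, reward, done = env.step(action)
        history += [action, obs]
        ret += discount * reward   # accumulate the discounted return
        discount *= gamma
    return history, ret


def counting_policy(history: List[str]) -> str:
    """Stand-in for the LLM: guess 0, 1, 2, ... using only the history length."""
    num_actions_so_far = (len(history) - 1) // 2
    return str(num_actions_so_far)


history, ret = rollout(ToyEnv(), counting_policy, gamma=0.9)
print(len(history), ret)
```

The agent never sees `secret` directly. It only sees text observations, which is the partial observability in the definition above. The single nonzero reward arrives on the fourth action, so the return is the reward discounted three times.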
Intuition
The LLM is the "brain" of the agent. It processes observations (text from the environment), makes decisions (which tool to call, what arguments to use), and updates its plan based on results. The context window is the agent's "working memory". It contains the history of actions and observations. When the context window fills up, the agent must compress or summarize, which introduces information loss.
Why It Matters
This formulation connects LLM agents to the vast RL theory literature. Concepts like exploration-exploitation tradeoff, credit assignment, temporal abstraction, and reward shaping all apply directly. The formulation also reveals why agents are hard: long horizons, sparse rewards, and partial observability are exactly the settings where RL struggles most.
Failure Mode
The POMDP formulation assumes the policy can condition on the full observation-action history. In practice, the LLM has a finite context window, so it cannot condition on arbitrarily long histories. When episodes exceed the context length, the agent loses access to early observations and actions. This is not just a technical limitation: the truncated or summarized context is no longer a sufficient statistic for the history, which introduces a systematic source of error on long tasks.
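A small sketch of this failure mode, under simplifying assumptions: `count_tokens` is a crude whitespace-based stand-in for a real tokenizer, and `truncate_history` is a hypothetical drop-oldest-first strategy (real systems may summarize instead).

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_history(history: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent entries that fit in the budget, dropping the oldest first."""
    kept: List[str] = []
    used = 0
    for entry in reversed(history):
        cost = count_tokens(entry)
        if used + cost > max_tokens:
            break
        kept.append(entry)
        used += cost
    return list(reversed(kept))


history = [
    "Observation: the config file sets port 8080",
    "Action: run_tests()",
    "Observation: 3 tests failed with connection refused",
    "Action: read_file('server.py')",
    "Observation: server binds to the port from config",
]
visible = truncate_history(history, max_tokens=20)
print(visible)
```

With a 20-token budget the first observation no longer fits, so the port number is gone from what the policy can see, even though it may be exactly the fact needed to explain the failing tests.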
Tool Use as Actions
Tool-Augmented LLM
A tool-augmented LLM has access to a set of tools $\mathcal{T} = \{\text{tool}_1, \ldots, \text{tool}_n\}$, each with a typed interface (input schema, output schema). At each step, the agent either:
- Calls a tool: Generates a structured tool call $(\text{tool}_i, \text{args})$, receives the tool output $o_{t+1}$
- Generates text: Produces a text response (reasoning, answer, etc.)
- Terminates: Signals task completion
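The action space is $\mathcal{A} = \mathcal{A}_{\text{tool}} \cup \mathcal{A}_{\text{text}} \cup \{\text{terminate}\}$, where $\mathcal{A}_{\text{tool}}$ is the set of well-formed tool calls and $\mathcal{A}_{\text{text}}$ is the set of text responses.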
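One way to picture this three-part action space is as a tagged union. The sketch below is an assumption-laden illustration: the class names, the `TOOLS` registry, and its two toy tools are hypothetical, and real tool interfaces would use structured argument schemas rather than a single string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str


@dataclass(frozen=True)
class TextResponse:
    text: str


@dataclass(frozen=True)
class Terminate:
    pass


# The action space is the union of the three action kinds.
Action = Union[ToolCall, TextResponse, Terminate]

# Hypothetical tool registry: each tool maps a string argument to a string output.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
    "echo": lambda s: s,
}


def execute(action: Action) -> str:
    """Return the observation the environment produces for an action."""
    if isinstance(action, ToolCall):
        if action.tool not in TOOLS:
            return f"error: unknown tool {action.tool!r}"
        return TOOLS[action.tool](action.args)
    if isinstance(action, TextResponse):
        return ""  # text extends the context; the environment returns nothing
    return "<episode ended>"


print(execute(ToolCall("calculator", "2+3")))
print(execute(ToolCall("browser", "example query")))
```

Returning an error string for an unknown tool, rather than raising, keeps the failure visible to the policy as an ordinary observation it can react to.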
Common tools in 2026 agents:
- Code execution: Python interpreter, shell commands
- Web search: Search engine queries, page retrieval
- API calls: Database queries, external service requests
- File operations: Read, write, edit files
- UI interaction: Browser navigation, click, type
The tool set defines the agent's capabilities. An agent without tools is just a chatbot. An agent with code execution can solve problems that require computation. An agent with web search can access current information. The choice of tools is a design decision that shapes the agent's effective competence.
The ReAct Pattern
ReAct (Reasoning + Acting)
The ReAct pattern interleaves reasoning (chain-of-thought) with acting (tool calls) in an explicit loop:
Thought: I need to find the current stock price of AAAI.
Action: web_search("AAAI stock price today")
Observation: [search results]
Thought: The search results show the price is $142.50. Let me verify...
Action: web_search("AAAI stock price Yahoo Finance")
Observation: [Yahoo Finance result]
Thought: Confirmed. The price is $142.50. I can now answer.
Action: respond("The current AAAI stock price is $142.50.")
Each cycle consists of: Thought (reasoning about what to do), Action (tool call or response), Observation (result from the environment). The LLM generates the Thought and Action; the environment generates the Observation.
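The Thought/Action/Observation cycle can be sketched as a short loop. This is a minimal illustration rather than any particular framework's implementation: `scripted_llm` is a hypothetical stand-in for a real model, `lookup` is a made-up tool, and the `Action: name("arg")` syntax is assumed from the trace above.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches a line such as: Action: web_search("some query")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$', re.MULTILINE)


def parse_action(llm_output: str) -> Tuple[str, str]:
    """Extract (tool_name, argument) from the model's Thought + Action text."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        raise ValueError("no Action line found")
    return match.group(1), match.group(2)


def react_loop(llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]],
               question: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        output = llm("\n".join(transcript))  # the model writes Thought + Action
        transcript.append(output)
        name, arg = parse_action(output)
        if name == "respond":                # terminal action: answer the user
            return arg, transcript
        if name in tools:
            observation = tools[name](arg)   # the environment writes the Observation
        else:
            observation = f"error: unknown tool {name}"
        transcript.append(f"Observation: {observation}")
    return "", transcript                    # step budget exhausted


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real model: looks something up once, then answers."""
    if "Observation:" not in prompt:
        return 'Thought: I should look this up.\nAction: lookup("capital of France")'
    return 'Thought: The lookup answered it.\nAction: respond("Paris")'


tools = {"lookup": lambda query: "Paris is the capital of France."}
answer, transcript = react_loop(scripted_llm, tools, "What is the capital of France?")
print(answer)
```

The transcript grows by one model output and one observation per cycle, which is the context growth discussed as the pattern's main limitation.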
ReAct is the dominant pattern for 2025-2026 agents because it:
- Makes reasoning explicit and inspectable
- Allows the agent to plan before acting
- Provides a natural structure for multi-step problem solving
- Enables human oversight (you can read the thoughts)
The limitation: ReAct generates reasoning tokens that consume context space. For long tasks, the growing history of thoughts, actions, and observations can exceed the context window.
Training Agentic Policies
Policy Gradient for Tool-Augmented Agents
Statement
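For an agent executing a trajectory $\tau = (o_1, a_1, \ldots, o_T, a_T)$ with episode reward $R(\tau)$, the policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid o_{1:t}, a_{1:t-1})\right]$$
Each action $a_t$ is a sequence of tokens $(a_t^1, \ldots, a_t^{K_t})$ (the tool call or text output), so:
$$\log \pi_\theta(a_t \mid o_{1:t}, a_{1:t-1}) = \sum_{k=1}^{K_t} \log \pi_\theta(a_t^k \mid o_{1:t}, a_{1:t-1}, a_t^{1:k-1})$$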
The gradient reinforces entire action sequences (tool calls with arguments) that led to successful episodes.
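The token-level decomposition can be checked numerically. The sketch below uses made-up per-token probabilities (not output from any real model) and computes the scalar surrogate whose gradient with respect to the policy parameters would be the REINFORCE estimate. It does no differentiation itself.

```python
import math
from typing import List


def action_log_prob(token_probs: List[float]) -> float:
    """Log-probability of one action: the sum of its tokens' log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def reinforce_surrogate(trajectory: List[List[float]], episode_reward: float) -> float:
    """Episode reward times the summed action log-probabilities.
    Differentiating this with respect to the policy parameters gives the
    single-trajectory REINFORCE gradient estimate."""
    return episode_reward * sum(action_log_prob(a) for a in trajectory)


# Hypothetical per-token probabilities for a two-action episode.
trajectory = [
    [0.5, 0.5],        # action 1: two tokens
    [0.25, 1.0, 0.5],  # action 2: three tokens
]
log_probs = [action_log_prob(a) for a in trajectory]
surrogate = reinforce_surrogate(trajectory, episode_reward=1.0)
print(log_probs, surrogate)
```

Because every token's log-probability is multiplied by the same episode reward, a long tool call with many tokens contributes many terms that are all scaled identically.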
Intuition
The policy gradient pushes the agent to repeat actions that led to high reward and avoid actions that led to low reward. But the reward comes only at the end of a long episode. Which of the 20 actions was responsible for success? This is the credit assignment problem: the fundamental difficulty of RL with sparse rewards over long horizons.
Why It Matters
This is the mathematical framework for training agents with RL. It shows why agentic RL is harder than chat RLHF: the sum over timesteps introduces high variance, the sparse reward provides weak signal per action, and the combinatorial action space (all possible tool calls with all possible arguments) is enormous.
Failure Mode
With sparse rewards and long horizons, the REINFORCE estimator has extremely high variance. A 20-step episode with binary reward gives each action a gradient proportional to the same episode-level reward, regardless of whether that specific action contributed to success. Variance reduction techniques (baselines, advantage estimation) help but do not fully solve the problem. This is why most agentic RL systems supplement sparse rewards with shaped intermediate rewards (e.g., partial credit for intermediate progress).
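A small numeric sketch of what a baseline does and does not change. The batch of four 20-step episodes and the batch-mean baseline are assumptions for illustration. A leave-one-out mean or a learned value function are common alternatives, and the plain batch mean used here slightly biases the estimate because each episode's own reward enters its baseline.

```python
from statistics import mean
from typing import List


def per_action_weights(episode_rewards: List[float], steps_per_episode: List[int],
                       use_baseline: bool) -> List[List[float]]:
    """Weight multiplying each action's log-probability gradient, per episode.
    Every action in an episode shares one weight: reward minus baseline."""
    baseline = mean(episode_rewards) if use_baseline else 0.0
    return [[r - baseline] * n for r, n in zip(episode_rewards, steps_per_episode)]


rewards = [1.0, 0.0, 0.0, 0.0]   # one success out of four episodes
steps = [20, 20, 20, 20]

raw = per_action_weights(rewards, steps, use_baseline=False)
centered = per_action_weights(rewards, steps, use_baseline=True)

print(raw[0][0], raw[1][0])            # weights without a baseline
print(centered[0][0], centered[1][0])  # weights with the batch-mean baseline
```

Without a baseline, the three failed episodes contribute no gradient at all. With it, they push probability away from their actions. In both cases all 20 actions within an episode still share a single weight, so the within-episode credit assignment problem is untouched.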
How Agentic RL Differs from Chat RLHF
| Property | Chat RLHF | Agentic RL |
|---|---|---|
| Horizon | 1 turn (single response) | 5-100+ turns |
| Reward | Dense (reward per response) | Sparse (reward at task completion) |
| Actions | Text generation | Tool calls + text |
| State | Fixed prompt | Evolving environment |
| Consequences | None (just text) | Real (code runs, files change) |
| Failure recovery | N/A | Must handle errors and retry |
| Credit assignment | Trivial (one action) | Hard (many actions) |
The key insight: chat RLHF is easy RL (a single-step contextual bandit problem, with the prompt as context and the whole response as one action). Agentic RL is hard RL (a sequential decision problem with all the classic challenges: exploration, credit assignment, partial observability, long horizons).
Multimodal Agents
The frontier of agentic RL extends beyond text to multimodal interaction:
- UI agents (Magma, CogAgent): Navigate graphical interfaces by observing screenshots and producing mouse clicks and keyboard inputs
- Embodied agents: Interact with physical or simulated environments
- Multi-tool agents: Combine code execution, web browsing, file editing, and API calls in a single episode
Multimodal agents process visual observations (screenshots) alongside text, expanding the observation space and adding new action types (click at coordinates, scroll, type into a field).
"Agents" does not mean AGI. The word "agent" in the LLM context has a precise technical meaning: an LLM used as a policy in a multi-step decision-making loop with tool access. This is RL with an LLM as the policy and tools as the action space. It is not general intelligence, consciousness, or autonomous goal-pursuit. An LLM agent that browses the web and writes code is doing the same thing as a robot that picks up objects: executing a learned policy in an environment. The capabilities are impressive but the mechanism is ordinary RL. Treating "agentic" as synonymous with "autonomous and goal-directed in the human sense" leads to confused safety analysis and inflated capability claims.
Training Infrastructure
Training agentic policies requires infrastructure beyond standard LLM training:
- Environments: Sandboxed execution environments for code, browsers, APIs. Each training episode requires spinning up and tearing down an environment instance.
- Trajectory collection: Episodes are collected by running the agent in the environment, which is much slower than sampling text (tool calls have latency, code execution takes time).
- Reward functions: Task-specific reward functions that check whether the agent completed the objective. Often hand-crafted per task category.
- Safety constraints: The agent must not perform irreversible harmful actions during training (delete important files, send unauthorized emails). Sandboxing is essential.
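The environment and reward pieces above can be sketched together. Everything here is hypothetical: `run_episode`, `file_contains_reward`, and the file-based task are invented for illustration, and a temporary directory only provides cleanup, not real isolation. An actual training setup would need a container or VM sandbox.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def file_contains_reward(workdir: Path, filename: str, expected: str) -> float:
    """Task-specific sparse reward: 1.0 if the agent left the expected file content."""
    target = workdir / filename
    if not target.is_file():
        return 0.0
    return 1.0 if target.read_text().strip() == expected else 0.0


def run_episode(agent_actions: List[Callable[[Path], None]],
                filename: str, expected: str) -> float:
    """Create a fresh working directory, apply the agent's actions, score, tear down."""
    with tempfile.TemporaryDirectory() as tmp:  # NOT a security sandbox
        workdir = Path(tmp)
        for act in agent_actions:
            act(workdir)
        return file_contains_reward(workdir, filename, expected)


def write_answer(workdir: Path) -> None:
    """Stand-in for an agent action that edits a file in the environment."""
    (workdir / "answer.txt").write_text("42\n")


success = run_episode([write_answer], "answer.txt", "42")
failure = run_episode([], "answer.txt", "42")
print(success, failure)
```

Each call to `run_episode` pays the setup and teardown cost once per episode, which is a large part of why trajectory collection is slower than plain text sampling.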
Common Confusions
Agentic RL is not just prompting with tools
Prompting a model with tool descriptions and examples is not RL. It is in-context learning. The model uses its pretrained knowledge to guess how to use tools. Agentic RL actually updates the model's weights based on success or failure in the environment. The difference matters: prompted agents hit a ceiling set by pretraining, while RL-trained agents can surpass it.
Function calling is not the same as agentic reasoning
Function calling (structured tool invocation) is a single action. Agentic reasoning is the ability to plan a sequence of actions, observe results, adapt the plan, handle failures, and decide when to stop. A model that can call functions is not necessarily an agent. It may just be a better-formatted chatbot. The "agentic" property is about multi-step sequential decision-making, not single-step tool invocation.
Longer context does not solve the horizon problem
A longer context window helps the agent remember more of its history, but it does not solve the RL challenges of credit assignment and exploration. Even with infinite context, the agent still needs to figure out which of its many actions was responsible for success (credit assignment) and decide whether to try new strategies versus exploit known ones (exploration). These are fundamental RL problems, not context length problems.
Summary
- LLM agents are RL policies: observation in, action out, multi-step episodes
- Agent MDP: state = environment, action = tool call or text, reward = task completion
- Tool use defines the action space: code execution, web search, APIs, UI
- ReAct pattern: interleave reasoning (Thought) with acting (Action) and observing (Observation)
- Agentic RL is harder than chat RLHF: longer horizons, sparser rewards, real consequences
- Credit assignment over long episodes is the core difficulty
- Policy gradient for agents: REINFORCE over multi-step trajectories with high variance
- "Agent" means LLM + RL + tools, not AGI or consciousness
- Training requires sandboxed environments and task-specific reward functions
Exercises
Problem
An LLM agent solves a coding task in 10 steps: 8 actions are code edits and 2 are test executions. The final test passes (reward = 1). Under REINFORCE without a baseline, what gradient does each action receive? Why is this problematic?
Problem
Compare the effective action space of a chat model (single-turn RLHF) versus an agentic model with 5 tools, each taking a string argument of up to 100 tokens. Assuming a vocabulary of 50,000 tokens, estimate the action space sizes and explain the implications for exploration.
Problem
The credit assignment problem in agentic RL can be partially addressed by hindsight analysis: after a successful episode, identify which actions were critical by counterfactual reasoning. Formalize this: define a "criticality score" for action $a_t$ in a successful trajectory $\tau$, and describe how you would estimate it using the model itself.
References
Canonical:
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2023)
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)
Current:
- DeepSeek-AI, "DeepSeek-R1" (2025): RL-trained reasoning model
- Yang et al., "Magma: A Foundation Model for Multimodal AI Agents" (2025): multimodal UI agents
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023)
- Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (2023)
Next Topics
The natural next steps from agentic RL:
- Post-training overview: how agent capabilities are built into the training pipeline
- Test-time compute and search: search strategies that agents use at inference time
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Policy Gradient Theorem (Layer 3)
Builds on This
- Agent Protocols: MCP and A2A (Layer 5)
- Multi-Agent Collaboration (Layer 4)
- Tool-Augmented Reasoning (Layer 5)