Agent Training And Fine-Tuning

Created: 2026-03-03 10:22
#note

Training Large Language Models (LLMs) for agentic capabilities (tool use, multi-step reasoning, planning, and environment interaction) requires techniques that go beyond standard alignment. While RLHF - Reinforcement Learning from Human Feedback and DPO - Direct Preference Optimization optimise the quality of single-turn responses, agent training must handle trajectory-level rewards, long-horizon credit assignment, and verification of multi-step action sequences. The field has evolved rapidly since 2023, with frameworks like FireAct, AgentTuning, and Agent-R1 pushing toward scalable agentic RL. This note covers the training and fine-tuning landscape for LLM-based agents and is part of the broader LLM Training and Alignment Evolution.

Training Approaches

Supervised Agent Fine-Tuning

The simplest approach: generate high-quality agent trajectories (tool calls, reasoning chains, action sequences) and fine-tune the model on them.

FireAct (2023) — generates agent trajectories with GPT-4, then fine-tunes smaller models (such as Llama-2 7B) on them. Fine-tuning on 500 trajectories gave a reported 77% performance increase on HotpotQA. Key insight: quality of trajectories matters far more than quantity
Toolformer (Meta, 2023) — self-supervised approach where the model learns to insert API calls into its own text, keeping only the calls whose results reduce its loss on the tokens that follow
Gorilla (Berkeley, 2023) — fine-tuned on API documentation to generate accurate tool calls, with retrieval-augmented training to stay current with API changes
AgentTuning (2024) — curates diverse agent interaction traces across multiple environments (web browsing, code execution, database queries) for generalised agent SFT
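A minimal sketch of how one trajectory might be turned into an SFT example. It assumes a common convention (not confirmed for any of the frameworks above): loss is computed only on the agent's own tokens (reasoning and tool calls), while user and tool-observation tokens are masked out. `Turn`, `toy_tokenize`, and `build_sft_example` are hypothetical names, and the tokenizer is a stand-in for a real one.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Label value that PyTorch's cross-entropy loss skips by default (ignore_index).
IGNORE_INDEX = -100


@dataclass
class Turn:
    role: str  # "user", "assistant" (reasoning / tool call), or "tool" (observation)
    text: str


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one fake id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def build_sft_example(trajectory: List[Turn]) -> Tuple[List[int], List[int]]:
    """Concatenate all turns; keep labels only for assistant tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for turn in trajectory:
        ids = toy_tokenize(turn.text)
        input_ids.extend(ids)
        if turn.role == "assistant":
            labels.extend(ids)  # the model is trained to produce these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # context only, no loss
    return input_ids, labels


trajectory = [
    Turn("user", "What is 17 * 23?"),
    Turn("assistant", "Thought: use the calculator. Action: calculator[17 * 23]"),
    Turn("tool", "391"),
    Turn("assistant", "Thought: the tool returned 391. Action: finish[391]"),
]
input_ids, labels = build_sft_example(trajectory)
```

Masking the observations keeps the model from being trained to predict tool outputs it does not control; whether a given framework does this is an assumption here.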

Reinforcement Learning for Agents

RL-based training uses trajectory-level rewards to optimise agent behaviour end-to-end.

Agent-R1 (November 2025) — multi-turn trajectory optimisation framework extending RLVF - Reinforcement Learning from Verifiable Feedback to agentic settings. Uses turn-level advantage estimation and environment feedback (code execution, API responses) as reward signals
CodeAct paradigm — agents act by executing code in sandboxed environments, and the execution result can serve directly as the reward signal. Connects to RLVF: if the code runs and produces the correct output, reward = 1; otherwise reward = 0
CARD framework — uses LLMs to design reward functions for agent training, enabling reward specification for complex tasks without manual engineering

graph TD
    A["Agent receives task"] --> B["Plans action sequence"]
    B --> C["Executes actions<br/>(tool calls, code, API)"]
    C --> D["Environment feedback"]
    D --> E{"Verifiable<br/>outcome?"}
    E -->|Yes| F["RLVF reward"]
    E -->|No| G["Learned reward<br/>or human feedback"]
    F --> H["Policy update<br/>(GRPO / PPO)"]
    G --> H
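
The reward routing and the GRPO branch of the policy update in the diagram can be sketched as follows. This is an illustration under stated assumptions, not any framework's implementation: `trajectory_reward` and `group_relative_advantages` are hypothetical names, the reward is binary when a verifier is available, and the advantage is the group-normalised reward (population standard deviation is used here; implementations differ on that detail).

```python
from statistics import mean, pstdev
from typing import List, Optional


def trajectory_reward(verifier_result: Optional[bool], learned_score: float) -> float:
    """Use the verifiable outcome when there is one, else fall back to a learned score."""
    if verifier_result is not None:
        return 1.0 if verifier_result else 0.0
    return learned_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward normalised against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same task: two verified correct, two verified wrong.
rewards = [trajectory_reward(ok, learned_score=0.0) for ok in (True, False, False, True)]
advantages = group_relative_advantages(rewards)
```

Successful trajectories get a positive advantage and failed ones a negative advantage; if every trajectory in the group earns the same reward, all advantages are zero and the group contributes no gradient.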

Credit Assignment in Long Trajectories

The central challenge: when an agent takes 10+ steps to complete a task, which steps contributed to success or failure?

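Entropy-Modulated Policy Gradients (EMPG) — rescales gradient updates by the entropy (uncertainty) of each step: updates from confident steps are amplified, for correct actions and errors alike, while updates from high-uncertainty steps are attenuated to stabilise exploration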
Turn-level advantage estimators — compute advantages at each turn rather than only at the trajectory end, providing denser learning signal
Group-in-Group Policy Optimization (GiGPO) — extends GRPO - Group Relative Policy Optimization to multi-turn settings by grouping at both the trajectory level and the turn level
Process Reward Models — step-level verification (as in RLVF - Reinforcement Learning from Verifiable Feedback) applied to agent reasoning steps
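
To make the idea of turn-level credit concrete, here is a simplified sketch. It is not the estimator used by Agent-R1, GiGPO, or EMPG; it assumes per-turn rewards are available, computes a discounted return-to-go for each turn, and subtracts a per-turn baseline averaged over a group of trajectories for the same task. The function names and the discount factor `gamma` are illustrative assumptions.

```python
from typing import List


def discounted_returns(turn_rewards: List[float], gamma: float = 0.95) -> List[float]:
    """Return-to-go for every turn: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def turn_level_advantages(group: List[List[float]], gamma: float = 0.95) -> List[List[float]]:
    """Advantage at turn t = return-to-go minus the group's mean return at turn t."""
    all_returns = [discounted_returns(rewards, gamma) for rewards in group]
    max_len = max(len(r) for r in all_returns)
    baselines: List[float] = []
    for t in range(max_len):
        # Only trajectories that actually reached turn t contribute to its baseline.
        values = [r[t] for r in all_returns if len(r) > t]
        baselines.append(sum(values) / len(values))
    return [[g - baselines[t] for t, g in enumerate(r)] for r in all_returns]


# Two 3-turn trajectories on the same task; only the first succeeds at the final turn.
group = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
advantages = turn_level_advantages(group, gamma=0.9)
```

With a sparse terminal reward, earlier turns still receive a discounted share of the outcome, so every turn gets a learning signal rather than only the last one.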

Industry Approaches (2024–2025)

Anthropic — Claude's agent capabilities trained with a combination of SFT on curated tool-use traces and RL with environment feedback. Emphasis on safe agentic behaviour through Constitutional AI principles applied to agent actions
OpenAI — Operator and agent-mode capabilities leveraging RLVF with code execution and web browsing verification
Google — Project Mariner for browser-based agents, with training on web interaction trajectories

Training with Environment Feedback

A defining feature of agent training is that the environment itself provides feedback:

Code execution — compiler errors, test results, runtime exceptions
API responses — success/failure codes, returned data validation
Web browsing — page load success, element interaction results, task completion verification
Database queries — result set validation, query execution success

This naturally connects to RLVF - Reinforcement Learning from Verifiable Feedback — the environment acts as the verifier.
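
A minimal sketch of the environment acting as verifier for the code-execution case: the candidate program is run together with its tests, and the exit code becomes a binary reward. `execution_reward` is a hypothetical helper, and a plain subprocess is assumed here purely for illustration; it is not a sandbox, so real training setups would need proper isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the candidate passes the tests, 0.0 on any failure or timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(candidate_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # runaway or hanging programs count as failures
    # A failed assertion, runtime exception, or syntax error gives a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


tests = "assert add(2, 3) == 5"
good_reward = execution_reward("def add(a, b):\n    return a + b", tests)
bad_reward = execution_reward("def add(a, b):\n    return a - b", tests)
```

The captured `stderr` (compiler errors, tracebacks) could also be fed back to the agent as an observation for the next turn, which is the other half of the feedback loop described above.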

Open Problems

Credit assignment — still the hardest problem: attributing trajectory-level success to individual decisions in 10–50 step sequences
Exploration vs exploitation — agents need to explore diverse strategies during training but exploit reliable patterns at deployment
Offline RL scaling — training on logged agent trajectories without environment interaction is cheaper but introduces distribution shift between the logged behaviour and the policy being trained
Reward signal combination — how to combine verifiable rewards (code ran correctly), learned rewards (plan quality), and safety constraints (no harmful actions) into a coherent training signal
Generalisation — agents trained on specific tool sets and environments struggle to transfer to new ones
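
For the reward-combination problem, one possible scheme (an assumption for illustration, not an established answer) is to treat safety as a hard override and blend the remaining signals with fixed weights. `RewardSignals`, `combine_rewards`, and the weight values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    verifiable: float       # 0.0 or 1.0 from an environment check (e.g. tests passed)
    learned: float          # score in [0, 1] from a learned reward model (e.g. plan quality)
    safety_violation: bool  # True if any action broke a safety constraint


def combine_rewards(
    signals: RewardSignals,
    w_verifiable: float = 0.7,
    w_learned: float = 0.3,
    violation_penalty: float = -1.0,
) -> float:
    """Safety overrides everything; otherwise return a weighted sum of the two rewards."""
    if signals.safety_violation:
        return violation_penalty
    return w_verifiable * signals.verifiable + w_learned * signals.learned


safe_success = combine_rewards(RewardSignals(verifiable=1.0, learned=0.5, safety_violation=False))
unsafe_success = combine_rewards(RewardSignals(verifiable=1.0, learned=0.9, safety_violation=True))
```

The hard override means a task solved through an unsafe action always scores below an honest failure, which a pure weighted sum would not guarantee; choosing the weights and the penalty remains the open part of the problem.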