Training AI agents to use tools like search engines for answering questions is notoriously unstable, often leading to poor performance or complete failure during learning. This brittleness stems from a fundamental credit-assignment problem: when an AI interacts with tools over multiple steps, it only receives a single reward at the end—correct or incorrect—making it hard to pinpoint which intermediate actions were helpful. A new approach called Turn-Level Information Potential Reward Shaping (TIPS) addresses this by providing dense, turn-by-turn feedback, significantly improving training stability and accuracy across diverse question-answering tasks.
TIPS works by assigning a reward to each turn of interaction, based on how much that turn increases the AI's confidence in generating a correct answer. Specifically, it uses a frozen copy of the AI model itself as a teacher to measure the change in log-likelihood of acceptable answers after each turn, such as issuing a search query and receiving its results. Turns that make the correct answer more predictable receive positive credit, while distracting or misleading turns get little or negative credit. This converts sparse outcome-only supervision into fine-grained guidance, helping the AI learn which tool calls and reasoning steps are truly informative.
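The core idea can be sketched in a few lines. The snippet below is a toy illustration, not the paper's implementation: `answer_loglik` stands in for the frozen teacher's log-likelihood of the gold answer (in practice this would come from a checkpoint of the policy model), and each turn is credited with how much it moves that log-likelihood.

```python
def answer_loglik(context, answer):
    """Toy stand-in for the frozen teacher LM: log p(answer | context).
    Here we fake it with simple keyword overlap; in the real method this
    is the log-likelihood of the gold answer under a frozen checkpoint."""
    overlap = sum(1 for w in answer.split() if w in context)
    return -5.0 + 2.0 * overlap  # more evidence -> less negative log-lik

def turn_rewards(turns, answer):
    """TIPS-style dense rewards: credit each turn with the change in the
    teacher's log-likelihood of the correct answer."""
    rewards = []
    context = ""
    prev = answer_loglik(context, answer)
    for turn in turns:
        context += " " + turn
        cur = answer_loglik(context, answer)
        rewards.append(cur - prev)  # informative turn -> positive reward
        prev = cur
    return rewards

turns = [
    "search: capital of France",     # sets up the query
    "result: Paris is the capital",  # informative observation
    "search: weather in Tokyo",      # distracting, adds nothing
]
print(turn_rewards(turns, "Paris"))  # → [0.0, 2.0, 0.0]
```

Only the turn that surfaces the answer text gets positive credit; the distracting search gets zero, which is exactly the fine-grained signal the outcome-only reward cannot provide.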
The researchers implemented TIPS within a standard reinforcement learning framework, integrating it with Proximal Policy Optimization (PPO). They formalized the interaction as a segment-level Markov decision process, where each turn corresponds to a segment of reasoning, tool invocation, and observation. By framing the turn-level rewards as potential-based shaping, TIPS preserves the optimal policy of the original task while providing denser learning signals. This approach requires no separate reward models, human process labels, or external verifiers—only checkpoints of the model being trained—making it practical for scaling to large models.
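Why shaping preserves the optimal policy is worth making concrete. A minimal sketch, with hypothetical potential values: potential-based shaping rewards have the form F_t = γ·φ(s_{t+1}) − φ(s_t), and with γ = 1 they telescope over a trajectory, adding only a constant offset φ(s_T) − φ(s_0) to the return rather than changing which policy is best.

```python
def shaping_terms(potentials, gamma=1.0):
    """Potential-based shaping rewards F_t = gamma * phi(s_{t+1}) - phi(s_t)."""
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]

# Hypothetical potentials (e.g., teacher log-likelihoods) along one trajectory.
phi = [0.0, 0.4, 1.5, 1.2, 2.0]
F = shaping_terms(phi)

# With gamma = 1 the terms telescope: sum(F) == phi[-1] - phi[0],
# so the shaped return differs from the original by a fixed offset
# and the original task's optimal policy is preserved.
print(F, sum(F))
```

This telescoping argument is why TIPS can densify the reward signal without a separate reward model and without biasing the final objective.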
Experiments across seven in-domain and out-of-domain benchmarks, including NQ, HotpotQA, and MuSiQue, show that TIPS consistently outperforms baselines like PPO and GRPO. For the Qwen-2.5-7B Instruct model, TIPS improved average Exact Match by 11.8% and F1 score by 13.6% over PPO, with even larger gains on multi-hop and out-of-domain tasks. Training dynamics, as illustrated in Figure 3, reveal that TIPS climbs steadily to high accuracy plateaus with low variance, whereas GRPO suffers performance collapse and PPO stagnates or drifts. Analysis of token-level advantage distributions in Figure 5 shows that TIPS yields a clean bimodal distribution with concentrated positive mass, indicating more stable learning compared to PPO's fat-tailed and near-zero mass patterns.
The implications of TIPS extend beyond search-augmented question answering, offering a general mechanism for stabilizing long-horizon reinforcement learning in tool-using AI agents. By providing turn-level credit assignment without additional labeling overhead, it could enable more reliable training of AI systems that interact with databases, code execution environments, or other external tools. The method's modest computational overhead—around 11-18% in FLOPs and wall-clock time—makes it feasible for real-world applications, as shown in Table 3 where it improved performance across model families like Qwen and Llama.
Despite its strengths, TIPS has limitations. The computational overhead, though manageable, adds cost to training, and the method is currently tied to PPO, limiting its integration with other optimization algorithms. Additionally, the teacher model must be periodically refreshed to stay aligned with the policy; if it becomes stale, the shaping signal can degrade. The paper notes that future work will explore quicker refresh strategies and test TIPS in reasoning-heavy domains like programming and math, which could further validate its generalizability.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.