Training AI agents to handle multi-turn conversations and reasoning tasks has long been plagued by instability, often causing models to collapse and lose performance unexpectedly. This issue is particularly critical as large language models (LLMs) are increasingly deployed in agentic settings, such as interactive search and medical question-answering, where they must coordinate multiple steps of retrieval and reasoning. A new study introduces ST-PPO, a stabilized version of the popular proximal policy optimization (PPO) algorithm, designed to address these training instabilities and enable more robust learning for multi-turn agents.
The researchers identified two main sources of instability in standard PPO when applied to multi-turn LLM agent training. First, token-level importance sampling, which optimizes at the level of individual tokens, misaligns with the natural granularity of multi-turn environments that operate in distinct turn-level stages, such as problem analysis, query formulation, and information processing. Second, off-policy updates rely on critic-based advantage estimates that are often unreliable for out-of-distribution tokens, leading to high-variance gradients and unstable updates. Through empirical analysis, including failed runs with models like Qwen2.5-7B, they observed that these factors cause performance collapses, as shown in Figure 2, where advantage estimates become highly variable and success rates drop sharply.
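To make the granularity mismatch concrete, the contrast between token-level and turn-level importance sampling can be sketched in plain Python. This is an illustrative reading, not the paper's exact formulation: it assumes a turn-level ratio is the product of the per-token ratios within that turn (equivalently, the exponential of the summed log-probability differences), with `turn_ids` standing in for the turn-boundary information the authors derive from loss masks.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old, as in standard PPO."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def turn_ratios(logp_new, logp_old, turn_ids):
    """One ratio per turn: exp of the summed log-prob differences in that turn."""
    sums = {}
    for n, o, t in zip(logp_new, logp_old, turn_ids):
        sums[t] = sums.get(t, 0.0) + (n - o)
    return {t: math.exp(s) for t, s in sums.items()}

# Toy example: five tokens spanning two turns (e.g. query formulation,
# then information processing).
logp_new = [-1.0, -0.5, -2.0, -1.5, -0.8]
logp_old = [-1.1, -0.6, -1.9, -1.4, -1.0]
turns    = [0, 0, 0, 1, 1]

print(token_ratios(logp_new, logp_old))  # five per-token ratios
print(turn_ratios(logp_new, logp_old, turns))  # two per-turn ratios
```

The point of the turn-level view is that clipping and credit assignment then act on a whole reasoning stage at once, rather than on individual tokens whose ratios can fluctuate independently.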
To tackle these instabilities, the team proposed two complementary stabilization techniques. Turn-level importance sampling aligns optimization with the natural structure of multi-turn reasoning by computing importance weights at the turn level rather than the token level, as formalized in Lemma 4.1. This approach, illustrated in Figure 1, allows for more precise credit assignment across reasoning phases. Additionally, clipping-bias correction normalizes gradients by downweighting unreliable, highly off-policy samples, addressing the variance from inaccurate critic estimates, as detailed in Lemma 4.2. Depending on how these components are combined, the researchers developed three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction).
Experiments on multi-turn search tasks across benchmarks like Natural Questions (NQ), HotpotQA, and medical multiple-choice QA demonstrated that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training. As shown in Figure 4, these variants maintain stable success rates without the sharp declines seen in token-level PPO and GRPO, which often require early stopping. Figure 5 further reveals that ST-PPO and S-PPO achieve lower clipping ratios and KL divergence throughout optimization, indicating more reliable gradient updates. In medical tasks, ST-PPO achieved the best average accuracy of 49.90% on benchmarks like MedQA and MedMCQA, outperforming retrieval-augmented and RL-enhanced baselines, as detailed in Table 1.
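The two stability signals tracked in Figure 5 are straightforward to monitor during training. The sketch below computes them from per-token log-probabilities; the `k1`-style KL estimator (mean of old-minus-new log-probs) is a common choice in PPO implementations and is an assumption here, since the article does not specify which estimator the authors used.

```python
import math

def training_diagnostics(logp_new, logp_old, eps=0.2):
    """Return (clip fraction, approximate KL(old || new)) for one batch.
    A rising clip fraction or KL indicates increasingly off-policy,
    high-variance updates of the kind the paper associates with collapse."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    clip_frac = sum(r < 1 - eps or r > 1 + eps for r in ratios) / len(ratios)
    approx_kl = sum(o - n for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return clip_frac, approx_kl
```

In practice one would log both quantities each update step; the paper's observation is that ST-PPO and S-PPO keep both curves low and flat, whereas token-level PPO shows them climbing before success rates crash.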
The implications of this work are significant for real-world applications where AI agents must perform complex, multi-step reasoning without failing mid-training. By stabilizing training, ST-PPO enables more scalable and efficient reinforcement learning for LLMs in domains like healthcare, where models can reliably retrieve and synthesize information from external sources. However, the study acknowledges limitations, such as the need for turn boundary identification, which relies on loss masks in the current implementation, and the potential for residual instability in extremely off-policy settings, as noted in supplementary experiments. Future research could explore extending these stabilization techniques to other reinforcement learning algorithms or adapting them for even longer-horizon tasks.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.