Large language models often struggle with complex mathematical and logical problems that require multi-step reasoning. While these AI systems excel at many tasks, their performance drops significantly when faced with challenging problems where traditional training methods fail. This limitation has hindered the development of more capable AI assistants for scientific and technical applications.
Researchers developed a new training method called Supervised Reinforcement Learning (SRL) that teaches AI models to solve complex problems by breaking them down into manageable steps. Unlike previous approaches that either rigidly imitate expert solutions or rely solely on whether the final answer is correct, SRL trains models to generate an internal reasoning process before committing to each action. This provides feedback at every step of the problem-solving process rather than only at the final answer.
The method works by decomposing expert solutions into sequences of intermediate actions. For each step in solving a problem, the model first generates an internal monologue explaining its thought process, then produces the actual action. The training compares the model's action against the expert's corresponding action using sequence similarity scoring, providing immediate feedback at each step. This creates a dense reward signal that guides learning more effectively than methods that only evaluate final answers.
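The step-by-step scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the `<think>` delimiter, the `StepExample` class, and the helper names are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever sequence similarity measure is actually used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StepExample:
    context: str        # problem plus the expert steps taken so far
    expert_action: str  # the expert's next step, used as the reward target


def decompose(problem: str, expert_steps: list[str]) -> list[StepExample]:
    """Turn one expert solution into one training example per step."""
    examples = []
    for k, step in enumerate(expert_steps):
        context = "\n".join([problem, *expert_steps[:k]])
        examples.append(StepExample(context=context, expert_action=step))
    return examples


def split_output(output: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Separate the internal monologue from the action (tag is an assumption)."""
    monologue, sep, action = output.partition(close_tag)
    if not sep:
        # No monologue delimiter found: treat the whole output as the action.
        return "", output.strip()
    return monologue.replace("<think>", "").strip(), action.strip()


def step_reward(model_output: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's."""
    _, action = split_output(model_output)  # the monologue is not scored
    return SequenceMatcher(None, action, expert_action).ratio()


examples = decompose(
    "Solve 2x + 3 = 11.",
    ["Subtract 3 from both sides: 2x = 8.", "Divide by 2: x = 4."],
)
output = "<think>Isolate the x term first.</think>Subtract 3 from both sides: 2x = 8."
print(step_reward(output, examples[0].expert_action))  # 1.0 for an exact match
```

Only the action is compared with the expert's step, so in this sketch the model is free to reason however it likes in the monologue. A partially matching action still earns a score between 0 and 1, which is what makes the signal dense.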
Experimental results show consistent improvements. On mathematical reasoning benchmarks including AMC23, AIME24, AIME25, and Minerva Math, SRL achieved performance gains of 3.0-3.7% over baseline methods. The approach proved particularly effective for problems where traditional reinforcement learning methods struggled because correct solutions were rarely sampled during training. The researchers also extended SRL to software engineering tasks, where it achieved a 14.8% success rate in generating correct code patches, a 74% relative improvement over previous methods.
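To see why rarely sampled correct solutions are a problem for outcome-only reinforcement learning, consider a back-of-the-envelope calculation. The sketch below assumes each sampled attempt is independent and succeeds with a fixed probability; the function name and the numbers are illustrative and are not taken from the study.

```python
def prob_any_correct(p_correct: float, num_rollouts: int) -> float:
    """Chance that at least one of `num_rollouts` independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** num_rollouts


# Hypothetical per-attempt success rates, from an easy to a very hard problem.
for p in (0.2, 0.01, 0.001):
    chance = prob_any_correct(p, num_rollouts=8)
    print(f"p={p}: chance of any correct sample in 8 tries = {chance:.3f}")
```

When that chance is close to zero, most training batches contain no correct answer at all, so a reward based only on final correctness gives the model nothing to learn from. A per-step similarity score still produces a graded signal in that situation.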
This training method matters because it enables smaller, more accessible AI models to tackle problems previously reserved for larger, more resource-intensive systems. The stepwise approach allows models to develop flexible reasoning patterns, including upfront planning, on-the-fly adjustments, and reflective verification. This could lead to more reliable AI assistants for technical fields like mathematics, programming, and scientific research where complex reasoning is essential.
The approach does have limitations. It requires the AI model to already possess basic competence in following instructions, and the training data must contain properly structured step-by-step solutions. The method's effectiveness also depends on the quality of the expert demonstrations used for training. The researchers note that while SRL reduces task complexity by breaking solutions into steps, each step must still be one the model has a reasonable probability of performing well.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.