Training large language models (LLMs) to follow human preferences and improve reasoning is a costly and inefficient process, often wasting computational resources on tasks that are either too easy or too hard for the model at a given stage. Researchers from Alibaba Group have developed BOTS, a new method that allows AI to dynamically select the most beneficial tasks during reinforcement finetuning, significantly boosting efficiency without adding substantial overhead.
The core idea is that BOTS uses a Bayesian framework to continuously estimate task difficulty as the model learns, so that training concentrates on 'just right' challenges: tasks with a success probability near 0.5, which are the most informative for learning. This avoids the inefficiency of uniform sampling, where models spend excessive time on trivial or unsolvable problems, and it outperforms existing methods that rely solely on pre-scheduled curricula or a single source of evidence for task selection.
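To see why a success probability near 0.5 is the most informative, consider a binary reward. The short Python sketch below is an illustration, not code from the paper; the function names are hypothetical. It computes the variance of a pass/fail reward and the chance that a group of independent rollouts on one task contains both successes and failures. When every rollout in a group has the same outcome, methods that compare rollouts against each other get no contrast to learn from.

```python
def reward_variance(p: float) -> float:
    """Variance of a binary (Bernoulli) reward with success probability p."""
    return p * (1.0 - p)


def prob_mixed_outcomes(p: float, k: int) -> float:
    """Probability that k independent rollouts are not all successes
    and not all failures, i.e. the group carries a learning signal."""
    return 1.0 - p**k - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative probabilities only; k = 8 rollouts per task is an assumption.
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(p, reward_variance(p), prob_mixed_outcomes(p, 8))
```

Both quantities peak at p = 0.5 and fall to zero as p approaches 0 or 1, which is the intuition behind targeting tasks of medium difficulty.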
Methodologically, BOTS integrates two types of evidence: explicit evidence from direct rollouts on the selected tasks, and implicit evidence, which consists of difficulty predictions for unselected tasks produced by an ultra-lightweight interpolation-based plugin. It employs Thompson sampling to balance exploration of uncertain tasks with exploitation of tasks known to be beneficial, all grounded in Bayesian inference so that the estimates adapt as the model evolves. Task difficulty estimates are updated with a generalized rule that fuses both evidence sources; its parameters control the balance between the two sources and how quickly old evidence is discounted.
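The general shape of such a selection loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes each task keeps a Beta(alpha, beta) belief over its success probability, and the function names, the parameters rho (weight on implicit evidence) and lam (how much old evidence is retained), and the exact form of the update are all hypothetical.

```python
import random


def fuse_evidence(alpha, beta, explicit, implicit, rho=0.5, lam=0.9, prior=1.0):
    """Hypothetical fused update of a Beta(alpha, beta) difficulty belief.

    explicit: (successes, failures) observed in real rollouts.
    implicit: (pseudo_successes, pseudo_failures) predicted for the task.
    lam decays old evidence toward the prior; rho weights implicit evidence.
    """
    s_e, f_e = explicit
    s_i, f_i = implicit
    new_alpha = lam * alpha + (1 - lam) * prior + (1 - rho) * s_e + rho * s_i
    new_beta = lam * beta + (1 - lam) * prior + (1 - rho) * f_e + rho * f_i
    return new_alpha, new_beta


def select_tasks(posteriors, batch_size, rng, target=0.5):
    """Thompson sampling: draw one success probability per task from its
    Beta belief, then keep the tasks whose draw is closest to the target."""
    sampled = {tid: rng.betavariate(a, b) for tid, (a, b) in posteriors.items()}
    ranked = sorted(sampled, key=lambda tid: abs(sampled[tid] - target))
    return ranked[:batch_size]


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up beliefs: one task looks easy, one hard, one medium, one unknown.
    posteriors = {"easy": (200.0, 1.0), "hard": (1.0, 200.0),
                  "mid": (50.0, 50.0), "new": (1.0, 1.0)}
    print(select_tasks(posteriors, batch_size=2, rng=rng))
    # After 8 rollouts with 3 successes on "mid", plus a made-up prediction:
    posteriors["mid"] = fuse_evidence(*posteriors["mid"],
                                      explicit=(3, 5), implicit=(4, 4))
    print(posteriors["mid"])
```

Because selection uses a random draw from each belief rather than its mean, a task with little evidence (a wide Beta) still gets picked occasionally, which is how Thompson sampling trades exploration against exploitation.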
Experiments across math, code, and logic domains, using models such as Qwen2.5-1.5B and 7B, show that BOTS consistently improves training efficiency. In math tasks, for instance, it needed 36% fewer steps to reach baseline performance (a TTB of 0.64, meaning 64% of the baseline's steps) and achieved a 5% gain in best-so-far performance (a BSF of 1.05). The method kept the effective task ratio, defined as the proportion of tasks with non-trivial success probabilities, above 0.8, compared with below 0.4 for baselines, while adding less than 0.2% computational overhead.
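The metrics quoted here are simple ratios, and a short Python sketch makes the arithmetic concrete. The definitions below are a plausible reading of the article's wording, not the paper's exact formulas, and the function names and all input numbers are illustrative assumptions.

```python
def effective_task_ratio(success_rates):
    """Share of tasks whose empirical success rate is strictly between
    0 and 1 (assumed reading of 'non-trivial success probability')."""
    if not success_rates:
        return 0.0
    effective = sum(1 for p in success_rates if 0.0 < p < 1.0)
    return effective / len(success_rates)


def steps_ratio(method_steps, baseline_steps):
    """Steps the method needs divided by steps the baseline needs to hit
    the same score; values below 1 mean the method is faster."""
    return method_steps / baseline_steps


def best_score_ratio(method_best, baseline_best):
    """Best score reached by the method relative to the baseline's best."""
    return method_best / baseline_best


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    print(effective_task_ratio([0.0, 0.5, 1.0, 0.25]))
    print(steps_ratio(640, 1000))
    print(best_score_ratio(0.42, 0.40))
```

Under this reading, a steps ratio of 0.64 corresponds to a 36% reduction in training steps, and a best-score ratio of 1.05 to a 5% improvement.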
In practical terms, this means AI development can become faster and cheaper, enabling more rapid advancements in applications like automated reasoning and code generation. For everyday users, it could lead to smarter AI assistants that learn more efficiently from feedback, reducing the time and energy required for training.
The paper notes two main limitations: BOTS has been validated primarily on binary-reward tasks, and its performance may depend on the choice of reference models used for implicit evidence. Future work could extend it to non-binary rewards and add adaptive parameter tuning to improve robustness across diverse scenarios.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.