Reinforcement learning fine-tuning for large language models often fails because training becomes unstable for no obvious reason, but researchers have pinpointed a surprising culprit: the floating-point format used for computation. This discovery offers a simple, effective fix that could make AI training more reliable and efficient, benefiting everything from chatbots to automated systems.
The key finding is that the widely adopted BF16 floating-point format, despite its advantages in pre-training, introduces rounding errors that make the policy used for training diverge from the policy used for inference, even though both are nominally the same model. This mismatch leads to biased gradients and performance collapse. By reverting to the FP16 format, the researchers eliminated this issue, resulting in more stable training and superior outcomes on diverse tasks without altering model architectures or algorithms.
Methodologically, the team conducted offline analyses and rigorous tests using frameworks like VeRL and Oat. They compared BF16 and FP16 across various reinforcement learning algorithms, including GRPO and policy gradient methods, on models such as DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-30B-A3B-Base. FP16's higher precision (10 mantissa bits versus BF16's 7) reduced numerical discrepancies, preventing error accumulation during autoregressive sampling. Loss scaling techniques, standard in modern frameworks, managed FP16's limited dynamic range effectively.
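To make the precision argument concrete, here is a minimal Python sketch that rounds a single number to each format and compares the error. It is an illustration only, not code from the paper: the helper names round_to_fp16 and round_to_bf16, the sample probability, the tiny gradient value, and the scale factor of 1024 are all assumptions chosen for the example. The last part shows why loss scaling helps: a value too small for FP16 survives if it is multiplied by a constant before rounding and divided back afterwards.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    # struct's "e" format is binary16: 5 exponent bits, 10 mantissa bits.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value: 8 exponent bits, 7 mantissa bits."""
    # BF16 is the top 16 bits of a float32, so start from the float32 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits being discarded.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical token probability, chosen only for illustration.
p = 0.7531
fp16_err = abs(round_to_fp16(p) - p)
bf16_err = abs(round_to_bf16(p) - p)
print(f"FP16 rounding error: {fp16_err:.6f}")
print(f"BF16 rounding error: {bf16_err:.6f}")

# Loss scaling: a tiny (hypothetical) gradient underflows to zero in FP16,
# but survives if it is scaled up before rounding and scaled down afterwards.
tiny_grad = 1e-8
scale = 1024.0
unscaled = round_to_fp16(tiny_grad)
recovered = round_to_fp16(tiny_grad * scale) / scale
print(f"Without scaling: {unscaled}")
print(f"With scaling:    {recovered:.3e}")
```

The trade-off is visible in both halves of the sketch: FP16 keeps more digits of each value, while its narrower exponent range is what loss scaling compensates for.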
Results from the paper show that FP16 consistently outperformed BF16. For instance, in sanity-check tests, FP16 achieved up to 99% accuracy, while BF16 methods often collapsed early, peaking at only 73-88%. FP16 also reduced the KL divergence between training and inference policies by approximately 24 times, indicating better alignment. Experiments with Mixture-of-Experts models, LoRA adaptations, and larger dense models confirmed these benefits, with FP16 enabling smoother convergence and higher rewards.
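For readers wondering how a KL divergence between training and inference policies can be measured at all, one common approach is to compare the log-probabilities that each engine assigns to the same sampled tokens. The sketch below is an assumption about how such a measurement might look, not the paper's exact estimator: the function name estimate_kl, the direction of the divergence, and the sample numbers are all hypothetical.

```python
import math


def estimate_kl(logp_inference, logp_training):
    """Monte Carlo estimate of KL(inference || training).

    Both arguments hold per-token log-probabilities for the SAME tokens,
    which are assumed to have been sampled by the inference policy.
    """
    if len(logp_inference) != len(logp_training):
        raise ValueError("both lists must cover the same tokens")
    if not logp_inference:
        raise ValueError("need at least one token")
    # Average the log-ratio over the sampled tokens.
    diffs = [a - b for a, b in zip(logp_inference, logp_training)]
    return sum(diffs) / len(diffs)


# Hypothetical log-probabilities for two sampled tokens.
logp_inf = [math.log(0.5), math.log(0.25)]
logp_train = [math.log(0.4), math.log(0.25)]
kl = estimate_kl(logp_inf, logp_train)
print(f"Estimated KL: {kl:.4f}")
```

If both engines computed identical probabilities, the estimate would be exactly zero, so a smaller value indicates closer agreement between training and inference.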
In practical terms, this finding matters because it simplifies AI development. Engineers can adopt FP16 with minimal code changes, avoiding complex corrections that add computational overhead. For everyday users, it means more stable and capable AI assistants, as reliable training translates to better performance in applications like education, customer service, and content generation. The approach narrows the gap between how a model behaves during training and how it behaves when deployed, so models act more consistently in real-world use.
Limitations noted in the paper include potential challenges in extending the approach to extremely low-precision formats like FP8, though these are considered solvable. The study did not explore all possible hardware optimizations, and some framework-specific differences persisted, but the core advantage of FP16 remained clear across diverse settings.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.