Training large language models to follow human preferences often relies on a single reward model that assigns scores based on complex, sometimes conflicting criteria like factuality, helpfulness, and safety. This approach, while effective, can be opaque and prone to errors where the model learns to exploit the reward system without truly aligning with human intent. A new framework called CRM (Collaborative Reward Model) addresses these issues by replacing the monolithic reward model with a coordinated team of specialist evaluators, offering a more transparent and robust way to guide AI learning.
The researchers found that CRM significantly improves reasoning accuracy and stability compared to traditional single-reward models. In experiments on the GSM8K math dataset, the four-agent configuration achieved 29.87% accuracy, compared with 0.08% for the base model without CRM. On the RewardBench benchmark, which tests multi-dimensional preferences, CRM raised reasoning scores from 0.598 to 0.690 in the best-performing variant. These gains demonstrate that multi-agent collaboration helps language models better balance competing objectives, such as logical correctness and conversational quality, without sacrificing safety or fluency.
The methodology involves decomposing preference evaluation into domain-specific agents, each focusing on a particular aspect such as reasoning validity, factual consistency, or format adherence. For example, a Quality Assessor provides fine-grained judgments on step-by-step reasoning, while a Data Synthesizer injects synthetic perturbations to improve robustness. These agents work alongside global evaluators, such as a ranker-based preference score and an embedding-similarity reward, which measure overall alignment with reference answers. A centralized aggregator fuses these signals into a single reward, balancing factors like step-wise correctness and repetition penalties, making it compatible with standard reinforcement learning pipelines like GRPO.
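The aggregation step above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the agent names echo the article, but the toy scoring functions, weights, and repetition-penalty term are assumptions made for clarity.

```python
from typing import Callable, Dict


def collaborative_reward(
    response: str,
    reference: str,
    agents: Dict[str, Callable[[str, str], float]],
    weights: Dict[str, float],
    repetition_penalty: float = 0.1,
) -> float:
    """Fuse specialist-agent scores into one scalar reward (illustrative)."""
    # Weighted sum over each agent's score for this response.
    total = sum(
        weights[name] * agent(response, reference)
        for name, agent in agents.items()
    )
    # Penalize degenerate repetition (same tokens echoed over and over).
    tokens = response.split()
    if tokens:
        repeat_ratio = 1.0 - len(set(tokens)) / len(tokens)
        total -= repetition_penalty * repeat_ratio
    return total


# Toy stand-ins for a Quality Assessor and an embedding-similarity reward.
agents = {
    "quality": lambda r, ref: 1.0 if r.strip().endswith(ref) else 0.5,
    "similarity": lambda r, ref: len(set(r.split()) & set(ref.split()))
    / max(len(set(ref.split())), 1),
}
weights = {"quality": 0.6, "similarity": 0.4}
score = collaborative_reward("The answer is 42", "42", agents, weights)
```

Because the result is a single scalar, it can drop into any policy-gradient loop (such as GRPO) wherever a conventional reward model's score would go.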
Results from Table 1 show that adding more agents leads to consistent improvements. With two agents (Data Analyzer and Data Optimizer), GSM8K accuracy reached 22.33%; with three agents (adding the Quality Assessor), it increased to 23.15%; and with four agents (including the Data Synthesizer), it peaked at 29.87%. The RewardBench(rerank) variant, which uses explicit preference modeling, consistently outperformed other aggregation strategies, indicating that structured feedback enhances discriminative reward shaping. The framework maintained stable performance across dialogue tasks, with Chat and Chat Hard scores showing only moderate changes, confirming that it does not compromise general conversational abilities.
The implications of this work extend to making AI training more interpretable and adaptable. By breaking down reward signals into understandable components, CRM allows researchers to diagnose why a model produces certain outputs, reducing the risk of reward hacking, where models exploit scoring quirks. This modular design supports the integration of new evaluators as plug-in agents, enabling scalable improvements without additional human annotations. For real-world applications, this could lead to more reliable AI systems in fields like education or customer service, where transparent reasoning and safety are critical.
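The plug-in idea can be made concrete with a small registry sketch. The class and method names here are hypothetical (the paper does not specify an API); the point is that evaluators sharing a simple `(response, reference) -> float` interface can be added or swapped without touching the rest of the pipeline, and their per-agent scores remain inspectable.

```python
from typing import Callable, Dict


class EvaluatorRegistry:
    """Register specialist evaluators and report their individual scores."""

    def __init__(self) -> None:
        self._agents: Dict[str, Callable[[str, str], float]] = {}

    def register(self, name: str, agent: Callable[[str, str], float]) -> None:
        # New evaluators plug in without retraining the existing ones.
        self._agents[name] = agent

    def evaluate(self, response: str, reference: str) -> Dict[str, float]:
        # Keeping per-agent scores separate makes it easier to diagnose
        # which criterion a model is gaming (reward hacking).
        return {
            name: agent(response, reference)
            for name, agent in self._agents.items()
        }


registry = EvaluatorRegistry()
registry.register("format", lambda r, ref: 1.0 if r.endswith(".") else 0.0)
registry.register("exact_match", lambda r, ref: 1.0 if ref in r else 0.0)
scores = registry.evaluate("The result is 7.", "7")
```

Inspecting `scores` shows each criterion's verdict separately, which is the transparency benefit the article describes: a low aggregate reward can be traced back to the specific evaluator that flagged the response.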
Limitations noted in the paper include the need for empirical tuning of reward weights, such as the coefficients in the collaborative reward equation, which may not generalize across all tasks. The framework relies on pre-trained evaluators, and its performance depends on the quality of these agents, potentially introducing biases if they are not well-aligned. Additionally, while CRM improves robustness, it may not fully eliminate all forms of reward exploitation, and further research is needed to test its scalability to larger models or more diverse datasets beyond math and reasoning tasks.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.