Multi-agent AI systems, in which large language models (LLMs) work together like a team, promise to tackle complex reasoning tasks more effectively. In practice, these systems often fail because one agent becomes lazy and contributes little, leaving the other to do most of the work. This issue, identified in recent research, undermines collaboration and limits performance on challenges such as mathematical problem-solving. The study introduces a method that measures and corrects this imbalance so that every agent contributes meaningfully, improving results on benchmarks such as AIME25 and OlympiadBench.
Researchers discovered that in frameworks like ReMA, which pair a meta-thinking agent with a reasoning agent, the reasoning agent frequently outputs trivial or blank responses. For example, in a problem about arranging the letters of the word 'ELLIPSE', the reasoning agent copied summaries instead of performing calculations, shifting the burden to the meta-thinking agent. This lazy behavior arises from a bias in the training loss function, which unintentionally rewards shorter interactions and so discourages active participation. The team quantified the effect with causal influence metrics, showing that the meta-thinking agent's influence grew during training while the reasoning agent's diminished, until effective collaboration collapsed.
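To make the length bias concrete, here is a minimal arithmetic sketch. It assumes the normalization in question divides a trajectory-level advantage by the number of turns, which this summary does not spell out; `per_turn_weight` and its parameters are hypothetical names used only for illustration.

```python
def per_turn_weight(advantage: float, num_turns: int, normalize: bool = True) -> float:
    """Weight each turn receives in the update for one trajectory.

    Assumption (not from the paper summary): the biased loss divides the
    trajectory-level advantage by the number of turns.
    """
    if num_turns <= 0:
        raise ValueError("num_turns must be positive")
    return advantage / num_turns if normalize else advantage


# Two equally successful trajectories (same positive advantage).
short = per_turn_weight(1.0, num_turns=2)   # 0.5 per turn
long = per_turn_weight(1.0, num_turns=8)    # 0.125 per turn

# Without the normalization, turn count no longer changes the per-turn weight.
short_fixed = per_turn_weight(1.0, num_turns=2, normalize=False)
long_fixed = per_turn_weight(1.0, num_turns=8, normalize=False)
```

Under this assumption, each turn of the short trajectory is reinforced four times as strongly as each turn of the long one, so a policy can raise its reward signal simply by interacting less.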
To address this, the researchers developed Dr. MAMR, a method that removes the biased normalization in the loss function and introduces a Shapley-inspired approach to measure each agent's influence. This involves grouping semantically similar steps across rollouts and averaging their impacts, providing a robust estimate without extensive resampling. Additionally, a deliberation mechanism allows agents to discard noisy outputs and restart reasoning when necessary, preventing errors from propagating in multi-turn interactions. For instance, in tasks involving geometric calculations, this restart feature helped agents recover from mistakes and arrive at correct solutions.
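The grouping-and-averaging idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the per-step `impact` score is taken as given, token-level Jaccard overlap stands in for whatever semantic similarity measure the authors use, and the names `Step`, `jaccard`, `grouped_influence`, and the `threshold` value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    impact: float  # hypothetical raw influence score measured for this step


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def grouped_influence(steps: list[Step], threshold: float = 0.5) -> list[float]:
    """Average raw impacts over groups of similar steps across rollouts."""
    groups: list[list[int]] = []
    for i, step in enumerate(steps):
        for group in groups:
            # Compare against the group's first member (greedy grouping).
            if jaccard(step.text, steps[group[0]].text) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])

    smoothed = [0.0] * len(steps)
    for group in groups:
        mean = sum(steps[i].impact for i in group) / len(group)
        for i in group:
            smoothed[i] = mean
    return smoothed


rollout_steps = [
    Step("count the arrangements of the letters", 0.25),
    Step("count arrangements of the letters", 0.75),
    Step("draw the triangle", 0.1),
]
influence = grouped_influence(rollout_steps)
```

Here the first two steps fall into one group and share the averaged score, while the unrelated third step keeps its own. Reusing scores from similar steps in other rollouts is what lets the estimate stay stable without resampling each step many times.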
Extensive experiments on datasets like MATH500 and AIME25 demonstrated that Dr. MAMR outperforms baselines, with performance gains of up to 8% on OlympiadBench Pass@1 and 7% on AIME25 Pass@16. The method stabilizes training, preventing the collapse seen in standard frameworks, and scales well as more samples are drawn in Pass@K evaluations. Its limitations include reliance on semantic similarity thresholds for influence measurement and the need for hyperparameter tuning, both of which may affect how well it generalizes across diverse tasks. Beyond stronger AI reasoning, the work has practical implications for building more reliable collaborative systems in areas like education and automated problem-solving, where balanced agent contributions are crucial for accuracy and trust.
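For readers unfamiliar with the Pass@K metric mentioned above, the sketch below shows the unbiased estimator widely used in the code-generation evaluation literature. Whether the paper uses this exact estimator is an assumption; `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 16 samples, 4 of them correct.
p1 = pass_at_k(n=16, c=4, k=1)    # 0.25
p16 = pass_at_k(n=16, c=4, k=16)  # 1.0
```

The benchmark score is the mean of this value over all problems. Because the value can only increase with k for a fixed problem, a method that "scales well" with sampling is one whose correct answers are spread across more problems as k grows.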
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.