Science

AI Agents Struggle to Work Together

A new study reveals that top AI models often fail at simple teamwork, exposing a critical 'collaboration gap' that could hinder the future of autonomous systems.

AI Research
November 05, 2025

As artificial intelligence systems increasingly operate in teams, their ability to collaborate effectively becomes crucial for real-world applications. Researchers from EPFL and Microsoft Research have uncovered a surprising weakness in today's leading AI models: they often perform dramatically worse when working together than when working alone, even on simple tasks. This 'collaboration gap' represents a fundamental challenge for deploying AI in complex, multi-agent environments where cooperation is essential.

The key finding is that models that excel individually often degrade substantially when required to collaborate. In experiments with 32 leading open- and closed-source language models, the researchers observed that high-performing models often suffered significant drops in performance even when paired with identical copies of themselves. For instance, some distilled models that solved mazes well alone failed almost completely in certain pairings, which suggests that collaboration is a distinct capability axis not captured by current training approaches.

To measure collaborative capabilities, the researchers developed a novel maze-solving benchmark that isolates collaboration skills while allowing scalable automated evaluation. Each task is a 6×6 maze in which each agent receives only half of the maze information, with random cells obfuscated as '?' symbols. The agents must communicate through natural-language dialogue to combine their partial knowledge and navigate to the goal. The rules require mutual agreement before each move and impose no output-format constraints, in order to preserve ecological plausibility.
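To make the setup concrete, here is a minimal Python sketch of how one maze could be split into two partial views. It rests on assumptions the article does not confirm: the cell symbols `S`, `G`, `#`, and `.`, the helper names `split_maze` and `merge_views`, and the choice to make the two views exactly complementary are all illustrative and not taken from the study.

```python
import random

# Hypothetical 6x6 maze: S = start, G = goal, # = wall, . = open cell.
MAZE = [
    "S..#..",
    ".#.#.#",
    ".#....",
    ".####.",
    "......",
    "#.##.G",
]


def split_maze(grid, seed=0):
    """Return two partial views of `grid`.

    Assumption: every cell is hidden (shown as '?') in exactly one view,
    so each agent sees half of the cells and the views are complementary.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    rng.shuffle(cells)
    hidden_from_a = set(cells[: len(cells) // 2])

    view_a = [
        ["?" if (r, c) in hidden_from_a else ch for c, ch in enumerate(row)]
        for r, row in enumerate(grid)
    ]
    view_b = [
        [ch if (r, c) in hidden_from_a else "?" for c, ch in enumerate(row)]
        for r, row in enumerate(grid)
    ]
    return view_a, view_b


def merge_views(view_a, view_b):
    """Combine two partial views, preferring whichever cell is not '?'."""
    return [
        [a if a != "?" else b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(view_a, view_b)
    ]
```

In this sketch, `merge_views` recovers the full maze in one step. The agents in the benchmark have no such shortcut and must reach the same shared picture through dialogue alone.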

Results from thousands of automated evaluations showed consistent collaboration failures across model types. Most models experienced substantial drops when moving from solo to collaborative settings, and distilled models were particularly affected. Small models designed for efficiency often failed completely in certain pairings, while even powerful models like GPT-5 showed significant performance degradation when collaborating. Qualitative analysis of dialogue transcripts revealed fundamental communication breakdowns, including failures in coordinate grounding, where agents used different reference systems without establishing mutual understanding.
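One simple way to quantify such a drop is the difference between solo and collaborative success rates. The article does not state the study's exact metric, so the sketch below is an assumption; the function names and the outcome lists are made-up placeholders, not results from the paper.

```python
def success_rate(outcomes):
    """Fraction of runs that reached the goal (True = solved)."""
    return sum(outcomes) / len(outcomes)


def collaboration_gap(solo_outcomes, collab_outcomes):
    """Solo success rate minus collaborative success rate.

    Assumption: a plain difference of rates; the study may weight
    or normalize outcomes differently.
    """
    return success_rate(solo_outcomes) - success_rate(collab_outcomes)


# Placeholder outcomes for illustration only (not data from the study).
solo = [True, True, True, False]
collab = [True, False, False, False]
gap = collaboration_gap(solo, collab)
```

A positive gap means the model does worse in a pair than on its own; a gap near zero means collaboration costs it little.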

The implications extend beyond academic interest to practical AI deployment. As organizations increasingly rely on multi-agent systems for tasks ranging from customer service to scientific discovery, this collaboration gap could undermine real-world applications. The findings suggest that current training strategies fail to develop robust collaborative capabilities, potentially limiting AI's effectiveness in team-based scenarios. This is particularly concerning given the massive investments in AI-agent infrastructure and the shift toward agentic AI systems composed of multiple, independently developed components.

However, the research also identified a promising mitigation strategy called 'relay inference.' By having a stronger model initiate the collaboration and 'seed' the first few steps before handing off to a weaker partner, researchers were able to close much of the performance gap. This approach proved more effective than having stronger models intervene later to correct course, suggesting that initial grounding interactions are critical for successful collaboration.
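The handoff can be sketched as a turn-taking loop. This is a minimal illustration, not the study's implementation: the `Agent` type, the `relay_inference` function, the `seed_turns` parameter, and the strict alternation with a partner are hypothetical stand-ins, and real agents would wrap language-model calls.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]   # (speaker label, message)
Agent = Callable[[Transcript], str]  # hypothetical: dialogue so far -> next message


def relay_inference(
    strong: Agent,
    weak: Agent,
    partner: Agent,
    seed_turns: int = 2,
    max_turns: int = 6,
) -> Transcript:
    """Alternate turns between one seat and a partner.

    Assumption: the strong model fills the first `seed_turns` turns of the
    seat, then the weak model takes over that seat for the rest.
    """
    transcript: Transcript = []
    for turn in range(max_turns):
        if turn % 2 == 1:
            label, speaker = "partner", partner
        elif turn // 2 < seed_turns:
            label, speaker = "strong", strong
        else:
            label, speaker = "weak", weak
        transcript.append((label, speaker(transcript)))
    return transcript


# Stub agents for illustration; each just reports the current turn index.
def make_stub(name: str) -> Agent:
    return lambda transcript: f"{name} speaking at turn {len(transcript)}"


dialogue = relay_inference(make_stub("strong"), make_stub("weak"), make_stub("partner"))
```

With the default settings, the strong model speaks on the seat's first two turns and the weak model on the third, which mirrors the idea that early grounding turns matter most.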

The study acknowledges limitations, including the simplified nature of maze environments and potential confounding factors from model non-determinism. The researchers conducted extensive sensitivity analyses to ensure grading consistency across different models and found limited evidence of evaluation biases. Nevertheless, the collaboration gap observed in this controlled testbed likely represents a lower bound for real-world scenarios, where collaborative challenges would be more complex and consequential.

Original source: the complete research paper is available on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
