Building effective teams of artificial intelligence models is crucial for tackling complex tasks, but current methods often fail to identify which AIs work best together. A new study introduces a conversation-based approach that maps how language models interact, revealing synergistic groups without requiring access to the models' internal data or architectures. This method could streamline the design of multi-agent systems for applications from healthcare to research, making AI collaborations more efficient and powerful.
Researchers found that analyzing dialogues between large language models (LLMs) can identify clusters of models that collaborate effectively. By constructing a "language graph" in which nodes represent models and edges reflect the semantic coherence of their conversations, the team applied community detection algorithms to find groups with high mutual affinity. These clusters consistently aligned with the models' known specializations, such as mathematics or medicine. When deployed as teams on benchmark tasks, they outperformed randomly assembled groups and came close to the performance of manually curated ones.
The methodology involved three phases. First, the researchers generated pairwise conversations among ten diverse LLMs, using system prompts to guide discussions on general, mathematical, or medical topics. Each conversation continued for up to five turns or until a termination token was issued. Second, they computed a relationship value for each pair by summing the cosine similarities of utterance embeddings from their dialogue, filtering out weak interactions to form a sparse graph. Finally, they applied the Louvain algorithm to detect communities, representing potential collaborative teams.
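The second phase can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: it assumes that each utterance embedding is compared with the embedding of the utterance it replies to, and that a fixed cutoff removes weak pairs. The article does not specify either detail, and the function names, the toy two-dimensional embeddings, and the threshold value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relationship_value(utterance_embeddings):
    """Sum cosine similarities over consecutive utterances of one dialogue.

    Assumption: each utterance is compared with the one it replies to.
    """
    return sum(
        cosine(prev, curr)
        for prev, curr in zip(utterance_embeddings, utterance_embeddings[1:])
    )


def build_language_graph(dialogues, threshold):
    """Return a sparse weighted edge list, dropping weak interactions.

    dialogues: {(model_a, model_b): [embedding, embedding, ...]}
    threshold: hypothetical cutoff; pairs scoring below it get no edge.
    """
    edges = {}
    for pair, embeddings in dialogues.items():
        weight = relationship_value(embeddings)
        if weight >= threshold:
            edges[pair] = weight
    return edges


# Toy embeddings, made up purely for illustration.
dialogues = {
    ("math_a", "math_b"): [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    ("math_a", "med_a"): [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
}
graph = build_language_graph(dialogues, threshold=1.0)
```

In this toy case only the pair whose utterances stay semantically aligned keeps an edge. For the third phase, the surviving weighted edges could be loaded into a graph library and passed to an existing Louvain implementation, such as `louvain_communities` in NetworkX.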
In the evaluation, teams formed through this graph-based method achieved high accuracy on benchmarks. On the mathematical tasks GSM8K and MATH-500, the automatically identified math-focused cluster (including models such as Mathstral-7B-v0.1 and Qwen2-Math-7B-Instruct) scored second-highest, closely rivaling manually grouped teams. Similarly, on the medical tasks MedQA and MedMCQA, the medical cluster performed strongly, approaching the upper bound set by type-based groupings. The language graphs, shown in Figures 3-5 of the study, demonstrated that topic-specific priming was essential: without it, clusters were less coherent, but with focused prompts the method reliably isolated specialized models.
This approach matters because it enables the automatic formation of competent AI teams without relying on proprietary model details, which are often inaccessible. In real-world scenarios, this could help organizations deploy multi-agent systems for tasks like data analysis or decision-making more efficiently, reducing the need for manual testing and expert curation. By leveraging intrinsic synergies, it enhances robustness and performance in diverse domains, from scientific research to practical applications in education and healthcare.
Limitations include the method's computational cost, which scales quadratically with the number of models, potentially hindering scalability to large sets. Additionally, the definition of a "good" conversation relies on cumulative similarity metrics, which might not capture all aspects of effective collaboration, such as progress in reasoning. Future work could explore alternative measures and integrate this approach with task-driven frameworks to create hybrid systems that combine top-down planning with bottom-up insights from interactions.
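To make the quadratic growth concrete, the short sketch below counts how many conversations would be needed, assuming one conversation per unordered pair of models per topic. That counting rule is an assumption for illustration; the study may pair models differently, and the function name is hypothetical.

```python
from math import comb


def pairwise_conversations(num_models, num_topics=1):
    """Conversations needed if every unordered pair talks once per topic.

    Assumption: one dialogue per pair per topic, which may not match
    the exact protocol used in the study.
    """
    return comb(num_models, 2) * num_topics


ten_models = pairwise_conversations(10)          # 45 pairs
ten_models_three_topics = pairwise_conversations(10, num_topics=3)  # 135
hundred_models = pairwise_conversations(100)     # 4950 pairs
```

Under this assumption, growing the pool from 10 to 100 models raises the number of pairs from 45 to 4,950, roughly a hundredfold increase for a tenfold increase in models.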
In summary, this study provides a foundation for automated multi-agent design by focusing on how AIs communicate, offering a practical tool for assembling AI teams from conversational behavior alone.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.