AI Matches Experts on Hard Math Using Less Power

TL;DR

A new multi-agent system outperforms larger AI models on math benchmarks and PhD-level science questions while using far fewer computing resources.

Artificial intelligence systems have struggled with complex mathematical and scientific reasoning that requires both factual knowledge and step-by-step logical thinking. Now, researchers have developed an approach that lets AI tackle these problems more effectively than even much larger systems. The gains come from organizing AI agents into a specialized team that divides the work, much as human experts do when solving a difficult problem together.

Researchers created SIGMA, a framework that coordinates four specialized AI agents to solve mathematical and scientific problems. Each agent focuses on a different aspect of problem-solving: factual accuracy, computational verification, logical reasoning, and completeness checking. The system achieved a 7.4% absolute improvement over existing methods on the MATH500 benchmark and outperformed larger models like GPT-4o and Claude-3.5-Haiku despite using significantly fewer computational resources.

The methodology involves decomposing complex problems into specialized reasoning tasks handled by different AI agents. The FACTUAL agent retrieves relevant theorems and definitions, the COMPUTATIONAL agent performs mathematical calculations, the LOGICAL agent establishes conceptual relationships, and the COMPLETENESS agent verifies that no steps or cases have been overlooked. These agents work independently but coordinate through a moderator that synthesizes their findings into a final answer. The system uses a technique called Hypothetical Document Enhancement to generate tailored search queries specific to each agent's current reasoning state.
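To make the division of labor concrete, here is a minimal sketch of the four-role pipeline described above. It is not SIGMA's actual code: only the four role names come from the article, while the function names, the toy in-memory corpus, and the keyword search are hypothetical stand-ins. In particular, the query builder only imitates the idea behind Hypothetical Document Enhancement with a string template, where a real system would presumably have a language model draft the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The four roles named in the article; everything else here is hypothetical.
ROLES = ("FACTUAL", "COMPUTATIONAL", "LOGICAL", "COMPLETENESS")

# Assumed per-role focus used to tailor each search query.
FOCUS = {
    "FACTUAL": "theorem or definition",
    "COMPUTATIONAL": "worked calculation",
    "LOGICAL": "relationship between concepts",
    "COMPLETENESS": "missing cases or steps",
}

# Toy knowledge source standing in for a real search backend.
CORPUS = {
    "theorem": "Quadratic formula: x = (-b +/- sqrt(b^2 - 4ac)) / (2a).",
    "calculation": "Discriminant: 25 - 24 = 1, so x = (5 +/- 1) / 2.",
    "relationship": "Roots of a quadratic correspond to its linear factors.",
    "cases": "Check both signs of the square root and verify each root.",
}


@dataclass
class Finding:
    role: str
    query: str
    evidence: List[str]


def build_query(role: str, problem: str, state: str) -> str:
    """Tailor a query to the agent's role and its current reasoning state."""
    return f"{FOCUS[role]} for: {problem} | so far: {state}"


def toy_search(query: str) -> List[str]:
    """Return corpus entries whose key appears in the query text."""
    text = query.lower()
    return [doc for key, doc in CORPUS.items() if key in text]


def run_agent(role: str, problem: str, state: str,
              search: Callable[[str], List[str]]) -> Finding:
    """One agent: build a role-specific query, then collect evidence."""
    query = build_query(role, problem, state)
    return Finding(role, query, search(query))


def moderate(findings: List[Finding]) -> str:
    """Simple moderator: merge the findings, one line per role."""
    return "\n".join(f"{f.role}: {' '.join(f.evidence)}" for f in findings)


def solve(problem: str,
          search: Callable[[str], List[str]] = toy_search
          ) -> Tuple[List[Finding], str]:
    """Run the agents independently, then let the moderator synthesize."""
    state = "no steps yet"
    findings = [run_agent(role, problem, state, search) for role in ROLES]
    return findings, moderate(findings)


if __name__ == "__main__":
    _, summary = solve("Solve x^2 - 5x + 6 = 0")
    print(summary)
```

Each agent sees the same problem but asks a different question of the knowledge source, and the moderator only merges what comes back. A fuller version would update each agent's reasoning state between searches and have the moderator resolve disagreements rather than simply concatenate.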

Experimental results demonstrate SIGMA's effectiveness across multiple domains. On the MATH500 benchmark, the system improved performance by 3.6% over Search-o1 and 5.8% over Auto-TIR. SIGMA also outperformed GPT-4o by 8.1% and Claude-3.5-Haiku by 1.4% while requiring less than one-tenth of the computational resources. On PhD-level science questions from the GPQA benchmark, SIGMA delivered a 6.1% overall improvement, with gains of 9.3% in physics, 3.2% in chemistry, and 5.3% in biology. The paper's performance-versus-size graph shows the SIGMA variants forming a frontier on which smaller models achieve better results than much larger systems.

This approach matters because it addresses a fundamental limitation in current AI systems: the inability to effectively combine different types of reasoning. For students, researchers, and professionals working with complex mathematical or scientific problems, this could lead to more reliable AI assistants that can verify their work and catch subtle errors. The efficiency gains are particularly significant, as they suggest that better organization of AI reasoning processes might be more important than simply building larger models.

The researchers acknowledge that, although SIGMA performs strongly across multiple benchmarks, its effectiveness depends on the quality of the available knowledge sources and search capabilities. The current implementation uses a heuristic-based moderator rather than a learned component, which may limit its ability to resolve more complex conflicts between agent outputs. Future work will need to explore how this multi-agent approach scales to harder problem domains and whether similar benefits can be achieved in non-mathematical reasoning tasks.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
