Artificial intelligence systems that can solve complex scientific problems have long been a goal of researchers, but a new study reveals how far current technology still has to go. Researchers have created a specialized test called Reasoning With a Star (RWS) that evaluates how well AI systems handle real heliophysics problems—the study of how the Sun affects Earth and space weather. The results show that even the most advanced AI models struggle with tasks that require maintaining consistent units, stating physical assumptions, and delivering answers in proper scientific formats, with accuracy remaining below 45% across all tested systems.
When researchers tested various AI approaches on this new benchmark, they found that no single approach works best for all types of scientific reasoning. The study compared four different multi-agent coordination patterns against simple single-shot prompting, in which an AI model answers questions directly without any coordination. Google's Gemini 2.5 Pro performed best in single-shot testing but still achieved only 35.44% accuracy on the heliophysics problems. When researchers added coordination between multiple AI agents, performance improved but remained limited, with the best multi-agent system reaching just 44.31% accuracy.
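The contrast between single-shot prompting and coordinated approaches can be sketched as follows. This is an illustrative outline, not the paper's implementation: `call_model` is a hypothetical stand-in for any LLM API, and the critique loop is only similar in spirit to self-critique patterns like PACE.

```python
# Sketch of single-shot prompting vs. a simple self-critique loop.
# `call_model` is a hypothetical placeholder for a real LLM API call.

def call_model(prompt: str) -> str:
    # Placeholder: a real system would send `prompt` to an LLM here.
    return f"draft answer to: {prompt}"

def single_shot(question: str) -> str:
    """One direct call, no coordination between agents."""
    return call_model(question)

def self_critique(question: str, rounds: int = 2) -> str:
    """Answer, then repeatedly critique and revise the draft.

    Loosely analogous to self-critique coordination: each round
    feeds the previous draft back through the model for review.
    """
    draft = call_model(question)
    for _ in range(rounds):
        critique = call_model(f"Critique this answer: {draft}")
        draft = call_model(f"Revise using this critique ({critique}): {draft}")
    return draft
```

The study's point is that the extra structure in the second pattern helps only on tasks that actually require it.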
The researchers developed their benchmark by converting real educational materials from NASA's Living With a Star summer school into a machine-readable format. They created 158 question-answer pairs covering three types of scientific responses: numeric answers requiring correct units (38 items), symbolic answers requiring proper mathematical expressions (52 items), and textual answers requiring accurate scientific explanations (68 items). Each problem includes the original question, intermediate reasoning steps showing how experts solve it, and the final answer with proper formatting requirements. The team then built an automatic grader that checks answers using unit-aware numerical tolerance, symbolic equivalence verification through computer algebra systems, and schema validation for proper formatting.
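A grader of the kind described above can be sketched with a computer algebra library. This is a minimal illustration under stated assumptions, not the paper's actual grader: the function names and the 1% tolerance are invented, and the numeric check assumes both values have already been converted to the same units.

```python
# Sketch of two of the three grading modes: numeric tolerance and
# symbolic equivalence. Names and tolerance values are illustrative.
import sympy as sp

def grade_numeric(answer: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Numeric grading sketch: assumes both values are in the same
    units and accepts answers within a relative tolerance."""
    return abs(answer - expected) <= rel_tol * abs(expected)

def grade_symbolic(answer: str, expected: str) -> bool:
    """Symbolic equivalence via a computer algebra system: two
    expressions match if their difference simplifies to zero."""
    diff = sp.simplify(sp.sympify(answer) - sp.sympify(expected))
    return diff == 0

# A value within 1% of the expected answer passes:
print(grade_numeric(2.99e8, 3.0e8))                   # True
# Algebraically equivalent forms pass even if written differently:
print(grade_symbolic("sin(x)**2 + cos(x)**2", "1"))   # True
```

Schema validation for textual answers, the third mode, would sit alongside these checks and verify formatting requirements rather than content.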
The evaluation results, detailed in Table 2 of the paper, reveal important patterns about when different AI coordination strategies work best. For arithmetic-heavy problems like those in the GSM8K and MATH benchmarks, a simple self-critique approach called PACE performed best, achieving 93.41% and 81.51% accuracy respectively. For graduate-level science questions in the GPQA benchmark, a basic hierarchical workflow called HMAW worked best with 79.01% accuracy. However, for the heliophysics problems in RWS and the coding tasks in HumanEval and SWE-bench, a more sophisticated systems-engineering approach called SCHEMA performed best, reaching 44.31%, 43.29%, and 63.23% accuracy respectively. These results demonstrate that different scientific reasoning tasks require different AI coordination strategies.
This research matters because it reveals fundamental limitations in current AI systems' ability to perform genuine scientific reasoning. The heliophysics problems in RWS require more than just recalling facts—they demand that AI systems incorporate physical assumptions, maintain consistent units throughout calculations, and provide answers in proper scientific formats. The study shows that even when AI systems coordinate through multiple specialized agents, they still struggle with these basic requirements of scientific work. The researchers note that their findings support a core principle from systems engineering: complexity must be earned, not assumed. Adding more agents or coordination stages doesn't automatically improve performance unless the task specifically requires that additional structure.
The study acknowledges several important limitations in its current form. The RWS benchmark contains only 158 problems, which represents a relatively small sample of heliophysics problems. The researchers evaluated AI systems without giving them access to external knowledge through retrieval-augmented generation, meaning the models had to solve problems using only the information provided in each question. The paper also notes that their automatic grader, while sophisticated, sometimes requires additional verification through two separate AI agents when answers don't match exactly. Looking forward, the researchers plan to expand RWS with more problem sets and to improve the benchmark's ability to identify specific failure types, such as unit mismatches, unstated assumptions, and formatting violations.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.