AI Agents Struggle with Real Scientific Problems

Artificial intelligence systems that can use tools and solve complex problems are hitting a fundamental roadblock: they perform well on standard tests but fail when faced with novel scientific challenges. A new study reveals that even the most advanced AI models achieve less than 50% accuracy on tasks requiring genuine reasoning and verification, exposing a critical gap in their ability to generalize beyond familiar scenarios.

Researchers from the University of California, San Diego found that current AI agents, while impressive on conventional benchmarks, struggle significantly on the Mathematics & Physics Adversarial Verification & Evaluation Network (MAVEN). This new benchmark specifically targets out-of-distribution problems: unseen scenarios that require true reasoning rather than pattern recognition. Models such as GPT-5, GLM-4.5, and Grok-4, which perform well on established tests, show marked performance drops when confronted with MAVEN's scientific problems.

The team developed a framework called CoreThink Reasoner that addresses this generalization gap through a neuro-symbolic approach. This method adds a lightweight reasoning layer on top of existing language models, structuring problem-solving into distinct stages: buffering relevant information, synthesizing testable actions, and generating machine-interpretable invocations. The system maintains separation between planning and execution, preventing unintended effects while ensuring auditability.
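The three stages and the split between planning and execution can be illustrated with a toy example. This is a minimal sketch, not CoreThink's actual interface, which this article does not describe: every function name, the fact format ("name = value unit"), and the single "multiply" tool are assumptions made for illustration.

```python
import json
import math

# Hypothetical names throughout; this is not CoreThink's published interface.

def buffer_facts(facts, needed):
    """Stage 1: keep only 'name = value unit' facts that the task needs."""
    buffered = {}
    for fact in facts:
        name, sep, rest = fact.partition("=")
        if sep and name.strip() in needed:
            buffered[name.strip()] = float(rest.split()[0])
    return buffered

def synthesize_action(buffered, needed):
    """Stage 2: propose an action whose precondition is checked before anything runs."""
    missing = [n for n in needed if n not in buffered]
    if missing:
        raise ValueError(f"cannot plan, missing quantities: {missing}")
    return {"tool": "multiply", "args": [buffered[n] for n in needed]}

def generate_invocation(action):
    """Stage 3: serialize the action as a machine-interpretable JSON string."""
    return json.dumps(action, sort_keys=True)

def plan(facts, needed):
    """Planning only: returns an invocation string and has no side effects."""
    return generate_invocation(synthesize_action(buffer_facts(facts, needed), needed))

# Execution lives apart from planning; only this registry can run a tool.
TOOLS = {"multiply": math.prod}

def execute(invocation, audit_log):
    """Run one invocation and record it so the run can be audited afterwards."""
    call = json.loads(invocation)
    result = TOOLS[call["tool"]](call["args"])
    audit_log.append({"invocation": invocation, "result": result})
    return result

if __name__ == "__main__":
    facts = ["mass = 2.0 kg", "acceleration = 3.0 m/s^2", "the lab is painted blue"]
    log = []
    invocation = plan(facts, ["mass", "acceleration"])
    print(invocation, execute(invocation, log))
```

Because `plan` returns only a string, the proposed call can be inspected or rejected before `execute` touches any tool, and the audit log keeps a record of exactly what ran.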

Results across multiple benchmarks demonstrate CoreThink's effectiveness. According to Figure 3 of the study, the framework outperforms baseline models by 5-30% across domains including airline services, retail, and telecommunications. Most notably, CoreThink achieves these gains while operating at approximately one-tenth the cost of leading models, making advanced AI capabilities more accessible for research and development.

The MAVEN benchmark itself represents a significant advancement in AI evaluation. Unlike traditional tests that focus on final answers, MAVEN emphasizes the entire problem-solving process. Each problem requires agents to perform extended sequences of tool calls while maintaining verification and state management. For example, a physics problem might involve computing derivatives, solving equations, and verifying results across multiple steps, mirroring real scientific workflows where the process matters as much as the outcome.
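A derive, solve, verify sequence of this kind might look like the following. This is an illustrative sketch rather than an actual MAVEN problem: the projectile setup, the function names, the default values, and the tolerance are all assumptions. It finds the time at which a projectile peaks by differentiating its height, solving for zero velocity, and checking the result.

```python
def derivative(f, x, h=1e-6):
    """Tool 1: central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def bisect_root(f, lo, hi, tol=1e-9):
    """Tool 2: find a root of f in [lo, hi] by bisection."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2

def find_peak_time(v0=20.0, g=9.8, residual_tol=1e-6):
    """Chain the tools and record each step in a trace (hypothetical workflow)."""
    trace = []

    def height(t):
        return v0 * t - 0.5 * g * t * t

    def velocity(t):
        return derivative(height, t)

    trace.append("derivative")

    # Velocity is +v0 at launch and -v0 at landing (t = 2*v0/g), so the root is bracketed.
    t_peak = bisect_root(velocity, 0.0, 2 * v0 / g)
    trace.append("solve")

    # Tool 3: verification. Substitute the answer back instead of trusting the solver.
    if abs(velocity(t_peak)) > residual_tol:
        raise ValueError("verification failed: velocity is not zero at t_peak")
    trace.append("verify")
    return t_peak, trace

if __name__ == "__main__":
    print(find_peak_time())
```

The returned trace is what makes the process inspectable: an agent that skipped the final check could still report the same number, but its trace would show the missing step.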

This research has immediate implications for deploying AI in scientific and engineering applications. Systems that can reliably reason through complex problems while using external tools could accelerate research in fields ranging from drug discovery to materials science. The cost efficiency of the CoreThink approach also lowers barriers for academic institutions and smaller organizations to experiment with advanced AI capabilities.

However, the study identifies several limitations. Current AI systems still struggle with tool selection errors, missing verification steps, and numerical instability when faced with challenging parameters. The research also notes that while CoreThink improves generalization, it doesn't completely solve the fundamental challenge of creating AI that can reliably reason across all novel scenarios.

The work underscores the importance of developing evaluation methods that go beyond static correctness checking. By focusing on process fidelity and out-of-distribution performance, researchers can better assess whether AI systems are developing genuine reasoning capabilities or simply memorizing patterns. The open-source release of MAVEN and CoreThink aims to foster collaborative development of more robust, interpretable AI agents capable of handling the unpredictable challenges of real-world scientific discovery.
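The difference between static correctness checking and process fidelity can be made concrete with a small scorer. The metric below is an assumption for illustration only; the article does not specify how MAVEN scores a run. It reports the fraction of required steps that appear in the agent's trace in the right order, alongside a conventional final-answer check.

```python
def process_fidelity(trace, required_steps):
    """Fraction of required steps found in the trace in order (greedy match)."""
    matched = 0
    pos = 0
    for step in required_steps:
        try:
            idx = trace.index(step, pos)
        except ValueError:
            continue  # step is missing; later steps can still match
        matched += 1
        pos = idx + 1
    return matched / len(required_steps)

def evaluate(trace, answer, expected, required_steps, tol=1e-6):
    """Report answer correctness and process fidelity side by side."""
    return {
        "answer_correct": abs(answer - expected) <= tol,
        "process_fidelity": process_fidelity(trace, required_steps),
    }

if __name__ == "__main__":
    required = ["derivative", "solve", "verify"]
    # This run reaches the right number but never verifies it.
    print(evaluate(["derivative", "solve"], 2.0408163, 20.0 / 9.8, required))
```

A run like the one in the demo passes a static check while scoring only two of three required steps, which is the kind of gap a final-answer benchmark cannot see.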

About the Author

Guilherme A.