AIResearch

AI Agents Struggle with Scientific Research Tasks

AI agents struggle with scientific research tasks despite advanced capabilities. New study reveals critical gaps in automated discovery and data analysis.

AI Research
November 14, 2025

Artificial intelligence systems designed to assist with scientific research still struggle with many core research tasks, according to a comprehensive new benchmarking study. The research reveals that while AI agents excel at some literature review tasks, they perform poorly at automating scientific discovery and data analysis.

A team at the Allen Institute for AI developed AstaBench, a benchmarking suite comprising 2,400 tasks across scientific domains. The benchmark tests AI agents on four research categories: literature understanding, code execution, data analysis, and end-to-end discovery. The researchers evaluated 57 different AI architectures, including both specialized research agents and general-purpose systems.

The key findings show a stark performance gap across research tasks. On literature understanding, some agents achieved scores of around 80%, with the PaperFinder system performing particularly well. The same agents struggled with code execution and data analysis, however, scoring only around 25% on average. The most challenging category, end-to-end scientific discovery, remained largely unsolved: even the best agents scored poorly on tasks requiring a complete research cycle, from idea generation to final report.
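To see how a single headline number can hide this kind of gap, the sketch below averages per-category scores into one overall figure. It assumes an unweighted mean over the four categories, which the article does not confirm is the study's aggregation rule. The function name, category keys, and the end-to-end value of 0.10 are hypothetical placeholders; only the rough 80% and 25% figures come from the text above.

```python
from statistics import mean

# Category keys are illustrative names for the four categories in the article.
CATEGORIES = (
    "literature_understanding",
    "code_execution",
    "data_analysis",
    "end_to_end_discovery",
)


def overall_score(category_scores):
    """Unweighted mean across categories (an assumed aggregation rule)."""
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    return mean(category_scores[c] for c in CATEGORIES)


example = {
    "literature_understanding": 0.80,  # "around 80%" per the article
    "code_execution": 0.25,            # "around 25%" per the article
    "data_analysis": 0.25,             # "around 25%" per the article
    "end_to_end_discovery": 0.10,      # placeholder, not a reported number
}

overall = overall_score(example)
print(f"overall: {overall:.2f}")
```

Under these assumptions, an agent that is strong on literature tasks still lands at a modest overall score, which is why per-category results are more informative than the average alone.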

The study also analyzed cost against performance, finding that the most expensive AI models do not guarantee better scientific assistance. The most economical configuration tested, ReAct with GPT-5-mini, scored 32% overall at just $0.04 per problem, more than an order of magnitude cheaper than the top-performing models.
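One common way to read a cost-performance tradeoff like this is a Pareto frontier: the set of agents for which no other agent is both cheaper and better. The sketch below is a minimal illustration, not the study's own method. The `AgentResult` type, the `pareto_frontier` function, and every "hypothetical" row are made-up placeholders; only the ReAct with GPT-5-mini figures (32% at $0.04 per problem) come from the article.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_usd: float  # average cost per problem, in US dollars
    score: float     # overall score as a fraction in [0, 1]


def pareto_frontier(results):
    """Keep each agent that beats every cheaper (or equally priced) agent."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; ties on cost go to the higher score.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score)):
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


results = [
    AgentResult("react-gpt-5-mini", 0.04, 0.32),      # reported in the article
    AgentResult("hypothetical-agent-a", 0.50, 0.30),  # placeholder values
    AgentResult("hypothetical-agent-b", 1.20, 0.45),  # placeholder values
    AgentResult("hypothetical-agent-c", 3.00, 0.41),  # placeholder values
]

frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

With these placeholder numbers, the cheap agent stays on the frontier while two costlier agents drop off because something cheaper scores at least as well, which mirrors the article's point that price alone is a poor proxy for usefulness.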

Researchers identified several surprising patterns. Simple ReAct-style agents sometimes outperformed more complex, specialized systems, suggesting that sophisticated agent architectures may not always provide meaningful advantages for scientific tasks. The study also found that open-source AI models lag significantly behind commercial counterparts, with the best open-source system scoring only 11.1% compared to 53.0% for top commercial agents.

The benchmarking approach addresses critical limitations in previous evaluation methods. AstaBench provides standardized tools and environments that isolate agent capabilities from confounding factors like computational costs and tool access. The system includes the Asta Environment, which offers production-grade tools for accessing scientific literature and computational notebooks, enabling reproducible testing conditions.

For real-world scientific applications, these findings suggest researchers should carefully match AI tools to specific tasks. While AI agents can effectively assist with literature review and search, they remain unreliable for automating data analysis or driving scientific discovery. The performance-cost tradeoffs identified in the study provide practical guidance for researchers deciding which AI systems to deploy for different research needs.

The study's limitations section notes that even the best-performing agents struggle with the complex reasoning required for genuine scientific discovery. The low scores on end-to-end research tasks indicate that AI systems cannot yet replace human scientists in formulating hypotheses, designing experiments, and drawing scientific conclusions. The benchmarking suite will continue to evolve as new AI capabilities emerge, providing an ongoing measure of progress in scientific AI assistance.

This research is among the most comprehensive evaluations to date of AI systems for scientific research, offering both a sobering assessment of current limitations and clear pathways for future improvement. As AI continues to transform scientific practice, such rigorous benchmarking will be essential for understanding what these systems can, and cannot, reliably accomplish.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
