Researchers increasingly rely on AI assistants for scientific discovery, but a new evaluation finds that these tools frequently fail at basic research tasks. The study systematically tested three popular AI models (GPT-4o, GPT-5, and Gemini-2.5-Flash) on core operations such as finding papers, extracting information, and verifying claims. The results call into question their reliability for serious academic work.
The key finding is a pattern of consistent failures across multiple research operations. When asked to retrieve multiple papers in a single query, the AI systems failed 48-98% of the time. For content extraction tasks, such as locating specific sentences or counting figures, failure rates reached 72-91%. In open-domain question answering about scientific topics, the models achieved precision scores below 0.32, meaning that most of what they returned was incorrect. They also missed over 60% of the relevant literature.
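To make the two question-answering figures concrete, here is a minimal sketch of how precision (the share of returned papers that are correct) and recall (the share of relevant papers that were found) are commonly computed. The function name and the paper identifiers are hypothetical placeholders, not taken from the study, and the study's own scoring procedure may differ.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a list of returned paper IDs.

    precision = correct results / all returned results
    recall    = correct results / all relevant papers
    """
    returned_set, relevant_set = set(returned), set(relevant)
    hits = returned_set & relevant_set
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the assistant returns 10 papers, 3 of them correct,
# while 8 relevant papers exist in total.
returned = [f"paper_{i}" for i in range(10)]
relevant = ["paper_0", "paper_1", "paper_2",
            "other_1", "other_2", "other_3", "other_4", "other_5"]

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision}, recall={recall}")
```

In this made-up case, 7 of the 10 returned papers are wrong and 5 of the 8 relevant papers are never found, which is the kind of outcome the reported numbers describe.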
The methodology prioritized real-world conditions by testing commercial AI assistants through their standard web interfaces, exactly as researchers would use them. The study's authors created PaperAsk, a benchmark that tests fundamental research tasks using real academic papers from diverse scientific fields, including physics, biology, chemistry, medicine, and computer science. The evaluation covered 840 test cases across four key operations: paper retrieval, content extraction, question answering, and claim verification.
Analysis of the results reveals systematic patterns in how AI assistants fail. When retrieving multiple papers, models either refused to answer or fabricated citations: when asked for 10 papers, 35% of the arXiv IDs that Gemini-2.5-Flash returned were fabricated. For content extraction, AI systems frequently returned information from semantically similar but incorrect sources rather than the target document. The models accessed full-text documents only 1-3% of the time, preferring to work with search snippets that often contained conflicting information from multiple sources.
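A researcher who wants to guard against fabricated citations could screen the returned identifiers before trusting them. The sketch below is one possible approach, not the study's method: it flags IDs that are malformed or absent from a trusted list. The regular expression covers only new-style arXiv identifiers (YYMM.NNNN or YYMM.NNNNN with an optional version suffix), and `known_ids` is an assumed, locally available set of verified IDs; all IDs shown are placeholders.

```python
import re

# Simplified pattern for new-style arXiv IDs, e.g. "2401.00001" or "2401.00001v2".
# Old-style IDs such as "hep-th/9901001" are deliberately not handled here.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def strip_version(arxiv_id):
    """Drop a trailing version suffix such as 'v2'."""
    return re.sub(r"v\d+$", "", arxiv_id)


def fabrication_rate(returned_ids, known_ids):
    """Fraction of returned IDs that are malformed or not in known_ids."""
    if not returned_ids:
        return 0.0
    known = {strip_version(i) for i in known_ids}
    suspicious = [
        i for i in returned_ids
        if not ARXIV_ID.match(i) or strip_version(i) not in known
    ]
    return len(suspicious) / len(returned_ids)


# Placeholder data: two verified IDs, and four IDs returned by an assistant.
known = {"2401.00001", "2401.00002"}
returned = ["2401.00001", "2401.00002v2", "2401.99999", "24O1.0003"]

rate = fabrication_rate(returned, known)
print(f"suspicious share: {rate:.0%}")
```

A well-formed ID that is missing from the trusted list is not necessarily fabricated; it may simply be absent from that list, so flagged entries still need a manual check against arXiv itself.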
These failures have immediate implications for researchers using AI tools. The tendency to expand searches beyond requested constraints introduces noise that compromises accuracy. In verification tasks, AI systems achieved only 12-18% success rates when working with standard web interfaces, compared to 88-92% in controlled environments with full document access. This suggests current commercial implementations prioritize responsiveness over accuracy, creating reliability gaps that could mislead scientific work.
The study identifies several limitations in current AI research assistants. The architectural trade-offs between speed and accuracy mean these systems often reach premature conclusions without thoroughly examining documents. Additionally, their tendency to prioritize semantically relevant content over task-specific instructions makes them vulnerable to manipulation through poisoned text in publicly editable sources. The evaluation also found no correlation between failure rates and scientific domain, indicating that these are fundamental limitations rather than field-specific issues.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.