Artificial intelligence systems that retrieve information to answer questions often stumble when faced with queries that require analyzing entire datasets, such as identifying trends across thousands of documents. A new study reveals that even the strongest existing methods reach an F1 score of only 1.51 on these global reasoning tasks, highlighting a critical gap in AI capabilities for applications like scientific research and data analysis.
The researchers discovered that current retrieval-augmented generation (RAG) systems, designed to reduce errors in large language models (LLMs), excel at localized queries, such as finding a specific fact in a document, but fail dramatically on global questions. These include counting entities, finding extremes, ranking items, or extracting top results from a corpus. For instance, asking "Which author has the most publications in artificial intelligence?" requires aggregating data across many sources, a task where top AI methods score poorly.
To address this, the team developed GlobalQA, the first benchmark for evaluating global RAG abilities. It contains 13,000 questions based on 2,000 resumes across 23 domains, ensuring tasks demand corpus-wide analysis. The benchmark covers four core types: counting (16.7% of tasks), extremum queries like max/min (33.9%), sorting (16.3%), and top-k extraction (33.9%). Unlike existing datasets that focus on single documents or small chains, GlobalQA requires traversing up to 50 documents per query, with over 42% of questions needing more than 20 documents. This setup mimics real-world scenarios where answers depend on parallel processing of dispersed data.
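To make the four task types concrete, here is a minimal Python sketch over a toy corpus. The records, field names, and helper functions are all hypothetical and are not drawn from GlobalQA itself; the point is only that each answer depends on every document, not on a single retrieved passage.

```python
# Hypothetical toy corpus: each record stands in for one resume-like document.
corpus = [
    {"name": "Ana", "domain": "finance", "years": 7},
    {"name": "Ben", "domain": "engineering", "years": 12},
    {"name": "Caio", "domain": "finance", "years": 3},
    {"name": "Dina", "domain": "healthcare", "years": 9},
    {"name": "Eli", "domain": "engineering", "years": 5},
]


def count_in_domain(docs, domain):
    """Counting: how many documents belong to a given domain?"""
    return sum(1 for d in docs if d["domain"] == domain)


def most_experienced(docs):
    """Extremum: which person has the maximum years of experience?"""
    return max(docs, key=lambda d: d["years"])["name"]


def sorted_by_experience(docs):
    """Sorting: names ordered from most to least experienced."""
    ranked = sorted(docs, key=lambda d: d["years"], reverse=True)
    return [d["name"] for d in ranked]


def top_k_experienced(docs, k):
    """Top-k extraction: the k most experienced people."""
    return sorted_by_experience(docs)[:k]


print(count_in_domain(corpus, "finance"))
print(most_experienced(corpus))
print(sorted_by_experience(corpus))
print(top_k_experienced(corpus, 2))
```

Missing even one record changes any of these answers, which is why a retriever that returns only a handful of passages cannot answer such questions reliably.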
In experiments, the strongest baseline methods, including StandardRAG and graph-based approaches like HyperGraphRAG, achieved an F1 score of only 1.51 on average. The researchers identified three main issues: chunking documents into fixed-size segments breaks document integrity, retrievers bring in noisy, irrelevant data, and LLMs struggle with computations like comparisons and sorting. For example, splitting a document can separate metadata from content, leading to double-counting or omissions in results.
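The double-counting problem can be illustrated with a small Python sketch. The documents, the six-word chunk size, and the keyword match are assumptions chosen for illustration; they are not the study's actual chunking setup.

```python
# Hypothetical documents; the text is invented for illustration.
docs = {
    "resume_1": "Name: Ana. Skills: Python, SQL. Projects: built a Python ETL pipeline.",
    "resume_2": "Name: Ben. Skills: Java. Projects: built a Java payment gateway.",
    "resume_3": "Name: Caio. Skills: Python. Projects: none listed.",
}


def chunk_words(text, size):
    """Split text into fixed-size chunks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Flatten the corpus into (doc_id, chunk) pairs, as a chunk-level index would.
chunks = [
    (doc_id, piece)
    for doc_id, text in docs.items()
    for piece in chunk_words(text, 6)
]

# Naive chunk-level count: every matching chunk is treated as a separate hit.
naive_count = sum(1 for _, piece in chunks if "Python" in piece)

# Document-level count: matching chunks are mapped back to unique documents.
doc_level_count = len({doc_id for doc_id, piece in chunks if "Python" in piece})

print(naive_count, doc_level_count)
```

Here resume_1 mentions Python in two different chunks, so counting chunks overstates the number of people with that skill, while counting unique documents does not.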
The proposed solution, GlobalRAG, integrates document-level retrieval to preserve coherence, an LLM-driven filter to remove noise, and specialized tools for tasks like counting and sorting. Using Qwen2.5-14B as the backbone, this method raised the F1 score to 6.63, a 5.12-point improvement over the best baseline. Ablation studies showed that removing any component, such as the filter or tools, caused performance drops of up to 94.5%, underscoring their necessity. Cross-dataset tests confirmed that methods effective on local tasks collapsed on GlobalQA, with accuracy declines of 95–99%, indicating that global reasoning is fundamentally different.
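The three components can be sketched as a simple pipeline in Python. This is not the authors' implementation: the data, the function names, and especially the filter (an exact topic match standing in for an LLM relevance judgment) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical corpus of publication records.
documents = {
    "d1": {"author": "Silva", "topic": "artificial intelligence"},
    "d2": {"author": "Silva", "topic": "artificial intelligence"},
    "d3": {"author": "Chen", "topic": "artificial intelligence"},
    "d4": {"author": "Chen", "topic": "marine biology"},
    "d5": {"author": "Chen", "topic": "marine biology"},
}

# Hypothetical retriever output: (doc_id, chunk_index) hits, including noise.
chunk_hits = [("d1", 0), ("d1", 2), ("d2", 0), ("d3", 1), ("d4", 0), ("d5", 3)]


def to_documents(hits, docs):
    """Document-level retrieval: map chunk hits back to unique whole documents."""
    unique_ids = dict.fromkeys(doc_id for doc_id, _ in hits)  # keeps first-seen order
    return [docs[doc_id] for doc_id in unique_ids]


def keep_relevant(docs, topic):
    """Filter step: a topic match stands in for the LLM-driven relevance filter."""
    return [d for d in docs if d["topic"] == topic]


def most_frequent_author(docs):
    """Tool step: deterministic aggregation instead of asking the LLM to count."""
    return Counter(d["author"] for d in docs).most_common(1)[0][0]


candidates = to_documents(chunk_hits, documents)
filtered = keep_relevant(candidates, "artificial intelligence")

print(most_frequent_author(filtered))    # answer with the filter applied
print(most_frequent_author(candidates))  # answer if noisy documents are kept
```

In this toy case, skipping the filter lets off-topic documents change the winner, which mirrors the article's point that each stage contributes to the final answer.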
This work matters because it moves AI closer to handling complex data analysis in fields like academia, where researchers might need to identify top-cited papers, or in business, where analysts summarize trends from large reports. By improving accuracy on corpus-level queries, such systems could support more reliable decision-making without manual data sifting.
However, limitations remain. The study notes that performance depends on retriever quality and on the number of documents involved, and that errors can arise from incomplete filtering or LLM inconsistencies. Future work should explore scaling to larger corpora and refining tool integration to handle diverse data types.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.