Artificial intelligence systems that answer questions by retrieving external knowledge often struggle with complex scientific inquiries that involve multiple related parts. For example, a question like 'What are ALB gene mutations and their related diseases?' requires identifying different mutations and linking each to specific diseases, demanding evidence from various sources and multi-step reasoning. Most existing retrieval-augmented generation (RAG) systems are designed for single-intent questions, where each query has one canonical answer, leading to incomplete evidence coverage when faced with such multi-faceted queries. This limitation is particularly problematic in scientific domains where comprehensive answers are crucial for research and decision-making.
To address this gap, researchers have introduced the Multi-Intent Scientific Question Answering (MuISQA) benchmark, a new dataset designed to evaluate how AI systems handle questions with multiple correlated intents. MuISQA covers five scientific domains—biology, chemistry, geography, medicine, and physics—with 500 questions, each annotated for diverse sub-intents and corresponding answers. Unlike previous benchmarks that focus on single answers, MuISQA assesses systems across three key dimensions: query formulation, passage retrieval, and answer generation. This provides a finer-grained diagnosis of performance, revealing where systems fall short in capturing the full scope of complex questions.
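To make the structure of such a benchmark concrete, here is a minimal sketch of what a multi-intent record and a coverage check might look like. The field names and the substring-based matcher are illustrative assumptions, not MuISQA's actual schema or scoring:

```python
# Hypothetical MuISQA-style record (field names are assumptions, not the
# benchmark's real schema): one question annotated with several correlated
# sub-intents, each carrying its own answer.
record = {
    "domain": "medicine",
    "question": "What are ALB gene mutations and their related diseases?",
    "sub_intents": [
        {"intent": "identify ALB mutations",
         "answer": "point mutations altering the albumin coding sequence"},
        {"intent": "link mutations to diseases",
         "answer": "analbuminemia, familial dysalbuminemic hyperthyroxinemia"},
    ],
}

def answer_coverage(predicted_answers, record):
    """Fraction of annotated sub-intent answers matched by any prediction.
    Uses naive substring matching; the benchmark's metric may differ."""
    gold = [s["answer"].lower() for s in record["sub_intents"]]
    hits = sum(
        any(g in p.lower() or p.lower() in g for p in predicted_answers)
        for g in gold
    )
    return hits / len(gold)
```

A system that answers only one facet (say, naming a disease but not the mutations) would score 0.5 here, which is exactly the kind of partial coverage a single-answer benchmark would miss.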
Building on this benchmark, the researchers developed an intent-aware retrieval framework that enhances both retrieval diversity and evidence coverage. The framework first uses large language models (LLMs) to hypothesize potential answers for a multi-intent question. For instance, for a question about Alzheimer's disease treatments, the LLM might generate hypothetical answers describing different drugs, their active molecules, and side effects. These answers are then decomposed into intent-specific queries, such as separate statements for each drug and its properties. Unlike traditional query-rewriting methods that create similar variants, this approach injects distinct hypothetical information into each query, broadening the search intent. The retrieved document chunks for each query are aggregated and re-ranked using Reciprocal Rank Fusion (RRF), an algorithm that balances complementary evidence while reducing redundancy from focusing on a single intent.
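The fusion step at the end of that pipeline is the standard RRF formula: each document's score is the sum over per-query rankings of 1/(k + rank). A minimal sketch (document IDs are illustrative, and the paper's surrounding pipeline is not reproduced here):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse per-query rankings with Reciprocal Rank Fusion.

    Each document accumulates 1 / (k + rank) from every intent-specific
    query's ranking it appears in, so chunks surfaced by several queries
    rise to the top while any single query's bias is dampened. k is the
    usual RRF smoothing constant (60 is the common default).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Three intent-specific queries produced three rankings; "d2" appears
# in all of them, so it wins the fused ranking.
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4"], ["d2", "d1"]])
```

Note how a chunk that is merely second or third in several lists can outrank a chunk that tops only one list, which is the redundancy-reducing behavior the article describes.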
Experiments on the MuISQA benchmark show that this framework outperforms conventional RAG approaches. According to Table 1, it achieves higher scores in Information Recall Rate (IRR), a metric measuring the coverage of retrieved evidence across subtopics, with an average IRR of 67.502 compared to 61.240 for naive RAG. In answer generation, it improves both Answer Accuracy (AA) and Answer Coverage (AC), with average gains of 2.5% and 3.0% over baselines. The framework also generalizes well to other datasets, such as HotpotQA and TriviaQA, where it enhances performance on multi-hop reasoning and knowledge-intensive tasks. For example, on HotpotQA, it achieves a 7.7% increase in Exact Match score and a 9.5% boost in F1 score over recent state-of-the-art methods, as shown in Table 2.
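To make the IRR metric less abstract, here is one simplified reading of it: the percentage of annotated subtopics covered by at least one retrieved chunk. The keyphrase-matching approach below is an assumption for illustration; the paper's actual scoring procedure may differ:

```python
def information_recall_rate(retrieved_chunks, subtopic_keyphrases):
    """Simplified Information Recall Rate (IRR) sketch.

    subtopic_keyphrases: one list of indicative phrases per subtopic.
    A subtopic counts as covered if any of its phrases appears in the
    concatenated retrieved text (substring matching is an assumption;
    the benchmark may use a stricter matcher). Returns a percentage.
    """
    text = " ".join(retrieved_chunks).lower()
    covered = sum(
        any(phrase.lower() in text for phrase in phrases)
        for phrases in subtopic_keyphrases
    )
    return 100.0 * covered / len(subtopic_keyphrases)

# Retrieval that surfaces evidence for one of two subtopics scores 50.0.
irr = information_recall_rate(
    ["Donepezil is a cholinesterase inhibitor used in Alzheimer's care."],
    [["donepezil"], ["memantine"]],
)
```

Under this reading, the reported gap (67.502 vs. 61.240) means the intent-aware framework's retrieved chunks touch noticeably more of each question's annotated subtopics than naive RAG's do.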
The implications of this research are significant for scientific research and beyond, as it enables AI systems to provide more comprehensive and accurate answers to complex questions. In fields like medicine, where questions often involve multiple factors—such as drug interactions, side effects, and mechanisms—this approach can help retrieve diverse evidence, reducing the risk of incomplete or biased information. The framework's ability to handle multi-intent queries also has potential applications in education, legal analysis, and technical support, where users frequently ask layered questions that require nuanced responses. By improving evidence coverage, the framework addresses a common pitfall in AI-assisted research, where systems might overlook alternative perspectives or complementary data.
However, the study acknowledges limitations, including the framework's dependence on the quality of LLM-generated hypotheses. Smaller models may produce less informative hypotheses, leading to weaker queries and reduced retrieval performance, as indicated in Table 4. Additionally, while the framework shows robustness to hyperparameter choices, such as the RRF smoothing constant, optimal performance requires careful tuning. The researchers also note that imperfect hypothetical generation, sometimes seen as AI hallucination, can occasionally facilitate retrieval by guiding searches toward target passages, but this effect is not fully predictable and may introduce errors in some cases. Future work could explore integrating more domain-specific knowledge or refining the decomposition process to further enhance accuracy and reliability.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn