AIResearch
Science

AI Helps Scientists Review Research Without Losing Trust

A new framework combines AI speed with human oversight to make systematic reviews faster and more reliable, addressing concerns about AI hallucinations and bias in scientific literature analysis.

AI Research
March 27, 2026
4 min read

Systematic reviews are the backbone of scientific progress, synthesizing vast amounts of research to guide future studies and inform decisions. However, as the volume of published papers grows exponentially, the traditional methodology for conducting these reviews, the PRISMA framework, struggles to keep up, often forcing researchers to limit their searches and risk missing crucial studies. A new approach, called L-PRISMA, integrates generative AI (GenAI) tools such as large language models (LLMs) to automate parts of the process, promising to make reviews more comprehensive and efficient while preserving the transparency and rigor that science demands.

Researchers have developed L-PRISMA as an extension of the existing PRISMA framework, specifically designed to address issues of scale and manual effort in systematic reviews. The key finding is that by adding a statistical pre-screening phase powered by AI, the framework can filter thousands of records to identify relevant studies without compromising reproducibility. This hybrid approach ensures that human reviewers focus on the most pertinent literature while AI handles the bulk of initial screening, as detailed in the paper's proposed flow diagram (Figure 1). The framework aims to overcome the non-deterministic nature of LLMs, which can produce varying outputs or hallucinations, by incorporating deterministic statistical techniques to maintain consistency and auditability.

The methodology behind L-PRISMA involves three main updates to the standard PRISMA process. First, a pre-screening phase uses semantic filtering: researchers define a statement of intent for their review and calculate similarity scores between this statement and the titles and abstracts of retrieved records. For example, in a use case described in the paper, the intent statement focused on measuring textual similarity in educational assessments, and the S-BERT model was used to compute cosine similarity scores. The distribution of scores is then modeled as a Gaussian mixture, statistically separating highly relevant from weakly relevant articles, as formalized in Equation (1) and illustrated in Figure 2.
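The pre-screening idea can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: TF-IDF cosine similarity replaces the S-BERT embeddings the authors use, the example records and intent statement are invented, and the mixture fit simply follows scikit-learn defaults.

```python
# Sketch of semantic pre-screening: score each record against an intent
# statement, then fit a two-component Gaussian mixture to the scores so
# the "highly relevant" and "weakly relevant" groups separate statistically.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.mixture import GaussianMixture

intent = "measuring textual similarity in educational assessments"
records = [  # hypothetical titles/abstracts
    "semantic similarity of student answers in automated grading",
    "text similarity metrics for short-answer assessment",
    "deep learning for protein structure prediction",
    "a survey of wireless sensor network routing protocols",
]

# Cosine similarity between the intent statement and each record
# (TF-IDF here; the paper uses S-BERT sentence embeddings instead).
vec = TfidfVectorizer().fit([intent] + records)
scores = cosine_similarity(vec.transform([intent]), vec.transform(records))[0]

# Model the score distribution as a mixture of two Gaussians: one
# component for weakly relevant records, one for highly relevant ones.
gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(scores.reshape(-1, 1))
relevant_component = int(np.argmax(gmm.means_))  # higher-mean component

for text, s, lab in zip(records, scores, labels):
    flag = "relevant" if lab == relevant_component else "weak"
    print(f"{s:.2f}  {flag:8s}  {text}")
```

In a real review the mixture's component boundaries would supply the thresholds used in the next phase, rather than a hand-picked cutoff.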

Second, during the screening phase, records are divided based on statistical thresholds: those with high similarity scores undergo manual review, while those with lower scores are processed by GenAI with structured prompts. In the use case, searching databases such as IEEE and ACM yielded 1,303 records without domain constraints; after pre-screening, 60 records were flagged for manual screening and 989 for AI-assisted screening, with 182 excluded. This division, as shown in Table II and the accompanying analysis, demonstrates how the framework broadens search scope while keeping the workload manageable. The results indicate that L-PRISMA can handle large datasets efficiently, reducing the risk of excluding relevant studies that restrictive keyword searches might miss.
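The routing step described above amounts to partitioning scored records by two cutoffs. A minimal sketch, with made-up record IDs, scores, and threshold values (the paper derives its thresholds from the fitted score distribution rather than fixing them by hand):

```python
# Route pre-screened records into three queues by similarity score:
# high scores go to human reviewers, mid-range scores to GenAI-assisted
# screening with structured prompts, and the rest are excluded.
def split_records(scored, manual_threshold, exclude_threshold):
    """scored: iterable of (record_id, similarity_score) pairs."""
    manual, ai_assisted, excluded = [], [], []
    for record_id, score in scored:
        if score >= manual_threshold:
            manual.append(record_id)       # high similarity: manual review
        elif score >= exclude_threshold:
            ai_assisted.append(record_id)  # mid range: AI-assisted screening
        else:
            excluded.append(record_id)     # low similarity: excluded
    return manual, ai_assisted, excluded

scored = [("r1", 0.91), ("r2", 0.48), ("r3", 0.10)]
print(split_records(scored, manual_threshold=0.8, exclude_threshold=0.2))
# → (['r1'], ['r2'], ['r3'])
```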

Third, the included phase reports studies reviewed both manually and by GenAI, ensuring transparency by documenting the specific LLMs and prompts used. The paper emphasizes that AI outputs are not perfect and must be human-moderated to check for consistency and avoid errors such as hallucinations or bias amplification. This structured approach addresses core issues highlighted in Table I, such as reproducibility concerns and the risk of over-reliance on AI, by maintaining human oversight as the cornerstone of the review process. The framework's design aims to enhance methodological robustness without requiring deep technical expertise from researchers, aligning with PRISMA's goal of accessibility.
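One way to make that documentation concrete is an audit-log entry per screening decision. The field names below are hypothetical, not from the paper; the point is that recording the model, the exact prompt, and whether a human moderated the output keeps every decision auditable.

```python
# Minimal audit-record sketch for the included phase: each screening
# decision carries the reviewer (human or LLM identifier), the exact
# prompt used (None for manual review), and a moderation flag.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ScreeningDecision:
    record_id: str
    reviewer: str            # "human" or an LLM identifier, e.g. "gpt-4"
    prompt: Optional[str]    # verbatim prompt for AI screening, else None
    decision: str            # "include" or "exclude"
    moderated: bool          # True once a human has checked the AI output

entry = ScreeningDecision(
    record_id="r57",
    reviewer="gpt-4",
    prompt="Does this abstract match the review's statement of intent?",
    decision="include",
    moderated=True,
)
print(asdict(entry))  # serializable for the review's supplementary material
```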

The implications of L-PRISMA are significant for the scientific community, as it offers a pathway to accelerate evidence synthesis without sacrificing quality. By automating time-consuming tasks like screening and data extraction, researchers can conduct more comprehensive reviews, potentially uncovering overlooked studies and reducing the cognitive burden associated with manual processes. This could lead to more reliable survey papers that better inform research directions and policy decisions, especially in fast-moving fields where literature accumulates rapidly. The framework's emphasis on transparency and auditability also helps mitigate ethical concerns, such as authorship ambiguity and bias, which are critical to maintaining public trust in science.

However, the paper acknowledges limitations that must be addressed for broader adoption. The use case provided is a single example, and future research needs to apply L-PRISMA across different domains to validate its effectiveness and refine the statistical methods. Additionally, the framework relies on current GenAI capabilities, which may evolve, requiring ongoing updates to adapt to new models and techniques. There is also a need for further investigation into domain-specific approaches to enhance reliability, as the paper notes that tools like SyRF or AMSTAR-2 are often limited to specific fields such as medicine. These limitations highlight the importance of continued human involvement and methodological vigilance to ensure that AI integration does not undermine scientific integrity.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
