A new study demonstrates that large language models (LLMs) can assist in checking whether scientific papers follow reporting guidelines, potentially easing the burden on peer reviewers. The research focused on the PRISMA 2020 statement, a widely used checklist that ensures transparent and complete reporting of systematic reviews in medicine. Adherence checks are typically performed manually by reviewers and editors, adding to an already heavy workload, and the study explores how AI can automate part of this process without replacing human judgment. The findings reveal that providing structured checklists to LLMs significantly improves their ability to assess adherence, though performance varies across models and still requires expert verification for accuracy.
Researchers found that supplying the PRISMA 2020 checklist in structured formats such as Markdown, JSON, XML, or plain text boosted LLM accuracy to around 79%, compared with only 45% when no checklist was provided. The improvement was consistent across models, with no significant differences between the structured formats themselves. The study evaluated ten LLMs, including GPT-5, GPT-4o, and Claude Opus 4.1, on a dataset of 108 Creative Commons-licensed systematic reviews from emergency medicine and rehabilitation. The results, detailed in Figure 2 of the paper, show that structured input led to higher sensitivity and specificity, making the models more reliable at identifying both reported and missing items in the reviews.
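To make the structured-input idea concrete, here is a minimal sketch of how a checklist excerpt might be embedded in a prompt as JSON. The two items and the prompt wording are illustrative assumptions; the paper's full checklists cover all 27 PRISMA 2020 items, and its exact prompts are not reproduced here.

```python
import json

# Hypothetical excerpt of the PRISMA 2020 checklist in JSON form;
# the study's structured checklists cover all 27 items.
checklist = {
    "items": [
        {"id": "1", "section": "Title",
         "description": "Identify the report as a systematic review."},
        {"id": "27", "section": "Availability of data, code and other materials",
         "description": "Report which data, code and other materials are "
                        "publicly available and where they can be found."},
    ]
}

def build_prompt(manuscript_text: str) -> str:
    """Pair the structured checklist with the manuscript and ask for a
    binary decision plus a short rationale per item (wording illustrative)."""
    return (
        "You are checking a systematic review against PRISMA 2020.\n"
        "For each checklist item, answer 'reported' or 'not reported' "
        "and quote the supporting passage.\n\n"
        f"CHECKLIST (JSON):\n{json.dumps(checklist, indent=2)}\n\n"
        f"MANUSCRIPT:\n{manuscript_text}"
    )
```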
The methodology involved a multi-phase approach, starting with parameter optimization on five randomly sampled reviews to fix consistent evaluation settings. Key steps included converting full-text PDFs into structured text, identifying openly licensed articles for a shareable benchmark, and testing different prompt arrangements and token budgets. For instance, Claude Opus 4.1 was run with a 28,000-token thinking budget to balance accuracy and cost. The researchers then compared five input formats across ten LLMs on ten additional reviews, locking parameters to ensure reproducibility. In the validation phase, they applied the same methods to an independent dataset of ten reviews from a different clinical domain, adding newer models such as Gemini 3 Pro and GPT-5.1 to assess generalizability.
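As one example of a locked-parameter evaluation call, the sketch below runs a single adherence check through the Anthropic SDK with a fixed 28,000-token thinking budget. The model alias and token limits are assumptions for illustration, not the authors' exact configuration.

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def check_adherence(prompt: str) -> str:
    """One evaluation call with a fixed thinking budget, mirroring the
    locked-parameter setup the paper describes (details assumed here)."""
    response = client.messages.create(
        model="claude-opus-4-1",        # assumed model alias
        max_tokens=32_000,              # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 28_000},
        messages=[{"role": "user", "content": prompt}],
    )
    # Return only the final text blocks, skipping the thinking blocks.
    return "".join(b.text for b in response.content if b.type == "text")
```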
Analysis of the results showed that model performance varied widely, with accuracy ranging from 70.6% to 82.8% in the development cohort. High-sensitivity models like GPT-4o reached up to 97.3% sensitivity but only about 33.9% specificity, meaning they often flagged items as missing when they were actually reported. In contrast, models like GPT-5 offered a more balanced profile, with both sensitivity and specificity above 75%. The validation phase confirmed these patterns, with Grok-4 Fast reaching 83% accuracy. Item-level error analysis on 120 reviews using Qwen3-Max revealed systematic misclassifications: false negatives were common for items such as data availability and funding sources, while false positives occurred across methodological details. For example, as shown in Table 4a, the data availability item had an 86.2% false negative rate, indicating that models frequently missed this information.
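To ground these numbers, here is a minimal sketch of how accuracy, sensitivity, specificity, and per-item false-negative rates can be computed from binary labels. The positive-class convention (True meaning "reported") is an assumption; the paper's own definition should be checked before comparing values.

```python
from collections import defaultdict

def confusion_metrics(records):
    """records: (item_id, reference, prediction) triples, where True
    means 'reported' (an assumed convention, not the paper's)."""
    tp = tn = fp = fn = 0
    per_item = defaultdict(lambda: {"fn": 0, "pos": 0})
    for item_id, ref, pred in records:
        if ref and pred:
            tp += 1
        elif ref and not pred:
            fn += 1
            per_item[item_id]["fn"] += 1
        elif not ref and pred:
            fp += 1
        else:
            tn += 1
        if ref:
            per_item[item_id]["pos"] += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    # Per-item false-negative rate: misses among truly reported items.
    fn_rates = {k: v["fn"] / v["pos"] for k, v in per_item.items() if v["pos"]}
    return accuracy, sensitivity, specificity, fn_rates
```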
The implications of this research are practical for the scientific community: LLMs could be integrated into peer-review workflows as a screening tool to reduce manual effort. Using a high-sensitivity model, reviewers can quickly surface potentially missing items and then focus human attention on verifying them, especially in error-prone areas such as metadata. The study notes that any such AI assistance should be disclosed by all parties in the review process, following recommendations from the World Association of Medical Editors. However, the authors caution that current performance is insufficient for fully automated evaluation, and human experts remain essential to check flagged items before editorial decisions. This approach could streamline review while maintaining scientific rigor, since the models provide structured rationales alongside their binary decisions.
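A screening workflow along these lines might look like the sketch below: everything the model flags as missing, plus known error-prone metadata items, is routed to a human reviewer. The item IDs and routing policy are illustrative assumptions, not the paper's workflow.

```python
def triage_for_human_review(item_decisions, error_prone_items=("25", "27")):
    """item_decisions: {item_id: True if the model says 'reported'}.
    Flagged items and error-prone metadata items (hypothetically,
    PRISMA items 25 'funding' and 27 'data availability') go to a
    human; the rest can be spot-checked."""
    needs_human, spot_check = [], []
    for item_id, reported in item_decisions.items():
        if (not reported) or item_id in error_prone_items:
            needs_human.append(item_id)
        else:
            spot_check.append(item_id)
    return needs_human, spot_check
```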
Limitations of the study include its focus on only two medical specialties, which may limit generalizability to other fields. The human reference labels used for comparison may themselves contain inconsistencies, potentially affecting the reported performance metrics, and LLM performance is still imperfect, with false positives and false negatives requiring expert oversight. Future work should explore ensemble methods and advanced prompting strategies, and the public benchmark created in this study enables further evaluation of emerging models. Despite these limitations, the findings offer a step toward more efficient scientific reporting, balancing AI automation with necessary human verification.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.