
AI Struggles to Reason with Medical Evidence

A new study reveals that even when AI retrieves correct clinical guidelines, it often fails to use them properly, raising concerns for mental health applications.

AI Research
March 27, 2026
3 min read

Large language models are increasingly being integrated into healthcare, but a critical gap remains between their ability to access information and their capacity to reason with it correctly. This issue is particularly urgent in clinical settings, where decisions must adhere strictly to established protocols and errors can directly impact patient care. The study, which focuses on Written Exposure Therapy (WET) for post-traumatic stress disorder, demonstrates that simply providing AI with authoritative evidence does not guarantee it will follow therapeutic guidelines accurately, highlighting a fundamental challenge for safe AI deployment in sensitive domains.

The researchers developed CARE-RAG, a benchmark to evaluate how AI models handle clinical evidence under controlled conditions. They found that while many models could answer multiple-choice questions correctly when given relevant passages, their reasoning often lacked grounding in the retrieved context. For instance, in tests with 20 state-of-the-art models, including small, large, and fine-tuned variants, accuracy on multiple-choice questions was high, but inference scores—measuring whether models based their answers on the provided evidence—varied significantly. This indicates that models might produce correct outputs without truly understanding or applying the clinical guidelines, a disconnect that could lead to misinterpretations in real-world therapy scenarios.
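The paper's scoring pipeline is not reproduced in this article, but the idea behind an inference score can be illustrated with a minimal sketch: check whether a model's reasoning trace is entailed by the retrieved passage using an off-the-shelf NLI cross-encoder. The checkpoint, example strings, and scoring details below are assumptions for illustration, not the study's actual setup.

```python
# Minimal sketch: estimate how well a reasoning trace is grounded in the
# retrieved clinical passage via natural language inference (NLI).
# The checkpoint and example strings are illustrative, not from the study.
import numpy as np
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
# Documented label order for this checkpoint; verify before relying on it.
LABELS = ["contradiction", "entailment", "neutral"]

def inference_score(passage: str, reasoning_trace: str) -> float:
    """Probability that the passage entails the model's reasoning trace."""
    logits = nli.predict([(passage, reasoning_trace)])[0]
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[LABELS.index("entailment")])

passage = "The protocol instructs the therapist not to correct the patient's written narrative."
trace = "According to the guideline, the therapist should edit the narrative after each session."
print(inference_score(passage, trace))  # a low value flags reasoning that is not grounded
```

A high entailment probability does not by itself prove the answer is correct, which is why the benchmark reports answer accuracy and grounding separately.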

To assess this gap, the team created a dataset of clinician-validated questions derived from WET guidelines, including multiple-choice, yes/no, and open-ended formats. They manipulated the context provided to the models in three ways: relevant passages directly supporting the correct answer, relevant passages mixed with noisy distractors, and misleading passages that were plausible but incorrect. Using a FAISS-based retrieval system with cosine similarity, they ensured controlled access to evidence, allowing them to isolate reasoning fidelity from retrieval quality. The models were prompted to generate both answers and reasoning traces, which were then evaluated for accuracy and how well they incorporated the context, using automated scoring and human expert review.
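The article does not include the team's retrieval code, but a minimal sketch of cosine-similarity retrieval with FAISS, using normalized embeddings and an inner-product index, could look like the following. The embedding model and placeholder passages are assumptions, not the study's corpus.

```python
# Minimal sketch of FAISS retrieval with cosine similarity over guideline passages.
# Embedding model, passages, and query are placeholders; the benchmark's actual
# corpus, chunking, and retriever configuration may differ.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "Relevant guideline passage about how writing sessions are structured.",
    "Relevant guideline passage about therapist feedback on the narrative.",
    "Unrelated passage that serves as a noisy distractor.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(passages, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(emb)                  # unit-normalize so inner product equals cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])  # flat inner-product index
index.add(emb)

query = "Should the therapist give feedback on the patient's writing?"
q = encoder.encode([query], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(q)
scores, ids = index.search(q, k=2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {passages[i]}")
```

Swapping which passages are indexed (relevant only, relevant plus distractors, or misleading alternatives) is one straightforward way to mimic the three context conditions described above.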

The results, detailed in Table 2, show that models like Llama-3.1-8B-Instruct and Gemini-2.5-Pro achieved high inference scores, suggesting better context sensitivity, but no model performed perfectly across all reasoning levels. Figure 2 illustrates that accuracy on multiple-choice questions improved with higher entailment scores, indicating stronger evidence-based reasoning, while yes/no questions remained less consistent, revealing weaknesses in binary decision-making. Expert evaluation by psychologists and psychiatrists further confirmed that models often misinterpreted subtle instructions, such as whether to provide feedback on patient writing in therapy sessions, underscoring the complexity of clinical reasoning even with clear guidelines.

These findings have significant implications for the use of AI in healthcare, especially in mental health applications where fidelity to therapeutic protocols is crucial. The study suggests that current AI systems may not be ready for unsupervised deployment in clinical settings without additional safeguards. To mitigate risks, the researchers recommend incorporating prompt engineering, reasoning scaffolds, and guardrail mechanisms to ensure models adhere to evidence under uncertainty. Future work could expand the benchmark to other RAG architectures and explore approaches like Agentic Context Engineering to optimize context retrieval, but for now, the gap between retrieval and reasoning remains a key barrier to trustworthy AI in medicine.
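The authors do not prescribe a specific guardrail implementation. One way to picture a reasoning scaffold, though, is a prompt that forces the model to quote the retrieved evidence and abstain when that evidence is insufficient, paired with a simple post-hoc check. The template wording and the abstention convention below are assumptions, not the authors' design.

```python
# Illustrative reasoning scaffold and guardrail check; the prompt template and
# the "INSUFFICIENT EVIDENCE" convention are assumptions, not the study's method.
import re

SCAFFOLD_PROMPT = """You are assisting with questions about Written Exposure Therapy (WET).

Retrieved evidence:
{evidence}

Question:
{question}

Instructions:
1. Quote, in double quotes, the sentence(s) from the evidence that bear on the question.
2. If no quoted sentence answers the question, reply exactly: INSUFFICIENT EVIDENCE.
3. Otherwise, answer in one sentence, referring only to the quoted text.
"""

def build_prompt(evidence: str, question: str) -> str:
    return SCAFFOLD_PROMPT.format(evidence=evidence, question=question)

def guardrail_check(model_output: str, evidence: str) -> bool:
    """Accept an answer only if it abstains or quotes text that appears in the evidence."""
    if "INSUFFICIENT EVIDENCE" in model_output:
        return True
    quotes = re.findall(r'"([^"]+)"', model_output)
    return any(q in evidence for q in quotes)
```

Scaffolds like this do not close the underlying reasoning gap, but they make ungrounded answers easier to catch before they reach a clinician or patient.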

The study acknowledges several limitations, including reliance on LLM-as-judge scoring, which may introduce bias from self-evaluation artifacts, and a small pool of clinical experts for validation, potentially limiting interpretive diversity. Additionally, the benchmark currently focuses on WET, and its findings may not generalize to other clinical domains without further testing. The researchers plan to address these issues by expanding validation to multiple clinician reviewers and exploring ensemble evaluation methods, but these constraints highlight the need for cautious interpretation and ongoing refinement of AI evaluation frameworks.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn