When patients turn to AI for medical advice, they might expect reliable information, especially in areas like women's health where guidelines evolve and personal factors matter. However, a new study shows that current large language models (LLMs) fall short in this critical domain, with no model achieving a high score on a specialized benchmark designed to test clinical accuracy and safety. The research introduces the Women's Health Benchmark (WHBench), which evaluates 22 AI models across 47 expert-crafted scenarios, revealing significant gaps in performance that could impact real-world patient care.
The key finding from the study is that even the top-performing AI model, Claude Opus 4.6, reached a mean score of only 72.1% on WHBench, with a 95% confidence interval of 69.6% to 74.4%. This score is normalized from a rubric that assesses responses across eight clinical dimensions, including safety and equity. More concerning, only 35.5% of its responses were classified as fully correct, meaning they met an 80% correctness threshold. Other leading models, such as Claude Sonnet 4.6 and GPT-5.4, scored 67.1% and 66.8% respectively, with correct rates of 22.7% and 21.3%. These results indicate a capability ceiling not visible in simpler benchmarks: the remaining frontier models clustered tightly in the low to mid-60s, and performance dropped sharply for lower-tier models, with the bottom seven averaging 38.6%.
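To make these headline numbers concrete, here is a minimal sketch of how a mean score, a bootstrap 95% confidence interval, and a "fully correct" rate under an 80% threshold could be computed from per-question scores. The scores below are simulated placeholders, and the paper does not specify its CI method, so the bootstrap here is an illustrative assumption.

```python
import random

# Hypothetical per-question normalized scores (0-100) for one model;
# WHBench has 47 scenarios, so we simulate 47 values for illustration.
random.seed(0)
scores = [random.uniform(40, 100) for _ in range(47)]

def summarize(scores, threshold=80.0, n_boot=2000):
    """Mean score, a bootstrap 95% CI for the mean, and the share of
    'fully correct' answers (those at or above the 80% threshold)."""
    mean = sum(scores) / len(scores)
    boots = []
    for _ in range(n_boot):
        # Resample with replacement and record the resampled mean.
        sample = [random.choice(scores) for _ in scores]
        boots.append(sum(sample) / len(sample))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    correct_rate = sum(s >= threshold for s in scores) / len(scores)
    return mean, (lo, hi), correct_rate

mean, ci, correct = summarize(scores)
print(f"mean={mean:.1f}%  95% CI=({ci[0]:.1f}, {ci[1]:.1f})  fully correct={correct:.1%}")
```

On real per-question scores, this would reproduce the shape of the reported figures: a mean with its interval, plus the stricter fully-correct fraction.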
The methodology behind WHBench involved designing 47 open-ended questions across 10 women's health topics, such as fertility, pregnancy, and contraception, each targeting specific failure modes like outdated guidelines or health equity gaps. These scenarios were developed with input from board-certified clinicians, including OB/GYN specialists and fertility nurses, to ensure clinical realism. Reference answers were authored independently by 4-6 experts per question, averaging 94 words, and models were evaluated in a zero-shot, closed-book setting using a standardized system prompt that instructed them to respond as a board-certified physician. A 23-criterion rubric across eight categories was applied, with asymmetric penalties that weigh safety failures more heavily, and responses were scored by AI judges (Claude Sonnet 4.6 as primary and GPT-5.4 as secondary) with a 'default to fail' philosophy.
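The two ideas worth pinning down here are the asymmetric penalties and the 'default to fail' policy. A minimal sketch, assuming illustrative category weights (the paper does not publish the exact weighting) and treating an unverifiable criterion as a failure:

```python
# Hypothetical weights: the study penalizes safety failures more heavily,
# but the exact weights are not published, so these are illustrative only.
CATEGORY_WEIGHTS = {"safety": 3.0, "completeness": 1.0, "equity": 1.0}

def score_response(criteria_results, weights=CATEGORY_WEIGHTS):
    """Apply a pass/fail rubric with asymmetric penalties and a
    'default to fail' policy: any criterion the judge could not
    verify (None) earns no credit, exactly like an explicit fail.

    criteria_results: list of (category, passed) with passed in
    {True, False, None}. Returns a normalized score in [0, 100]."""
    earned, total = 0.0, 0.0
    for category, passed in criteria_results:
        w = weights.get(category, 1.0)
        total += w
        if passed is True:  # only explicit passes earn credit
            earned += w
    return 100.0 * earned / total if total else 0.0
```

Because safety carries more weight, a single safety miss costs more than a completeness miss, which is the behavior the asymmetric-penalty design is meant to produce.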
Analysis of the results shows uneven performance across clinical dimensions. In terms of safety, harm rates varied widely, from 12.8% for Claude Opus 4.6 to 90.8% for Gemini 2.5 Pro; criteria like urgency recognition scored high across models, while contraindication awareness was more variable. Completeness was a major gap, with models often omitting follow-up timelines and monitoring plans; for example, pass rates on criterion B7 (follow-up monitoring) ranged from 0.0% to 65.2%. Equity emerged as a universal blind spot: pass rates for social determinants of health (criterion F18a) were between 0.7% and 19.1% across all models, while inclusive language (F18b) was much higher at 78.0% to 92.9%. Topic-level patterns revealed contraception as the most challenging area, with hormonal health showing the largest cross-model variance.
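Per-criterion pass rates of this kind come from aggregating the judges' pass/fail verdicts by model and criterion. A small sketch of that aggregation (the model and criterion names in the example are placeholders, not the study's data):

```python
from collections import defaultdict

def pass_rates(judgments):
    """judgments: iterable of (model, criterion, passed) tuples.
    Returns {criterion: {model: pass_rate}}, so per-model gaps on a
    criterion such as F18a (social determinants) stand out directly."""
    # counts[criterion][model] = [passes, total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, criterion, passed in judgments:
        counts[criterion][model][1] += 1
        if passed:
            counts[criterion][model][0] += 1
    return {crit: {m: p / n for m, (p, n) in models.items()}
            for crit, models in counts.items()}
```

Running this over the full judgment log would yield the per-criterion ranges the article quotes, such as the 0.0%-65.2% spread on follow-up monitoring.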
The implications of these findings are significant for real-world applications, as AI models are increasingly consulted for health information. The study highlights that high scores on exam-style benchmarks like MedQA do not translate to open-ended clinical counseling, with WHBench scores for comparable GPT models being materially lower. This gap underscores the need for clinician review and correction, as even top systems provide fully correct advice in only a fraction of cases. The persistent weakness in equity, where models fail to incorporate social determinants into recommendations, could exacerbate health disparities if AI advice reaches diverse populations without proper oversight.
Limitations of the benchmark include sparse representation in some areas like bone health and mental health, which limits per-topic precision. Judge agreement was moderate at the final label level with a kappa of 0.238, though higher for category-level structure at 0.538, and weaker on subjective dimensions like equity at 0.153, indicating room for sharper criterion operationalization. The benchmark is currently English-only and relies on AI judging with expert references rather than full clinician adjudication. Future work should expand question volume, increase coverage of underrepresented topics, add multilingual evaluation, and include prospective clinician scoring to address these gaps.
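The judge-agreement figures cited above (0.238 at the final-label level, 0.538 at the category level, 0.153 on equity) are Cohen's kappa values, which measure agreement between two raters beyond what chance would produce. A self-contained implementation of the statistic, using made-up labels for illustration:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two judges' categorical labels:
    (observed agreement - chance agreement) / (1 - chance agreement).
    Kappa of 1.0 is perfect agreement; 0.0 is chance-level."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed: fraction of items both judges labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected: agreement if each judge labeled independently at
    # their own marginal rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 0.238, as reported for the final pass/fail labels, is conventionally read as only fair agreement, which is why the authors flag the criteria as needing sharper operationalization.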
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.