
AI Models Fail at Medical Diagnosis

AI models fail at medical diagnosis, struggling with basic image-analysis tasks that human pathologists handle routinely and raising safety concerns

AI Research
November 14, 2025
3 min read

Artificial intelligence systems that promised to revolutionize medical diagnosis are falling short in real-world testing. New research reveals that foundation models—the same technology behind advanced language systems like ChatGPT—struggle with basic medical image analysis tasks that human pathologists perform routinely, raising serious questions about their immediate clinical utility.

Researchers from Mayo Clinic's Kimia Lab systematically evaluated multiple foundation models on thousands of medical images and found they consistently fell short of expectations. The models showed poor accuracy, limited generalization across different medical centers, and concerning vulnerability to the minor image variations that commonly occur in clinical settings.

The investigation used rigorous testing methods across multiple institutions and datasets. Researchers employed rotation tests to check geometric stability, measuring how well models maintained performance when images were rotated at different angles. They also developed a Robustness Index to quantify whether models clustered images by biological characteristics rather than by scanner or hospital origin. The team analyzed over 100,000 biopsy slides from 7,342 patients across 15 medical sites in 11 countries, comparing foundation models against traditional task-specific approaches.
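
The rotation test described above can be sketched as follows. This is a minimal illustration of the idea, not the study's code: `embed` stands in for any model that maps an image to a feature vector, and the toy embedding at the end (per-channel mean intensity) is a hypothetical stand-in chosen because it is rotation-invariant by construction.

```python
import numpy as np

def rotation_stability(embed, image):
    """Mean cosine similarity between an image's embedding and the
    embeddings of its 90/180/270-degree rotations (1.0 = fully invariant)."""
    base = embed(image)
    base = base / np.linalg.norm(base)
    sims = []
    for angle in (90, 180, 270):
        rotated = np.rot90(image, k=angle // 90)  # rotate in the image plane
        vec = embed(rotated)
        vec = vec / np.linalg.norm(vec)
        sims.append(float(base @ vec))
    return float(np.mean(sims))

# Toy stand-in "model": mean intensity per channel. Rotation leaves the
# per-channel mean unchanged, so the score should be very close to 1.0.
toy_embed = lambda img: img.mean(axis=(0, 1))
img = np.random.default_rng(0).random((224, 224, 3))
score = rotation_stability(toy_embed, img)
```

A real foundation model's embedding would replace `toy_embed`; a score well below 1.0 would indicate the geometric instability the study reports.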

The results revealed stark limitations. Models achieved only 40-42% accuracy on cancer subtype classification, with performance varying dramatically across organs—from 68% accuracy on kidney images to just 21% on lung tissue. Rotation tests showed most models lacked geometric stability, with performance dropping significantly when images were rotated. Only one model, PathDino, achieved reasonable rotation invariance with a score of 0.85, while others scored as low as 0.016. The Robustness Index analysis found that most models grouped images primarily by hospital or scanner type rather than biological characteristics, indicating they learned site-specific artifacts rather than meaningful medical patterns.
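
The site-versus-biology clustering finding can be illustrated with a simple probe. The study's exact Robustness Index formula is not reproduced here; this sketch uses leave-one-out 1-nearest-neighbor accuracy as an assumed stand-in: if embeddings are dominated by scanner or hospital artifacts, a nearest-neighbor classifier predicts the site of origin better than the tissue type. All names and the synthetic data below are illustrative.

```python
import numpy as np

def knn1_accuracy(X, labels):
    """Leave-one-out 1-nearest-neighbor accuracy of `labels` from embeddings `X`."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d, np.inf)                          # exclude self-matches
    nearest = d.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

def robustness_ratio(X, tissue_labels, site_labels):
    """> 1 means embeddings track biology more strongly than site of origin."""
    return knn1_accuracy(X, tissue_labels) / knn1_accuracy(X, site_labels)

# Synthetic demo: embeddings that genuinely separate by tissue type,
# with site labels interleaved at random relative to the geometry.
rng = np.random.default_rng(0)
tissue = np.repeat([0, 1], 20)   # two tissue types
site = np.tile([0, 1], 20)       # two hospitals
X = np.column_stack([tissue * 10.0, np.zeros(40)]) + rng.normal(0, 0.1, (40, 2))
ratio = robustness_ratio(X, tissue, site)
```

For the biologically grounded embeddings in the demo the ratio comes out well above 1; the study's finding is the opposite pattern, with site labels recoverable more easily than tissue labels.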

These findings matter because foundation models are being considered for deployment in healthcare settings where accuracy and reliability are critical. The poor performance and high computational costs—foundation models consumed 35 times more energy than traditional approaches—suggest they may not be ready for clinical use. The models' fragility to common image variations like staining differences, compression artifacts, and minor rotations poses real risks for medical diagnosis.

The research identifies several fundamental limitations. Foundation models struggle because medical images contain complex tissue structures that don't fit the single-object assumptions of traditional computer vision. The standard 224x224 pixel patches used by most models are too small to capture the mesoscale architecture and contextual relationships that pathologists rely on for diagnosis. Additionally, medical data scarcity and privacy constraints prevent the massive pretraining that foundation models typically require.
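
A back-of-the-envelope calculation shows how little tissue a 224x224 patch actually covers. The microns-per-pixel values below are typical for whole-slide scanners (roughly 0.25 um/px at 40x and 0.5 um/px at 20x); exact figures vary by scanner and are assumptions here, not numbers from the study.

```python
def patch_coverage_um(patch_px=224, microns_per_px=0.25):
    """Physical side length, in microns, covered by a square patch."""
    return patch_px * microns_per_px

side_40x = patch_coverage_um(224, 0.25)  # 56 um per side at ~40x
side_20x = patch_coverage_um(224, 0.5)   # 112 um per side at ~20x
```

Since a single gland or tumor nest can span hundreds of microns, one such patch often captures only a fragment of the structures a pathologist reads in context.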

What remains unknown is whether these limitations can be overcome with current approaches or if entirely new methods are needed. The research suggests that true advancement may require developing models that see images the way pathologists do—using multi-scale, contextual, and biologically grounded approaches rather than simply scaling up existing architectures.

About the Author

Guilherme A.

Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
