AI Mammography Models Fail Without Clean Data

TL;DR

A new framework, MammoClean, standardizes messy mammography datasets and exposes the biases that hurt AI accuracy, a step toward fairer breast cancer screening tools.

Breast cancer is the most common cancer among women and a leading cause of cancer-related deaths worldwide. Regular mammography screening is the most effective tool for early detection, but artificial intelligence (AI) systems designed to assist in this process often perform poorly when applied to new populations or clinics. A new study reveals that this failure stems from profound inconsistencies in mammography data, which introduce biases that severely compromise AI generalizability. The research introduces MammoClean, a framework that standardizes and harmonizes diverse datasets to quantify and mitigate these biases, paving the way for more equitable and effective AI in healthcare.

The key finding is that dataset-specific biases, such as variations in breast density distributions, annotation styles, and image acquisition protocols, directly degrade AI model performance. For instance, models trained on corrupted data (images with flipped laterality or inverted intensity) performed significantly worse than models trained on curated data. In evaluations with ResNet18 architectures, keeping laterality uniform improved results, while randomly flipping images degraded them, underscoring the impact of data inconsistencies.

MammoClean is a modular, open-source pipeline that processes mammography data in three stages: case selection, standardization of imaging and metadata, and unified storage. It checks for common issues such as laterality mismatches, where images are incorrectly labeled as left or right breast, and intensity flipping, where background and foreground values are inverted. The framework was applied to three public datasets (CBIS-DDSM, TOMPEI-CMMD, and VinDr-Mammo), selected for their accessibility and annotations. Processing steps included normalizing dynamic ranges and correcting flipped images to ensure consistency.
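
The kinds of checks described above can be illustrated with a minimal sketch. This is not MammoClean's actual code or API: the function names, the border-median heuristic for spotting inverted intensity, and the convention that left-breast images show the chest wall on the left edge are all assumptions made for illustration.

```python
import numpy as np

# Assumed convention (not from the paper): left-breast images have the chest
# wall, and therefore most tissue, on the left edge of the image.
EXPECTED_SIDE = {"L": "left", "R": "right"}


def normalize_range(image: np.ndarray) -> np.ndarray:
    """Rescale pixel values to [0, 1] regardless of the original bit depth."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def is_intensity_inverted(image: np.ndarray, border: int = 2) -> bool:
    """Heuristic: most border pixels are background and should be dark."""
    norm = normalize_range(image)
    edge = np.concatenate([
        norm[:border].ravel(), norm[-border:].ravel(),
        norm[:, :border].ravel(), norm[:, -border:].ravel(),
    ])
    return bool(np.median(edge) > 0.5)


def tissue_side(image: np.ndarray) -> str:
    """Return which half of the image carries more signal."""
    norm = normalize_range(image)
    half = norm.shape[1] // 2
    left, right = norm[:, :half].sum(), norm[:, -half:].sum()
    return "left" if left >= right else "right"


def clean_image(image: np.ndarray, laterality: str):
    """Normalize, undo intensity inversion, and mirror if the tissue side
    disagrees with the laterality label. Returns (image, flags)."""
    norm = normalize_range(image)
    inverted = is_intensity_inverted(norm)
    if inverted:
        norm = 1.0 - norm
    mirrored = tissue_side(norm) != EXPECTED_SIDE[laterality]
    if mirrored:
        norm = np.fliplr(norm)
    return norm, {"intensity_inverted": inverted, "mirrored": mirrored}
```

Returning the flags alongside the corrected image keeps an audit trail, so the share of affected images per dataset can be counted afterward. A mismatch could equally mean that the label, not the image, is wrong; a real pipeline would need a rule for deciding which one to fix.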

The paper's results show substantial distributional shifts across datasets. For example, breast density categories varied significantly: most CBIS-DDSM cases fell in category B, while TOMPEI-CMMD and VinDr-Mammo were dominated by category C, reflecting regional and ethnic differences. Label imbalances were also evident: in VinDr-Mammo, over 90% of cases were labeled as BI-RADS 2, which skews training. The study found flipping artifacts in about 28% of CBIS-DDSM images and 23% of VinDr-Mammo images, and correcting them improved model robustness in tasks such as malignancy classification.
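
One simple way to put a number on a shift like the density difference above is to compare category proportions between two datasets. The sketch below uses total variation distance, which is a choice made here for illustration and not necessarily the metric used in the paper. The function names are hypothetical and the label lists are toy data, not the datasets' real counts.

```python
from collections import Counter

DENSITY_CATEGORIES = ("A", "B", "C", "D")  # BI-RADS breast density categories


def category_proportions(labels, categories=DENSITY_CATEGORIES):
    """Share of cases in each category, ignoring labels outside the list."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels in the known categories")
    return {c: counts[c] / total for c in categories}


def total_variation_distance(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


# Toy labels only: one dataset skewed toward B, the other toward C.
dataset_1 = ["A"] * 1 + ["B"] * 5 + ["C"] * 3 + ["D"] * 1
dataset_2 = ["B"] * 2 + ["C"] * 7 + ["D"] * 1

p = category_proportions(dataset_1)
q = category_proportions(dataset_2)
print("majority categories:", max(p, key=p.get), max(q, key=q.get))
print("total variation distance:", round(total_variation_distance(p, q), 3))
```

The same comparison works for any categorical field, such as BI-RADS assessment labels, once it has been mapped to a shared vocabulary.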

This work matters because mammography AI must perform reliably across diverse clinical settings to support early cancer detection. By harmonizing data, MammoClean enables fairer comparisons of AI methods and helps develop systems that are less sensitive to population-specific biases. This is crucial for reducing healthcare disparities: variations in data can produce AI tools that work well in one region but fail in another, potentially missing cancers in underrepresented groups.

The paper notes several limitations. MammoClean reduces unnecessary variability but cannot replace the need for larger, more diverse datasets. It also does not fully eliminate biases inherent in clinical practice, such as those arising from differing radiologist interpretations or incomplete metadata. Future work should focus on subgroup-specific performance assessments and on integrating multimodal data, such as family history or prior exams, to better align AI with real-world clinical decision-making.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
