Multimodal large language models (MLLMs) are revolutionizing medical image reasoning by combining visual and textual data for tasks like disease diagnosis and report generation, but their deployment is plagued by fairness issues that can exacerbate healthcare disparities. These models often perform unevenly across demographic groups such as gender, race, and ethnicity, leading to biased outcomes that disadvantage underrepresented populations and erode trust in AI-assisted healthcare. Traditional debiasing methods, which rely on large labeled datasets or fine-tuning, are impractical for foundation-scale MLLMs due to high costs, data scarcity, and risks like catastrophic forgetting. In this context, in-context learning (ICL) emerges as a promising, lightweight alternative that adapts models using contextual examples without modifying their parameters, yet conventional approaches fail to ensure fairness consistently. This research addresses these gaps by systematically analyzing ICL strategies and proposing a novel framework to enhance equity in medical AI, highlighting the urgent need for scalable solutions in high-stakes clinical environments.
To investigate how in-context learning can improve fairness, the researchers conducted a comprehensive analysis using two MLLMs—Qwen2.5-VL-7B and LLaVA-Med—on medical imaging datasets including FairCLIP Glaucoma and CheXpert Plus, which contain annotated data for attributes like gender, race, and ethnicity. They evaluated three common demonstration selection (DS) strategies: random selection, which samples examples uniformly; similarity-based selection, which picks examples semantically close to the query using cosine similarity in embedding spaces; and clustering-based selection, which groups data into clusters via algorithms like K-means and samples from each for diversity. Fairness was measured using Average Disparity (AD) across demographic attributes, with lower AD values indicating better equity, while performance metrics included accuracy, precision, recall, and F1-score. The analysis focused on how these DS methods propagate demographic imbalances from the selected demonstrations to model predictions, using metrics like Max Diff to quantify data bias and its correlation with fairness outcomes.
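The three DS strategies can be sketched in a few lines of Python. This is a toy illustration over synthetic embeddings, not the paper's implementation: the function names, the NumPy-only K-means, and the "one example per cluster" rule are all assumptions made for brevity.

```python
# Toy sketch of the three demonstration-selection (DS) strategies:
# random, similarity-based (cosine), and clustering-based (K-means).
# All names and the selection rules are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def random_selection(pool, k):
    """Sample k demonstration indices uniformly at random."""
    return rng.choice(len(pool), size=k, replace=False)

def similarity_selection(pool, query, k):
    """Pick the k examples whose embeddings are cosine-closest to the query."""
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = pool_n @ (query / np.linalg.norm(query))
    return np.argsort(-sims)[:k]

def clustering_selection(pool, k, iters=10):
    """Toy K-means with k clusters; return the example nearest each centroid."""
    centroids = pool[rng.choice(len(pool), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((pool[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = pool[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return np.array([np.argmin(((pool - c) ** 2).sum(-1)) for c in centroids])

pool = rng.normal(size=(50, 8))   # 50 candidate demonstration embeddings, dim 8
query = rng.normal(size=8)        # embedding of the test query
print(sorted(similarity_selection(pool, query, 4)))
```

Note that none of these rules look at demographic attributes, which is exactly why, as the results below show, they can silently assemble demographically skewed demonstration sets.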
The results revealed that existing DS methods fail to consistently improve fairness, with strategies showing unstable and often contradictory effects across demographic attributes. For instance, with the Qwen model, similarity-based selection achieved the highest accuracy at 65.3% but also the largest Ethnicity AD at 15.0%, while random selection yielded mixed AD values such as 4.14% for Gender and 8.53% for Ethnicity. Similarly, in LLaVA-Med, random selection resulted in a Race AD of 14.7% but an extremely high Ethnicity AD of 28.0%, demonstrating that optimizing for semantic relevance or coverage alone does not guarantee equitable outcomes. A key finding was the strong positive correlation (R = 0.67) between data bias, measured by Max Diff, and fairness disparity, where higher demographic imbalance in selected examples led to greater AD; for example, selections with Max Diff values above 80% for Ethnicity corresponded to ADs of 8.5–15.0%. These outcomes underscore that conventional DS heuristics inadvertently propagate dataset biases, necessitating a fairness-aware approach to exemplar selection.
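The two measures driving these findings can be formalized in a small sketch. The paper's exact definitions may differ; here Max Diff is assumed to be the gap between the shares of the most- and least-represented subgroup among the selected demonstrations, and Average Disparity (AD) is assumed to be the mean absolute gap between each subgroup's accuracy and the overall accuracy.

```python
# Hedged sketch of the two bias measures discussed above; the exact formulas
# in the paper may differ from these assumed definitions.
from collections import Counter

def max_diff(groups):
    """Share of most-represented minus least-represented subgroup (in %)."""
    counts = Counter(groups)
    shares = [c / len(groups) for c in counts.values()]
    return 100 * (max(shares) - min(shares))

def average_disparity(y_true, y_pred, groups):
    """Mean absolute gap between subgroup accuracy and overall accuracy (in %)."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    gaps = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        acc = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
        gaps.append(abs(acc - overall))
    return 100 * sum(gaps) / len(gaps)

demo_groups = ["A", "A", "A", "B"]   # demographically imbalanced demo set
print(max_diff(demo_groups))          # 3/4 share vs 1/4 share -> 50.0
```

Under these definitions, a perfectly balanced demonstration set has Max Diff of 0, and a model whose accuracy is identical across subgroups has AD of 0, which matches the direction of the reported correlation.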
Motivated by these insights, the researchers developed the Fairness-Aware Demonstration Selection (FADS) framework, which constructs balanced and semantically relevant demonstrations through demographic-aware clustering and subgroup-level sampling. FADS first clusters the labeled data into groups using a pretrained encoder like Sentence-BERT, then subdivides each cluster into sub-clusters based on sensitive attributes and task labels to ensure uniform representation. For each query, it selects a balanced set of demonstrations by choosing an equal number of examples from each demographic subgroup—such as four subgroups for attributes like gender and race—prioritizing those with high semantic similarity to the query. This mitigates both data-driven and model-induced biases without requiring model retraining or fine-tuning, as it operates entirely within the in-context learning paradigm. Experiments showed that FADS effectively reduces demographic imbalances in the selected examples, with Max Diff values dropping to near zero for attributes like Race, compared to 67.30% for random selection, leading to more stable and equitable model behavior.
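The subgroup-level sampling step of FADS can be illustrated with a minimal sketch. The real framework first clusters candidates with a pretrained encoder such as Sentence-BERT; this toy version skips that stage and only shows the balancing idea, enforcing equal representation per (attribute, label) subgroup while preferring query-similar examples within each subgroup. All names and the simplified single-attribute setup are assumptions.

```python
# Minimal sketch of FADS-style balanced sampling (the encoder/clustering stage
# is omitted). Each candidate carries an embedding, a sensitive attribute, and
# a task label; we take the most query-similar examples per subgroup.
import numpy as np
from collections import defaultdict

def fads_select(embeddings, attrs, labels, query, per_subgroup=1):
    """Return indices: `per_subgroup` most query-similar examples per subgroup."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ (query / np.linalg.norm(query))
    subgroups = defaultdict(list)
    for i, (a, y) in enumerate(zip(attrs, labels)):
        subgroups[(a, y)].append(i)          # subgroup = (attribute, label)
    selected = []
    for members in subgroups.values():
        ranked = sorted(members, key=lambda i: -sims[i])
        selected.extend(ranked[:per_subgroup])
    return selected

rng = np.random.default_rng(1)
emb = rng.normal(size=(12, 4))
attrs = ["male", "female"] * 6
labels = [0, 0, 1, 1] * 3
query = rng.normal(size=4)

demos = fads_select(emb, attrs, labels, query, per_subgroup=1)
# one demo per (gender, label) subgroup -> 4 demonstrations, Max Diff of 0
print([attrs[i] for i in demos])   # → ['male', 'female', 'male', 'female']
```

Because every subgroup contributes the same number of examples, the selected set is balanced by construction, which is how the method drives Max Diff toward zero without touching model weights.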
The implications of this research are profound for the deployment of AI in healthcare, as FADS offers a practical, scalable solution to enhance fairness without the computational overhead of traditional methods. By improving equity across gender, race, and ethnicity, this approach can help reduce diagnostic disparities and build trust in AI systems, particularly in resource-limited settings where data annotation and model retraining are infeasible. The study highlights the potential of fairness-aware in-context learning as a general framework for other high-stakes domains like autonomous systems or cybersecurity, where biased predictions could have severe consequences. Moreover, the ability to maintain competitive task performance—for example, FADS achieved an accuracy of 66.4% on Glaucoma data while reducing Average Disparity to 5.75%—underscores its viability for real-world applications, encouraging broader adoption of equitable AI in medicine and beyond.
Despite its strengths, the research has limitations, including its focus on a limited set of sensitive attributes and datasets, which may not capture all real-world demographic complexities or intersectional biases. The authors note that FADS performance can vary with the number of demonstrations and dataset size, as smaller shot budgets or imbalanced candidate pools may constrain fairness improvements; for instance, in 4-shot settings, FADS reduced Gender AD to 2.80% but had higher Race AD at 7.35%. Future work should explore extensions to multiple sensitive attributes, adaptive retrieval strategies, and applications in more diverse medical imaging contexts to enhance generalizability. Additionally, the reliance on pre-existing demographic annotations could be a barrier in settings with missing data, suggesting a need for methods that infer or handle incomplete attributes. These limitations point to ongoing challenges in achieving perfect fairness but also open avenues for further innovation in lightweight, ethical AI systems.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.