Artificial intelligence systems often learn hidden biases from training data, leading to unfair decisions in real-world applications. A new study reveals that standard methods for transferring knowledge from large AI models to smaller ones can undermine efforts to make these systems fair and robust, posing risks for deployment in sensitive areas like hiring or healthcare.
The researchers investigated whether debiasing capability, meaning a model's reduced reliance on spurious correlations, can be transferred through knowledge distillation, a common process in which a smaller student model learns from a larger teacher model. They found that, overall, debiasing capability is undermined by distillation. In natural language inference and image classification tasks, student models were often more susceptible to biases than their teachers, with performance gaps on out-of-distribution data widening by up to 12.5 percentage points in some cases.
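Knowledge distillation in its common logit-based form can be sketched as follows. This is a minimal NumPy illustration of the widely used soft-target formulation (a temperature-scaled KL divergence blended with ordinary cross-entropy), not the study's own training code. The function names and the `temperature` and `alpha` values are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature scaling (numerically stable)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max to avoid overflow
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy.

    alpha weights the teacher-matching term; (1 - alpha) weights the
    ground-truth term. Both hyperparameters are illustrative assumptions.
    """
    eps = 1e-12
    labels = np.asarray(labels)

    # Soft targets: teacher and student distributions at the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
        axis=1,
    ).mean()

    # Hard targets: standard cross-entropy at temperature 1.
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + eps).mean()

    # The T^2 factor keeps the soft-target gradient scale comparable across T.
    return alpha * temperature**2 * kl + (1.0 - alpha) * ce

# Toy batch: two examples, three classes.
teacher_logits = np.array([[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]])
student_logits = np.array([[1.0, 0.5, -0.5], [0.5, 1.5, 0.0]])
print(distillation_loss(student_logits, teacher_logits, labels=[0, 1]))
```

Note that nothing in this objective distinguishes biased from unbiased examples: the student is rewarded for matching the teacher's averaged output distribution, which is one plausible reason robustness on rare, bias-conflicting inputs need not carry over.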
To assess this, the team conducted extensive experiments using models like BERT, T5, ResNet, and ViT across scales from tiny to large. They trained models from scratch and applied knowledge distillation, comparing performance on in-domain and out-of-domain datasets such as CelebA for facial attributes and MNLI for language inference. Metrics included accuracy and F1 scores, with a focus on spurious gaps that measure vulnerability to biases.
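One way to make the idea of a gap concrete is sketched below. The article does not give the study's exact definition of a spurious gap, so this sketch assumes a simple one: in-distribution accuracy minus out-of-distribution accuracy, in percentage points. The function names and the toy predictions are hypothetical.

```python
import numpy as np

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))

def spurious_gap(id_preds, id_labels, ood_preds, ood_labels):
    """Assumed definition: ID accuracy minus OOD accuracy, in percentage points.

    A larger gap means the model loses more accuracy once the shortcut
    features it relied on stop predicting the label.
    """
    return 100.0 * (accuracy(id_preds, id_labels) - accuracy(ood_preds, ood_labels))

# Hypothetical toy labels and predictions, for illustration only.
id_labels = [1, 0, 1, 1, 0]
ood_labels = [1, 0, 0, 1, 0]

teacher_gap = spurious_gap(id_labels, id_labels, [1, 0, 0, 1, 1], ood_labels)
student_gap = spurious_gap(id_labels, id_labels, [1, 1, 1, 1, 1], ood_labels)

# A positive difference means the student's gap is wider than the teacher's.
print(student_gap - teacher_gap)
```

Comparing the student's gap with the teacher's, rather than comparing raw accuracies, isolates how much robustness was lost in distillation from how much accuracy was lost overall.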
The results indicate that while student models can mimic teacher performance on standard data, they frequently fail to maintain robustness on challenging, biased subsets. Relative scale also mattered: debiasing transferred more effectively when teacher and student models were of similar size, while larger mismatches led to significant degradation. Analyses of internal mechanisms, using techniques such as Centered Kernel Alignment, showed that activation patterns in the middle and later layers diverged between teachers and students on biased data, which is consistent with the observed performance drops.
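Centered Kernel Alignment compares two sets of layer activations computed on the same inputs and returns a similarity score between 0 and 1. The sketch below implements the standard linear variant in NumPy. Whether the study used the linear or a kernel variant is not stated here, so treat this as an assumption, and the random arrays only stand in for real teacher and student activations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices.

    x: (n_examples, d1) activations from one layer or model
    y: (n_examples, d2) activations from another, on the same examples
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
teacher_acts = rng.normal(size=(100, 16))   # placeholder teacher activations
student_acts = rng.normal(size=(100, 8))    # placeholder, unrelated student activations

print(linear_cka(teacher_acts, teacher_acts))  # identical representations
print(linear_cka(teacher_acts, student_acts))  # unrelated representations
```

Because the two matrices may have different widths, the measure suits teacher and student models of different sizes. Computing it layer by layer, separately on biased and unbiased subsets, would show where the representations begin to diverge.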
This research matters because AI systems are increasingly used in critical decision-making, and ensuring they are fair is essential for trust and safety. If knowledge distillation compromises debiasing, it could perpetuate discrimination in applications like automated resume screening or medical diagnostics, where biased predictions have real-world consequences.
The study has limitations: it focuses on logit-based distillation methods and does not explore scenarios with multiple teachers. The authors propose remedies such as data augmentation, iterative distillation, and weight initialization to improve debiasing transfer, but these require further validation. Future work should investigate self-distillation and counterfactual augmentation to enhance robustness without relying solely on external teachers.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.