AI Unlearning Evaluations Are Flawed

Machine learning models sometimes need to forget specific data points, for example to comply with privacy regulations or to remove biased information, but current evaluation methods may mislead researchers about how well that forgetting works. A new study finds that relying on a single training seed, a common practice, can produce unreliable results that overestimate or underestimate the performance of unlearning algorithms. The issue matters because it affects trust in AI systems used for sensitive tasks such as data protection and fairness.

The key finding is that machine unlearning, which aims to remove the influence of certain data from a trained model without full retraining, is highly sensitive to the random seed used during model training. The researchers demonstrated that using only one seed leads to inaccurate assessments of unlearning metrics such as forget-set accuracy (the model's accuracy on the data targeted for removal) and retain-set accuracy (its accuracy on the data it is meant to keep). For example, in experiments with the CIFAR-100 dataset, a single seed could make an unlearning method look effective, while multiple seeds revealed significant variability, with some runs performing worse than the gold-standard approach of retraining the model.

The methodology involved training models with different seeds and applying unlearning algorithms such as Random-Labels, Bad-Teacher, and LFSSD to each one. The team used a ResNet-18 architecture trained from scratch on datasets including CIFAR-100 and its 20-superclass version, with evaluations run on NVIDIA A100 GPUs. They compared results from a single training seed (the common practice) against results from multiple seeds (their recommendation), using variance decomposition to show how seed choice affects the distribution of outcomes.
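To make the idea of variance decomposition concrete, the sketch below splits the spread of a metric into a part explained by the training seed and a part that remains within each training seed. It is an illustration based on the law of total variance, not the paper's own code; the function name `decompose_variance`, the layout of `scores`, and the example numbers are all assumptions made for this example.

```python
from statistics import fmean, pvariance


def decompose_variance(scores):
    """Split metric variance into between-seed and within-seed parts.

    scores[i][j] is a metric (e.g. forget-set accuracy) for training
    seed i and unlearning run j. All rows are assumed to have the same
    length, so the two parts sum exactly to the total variance.
    """
    # Mean metric per training seed.
    seed_means = [fmean(row) for row in scores]
    # Spread caused by the choice of training seed.
    between = pvariance(seed_means)
    # Average spread left over once the training seed is fixed.
    within = fmean(pvariance(row) for row in scores)
    # Variance over every run pooled together.
    total = pvariance([value for row in scores for value in row])
    return {"between": between, "within": within, "total": total}


# Made-up accuracies: 3 training seeds, 3 unlearning runs each.
example_scores = [
    [0.70, 0.72, 0.71],
    [0.60, 0.62, 0.61],
    [0.80, 0.82, 0.81],
]
parts = decompose_variance(example_scores)
print(parts)
```

In this made-up example almost all of the spread sits in the between-seed term, which is exactly the component a single-seed evaluation cannot observe. For a deterministic unlearning method every row would hold identical values, so the within-seed term would be zero.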

The figures in the paper indicate that single-seed evaluations can distort an algorithm's true performance. In Figure 1, boxplots of retain-set and forget-set accuracy show that using 11 training seeds (as recommended) captures a wider range of outcomes, with black lines marking the 25% to 75% quantiles of the gold-standard Retrain method. In the CIFAR-100 sub-class tests, for instance, forget-set accuracy varied substantially across seeds. Deterministic unlearning methods, which always produce the same output for a given seed, are particularly prone to this issue. Figure 2 supports the same conclusion using the 2-Wasserstein distance, showing larger discrepancies between distributions when only one seed is used.
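For readers unfamiliar with the 2-Wasserstein distance, the sketch below computes it for two one-dimensional samples of equal size, where it reduces to comparing the sorted values pairwise. This is a simplified illustration: how the paper builds the distributions it compares is not described here, and the function name `wasserstein2_1d` and the sample values are assumptions.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points, the
    optimal matching pairs the k-th smallest value of one sample with
    the k-th smallest value of the other.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(xs), sorted(ys)
    mean_sq_gap = sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)
    return math.sqrt(mean_sq_gap)


# Made-up forget-set accuracies: an unlearning method vs. retraining.
unlearned = [0.12, 0.30, 0.18, 0.25]
retrained = [0.10, 0.14, 0.11, 0.13]
print(wasserstein2_1d(unlearned, retrained))
```

A distance of zero means the two samples have identical values, and shifting every value in one sample by a constant yields a distance equal to that shift.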

This matters because flawed evaluations could lead to unreliable AI being deployed in real-world applications, such as systems that handle GDPR-compliant data deletion or mitigate bias in automated decisions. For general readers, it underscores the importance of rigorous testing in AI development, so that models genuinely forget sensitive information without losing overall functionality. The research prompts a reevaluation of standard practices in machine learning audits.

The paper notes some limitations. The study compares only the common single-seed approach against the multiple-seed recommendation, without a broader analysis of other evaluation instances. The findings also focus on specific datasets and unlearning methods, so future work is needed to establish how well they generalize to other scenarios.

About the Author

Guilherme A.