
New Loss Functions Combat Label Errors to Build More Robust AI Models

University of Waterloo researchers propose Blurry Loss and Piecewise-zero Loss, outperforming baselines in detecting mislabeled data during training.

AI Research
March 26, 2026
4 min read

In the high-stakes world of machine learning, where models are only as good as the data they're trained on, a persistent and often overlooked problem lurks: label errors. These mislabeled data points, where an image of a cat might be tagged as a dog or a handwritten digit '7' incorrectly labeled as a '1', can wreak havoc on model performance, leading to inaccurate predictions and unreliable systems. While much attention has been paid to cleaning up noisy data after the fact, a new study from researchers at the University of Waterloo tackles this issue head-on by rethinking the very foundation of model training: the loss function. The authors describe label errors as central to the degradation of model accuracy, particularly in large-scale datasets like the BIOSCAN-5M insect collection, where human taxonomic labeling introduces inevitable mistakes. This work shifts the focus from post-hoc correction to proactive robustness, proposing novel loss functions designed to make models inherently resilient to corrupted labels during training, a move that could save countless hours of manual data cleaning and improve the reliability of AI systems across domains.

The core innovation lies in two new categorical loss functions: Blurry Loss and Piecewise-zero Loss, both inspired by but fundamentally different from the widely used Focal Loss. Focal Loss, introduced to handle class imbalance, emphasizes difficult-to-classify samples by weighting them more heavily during training. However, in datasets with label errors, these 'difficult' samples are often the mislabeled ones, causing models to fit to erroneous data. Blurry Loss flips this logic by de-weighting difficult samples, defined mathematically as BL(pt) = -(pt)^γ log(pt), where pt is the predicted probability for the ground-truth label and γ is a parameter controlling the degree of de-weighting. Piecewise-zero Loss takes a more aggressive approach, assigning zero loss and zero gradient to samples with pt below a cutoff parameter c, effectively ignoring them entirely during training. The underlying assumption is that samples with label errors will have low pt if the model has seen enough correct data to learn properly, allowing these loss functions to sidestep the noise without complex additional terms.
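Both losses are simple enough to sketch directly. The following is a minimal NumPy illustration (not the authors' code) of the two functions, each taking pt, the predicted probability for the ground-truth label; the defaults γ=0.4 and c=0.05 mirror the settings reported in the experiments.

```python
import numpy as np

def blurry_loss(pt, gamma=0.4):
    """Blurry Loss: BL(pt) = -(pt)^gamma * log(pt).

    De-weights difficult samples (low pt), the opposite of Focal
    Loss, FL(pt) = -(1 - pt)^gamma * log(pt), which up-weights them.
    """
    pt = np.clip(pt, 1e-12, 1.0)  # numerical safety for log(0)
    return -(pt ** gamma) * np.log(pt)

def piecewise_zero_loss(pt, c=0.05):
    """Piecewise-zero Loss: zero loss (and hence zero gradient) for
    samples whose ground-truth probability falls below the cutoff c;
    plain negative log-likelihood otherwise."""
    pt = np.clip(pt, 1e-12, 1.0)
    return np.where(pt < c, 0.0, -np.log(pt))
```

A suspected mislabel with pt = 0.01 is heavily down-weighted by Blurry Loss relative to Cross Entropy, and ignored outright by Piecewise-zero Loss, while a confident correct sample (pt = 0.9) incurs nearly the same small loss under all three.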

To validate their approach, the researchers conducted experiments on artificially corrupted versions of the MNIST and Fashion MNIST datasets, introducing label errors at rates of 10% and 20% to simulate real-world noise. They used a minimal convolutional neural network with 1.2 million parameters, trained for 10 epochs with the Adam optimizer, and incorporated a loss scheduling scheme in which models started with conventional Cross Entropy Loss before switching to the proposed functions after a delay hyperparameter d. Performance was measured using the Confident Learning framework, which detects label errors by identifying samples with low probability for their labeled class and high probability for another class, with F1 scores serving as the primary metric. The results, detailed in bar charts and tables, showed that both Blurry Loss and Piecewise-zero Loss outperformed the Cross Entropy and Focal Loss baselines, especially at higher corruption rates. For instance, on MNIST with 20% corruption, Blurry Loss achieved an F1 score of 0.9845 with γ=0.4 and no delay, compared to 0.9708 for Focal Loss, while Piecewise-zero Loss reached 0.9827 with c=0.05 and a delay of 4 epochs.
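The two moving parts of this setup, artificial label corruption and delayed loss switching, can be sketched as follows. This is illustrative code under my own naming, not the paper's implementation; the defaults d=4 and c=0.05 match the reported hyperparameters.

```python
import numpy as np

def corrupt_labels(labels, n_classes, rate, rng):
    """Flip a fraction `rate` of labels to a different random class,
    mimicking the artificial 10%/20% corruption of the benchmarks."""
    noisy = labels.copy()
    n_flip = int(rate * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # A shift in 1..n_classes-1 guarantees the new label differs.
    shift = rng.integers(1, n_classes, size=n_flip)
    noisy[idx] = (noisy[idx] + shift) % n_classes
    return noisy

def scheduled_loss(pt, epoch, d=4, c=0.05):
    """Loss scheduling: plain Cross Entropy for the first d epochs,
    letting the model first fit the (mostly correct) data, then
    Piecewise-zero Loss to ignore low-confidence samples."""
    pt = np.clip(pt, 1e-12, 1.0)
    if epoch < d:
        return -np.log(pt)                     # Cross Entropy warm-up
    return np.where(pt < c, 0.0, -np.log(pt))  # Piecewise-zero Loss
```

The delay matters because the piecewise-zero cutoff only makes sense once the model's predicted probabilities are informative; ignoring low-pt samples from epoch zero would discard hard-but-correct examples before the model has learned anything.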

The implications of this research extend far beyond academic benchmarks, offering a practical toolkit for improving data quality in machine learning pipelines. By making models robust to label errors during training, these loss functions could reduce the need for expensive and time-consuming manual data cleaning, particularly in large-scale applications like the BIOSCAN-5M dataset, where taxonomic disagreements and subtle visual differences exacerbate labeling errors. The authors note that this approach not only enhances error detection but also paves the way for more reliable classifiers in fields such as entomology, where accurate identification can impact biodiversity assessments. Moreover, the simplicity of Blurry Loss and Piecewise-zero Loss, which require only minor adjustments to existing training code, makes them accessible for widespread adoption, potentially benefiting domains from medical imaging to autonomous driving, where label errors can have serious consequences.

Despite these promising results, the study acknowledges several limitations that warrant further investigation. The experiments relied on artificially corrupted data, which may not fully capture the complexity of real-world label errors, such as those arising from taxonomic disputes in the BIOSCAN datasets. Additionally, the performance gains were more modest on Fashion MNIST, a more challenging dataset than MNIST, suggesting that the effectiveness of these loss functions may vary with data complexity and model architecture. The authors also highlight the need for comparisons with other robust loss functions, like Generalized Cross Entropy, to better understand their relative advantages. Future work aims to apply these loss functions to realistic datasets with unknown error rates, potentially leading to improvements in classifier performance and even influencing scientific fields like taxonomy by identifying and correcting mislabeled specimens.

In conclusion, the development of Blurry Loss and Piecewise-zero Loss represents a significant step forward in the quest for robust machine learning, addressing the pervasive issue of label errors at their source. By reimagining loss functions to de-emphasize or ignore potentially mislabeled samples, this research offers a scalable solution that balances simplicity with efficacy, as evidenced by improved F1 scores in detection tasks. As datasets continue to grow in size and complexity, tools like these will be crucial for ensuring that AI systems can learn reliably from imperfect data, ultimately driving progress in both technology and science. The authors' focus on practical implementation, coupled with their acknowledgment of current constraints, sets the stage for ongoing refinement and real-world application in diverse fields.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn