Large Language Models (LLMs) like GPT-5 and Llama have revolutionized how we interact with technology, but they are far from perfect. These models often exhibit systematic errors on specific subsets of data, known as error slices, where they consistently underperform. For instance, a model might struggle to identify toxic comments targeting a particular demographic, or it might assume that cities with cold climates lie further north than warmer ones, as when GPT-5 mistakenly answered that Montreal is north of London. Identifying these error slices is crucial for understanding and improving model reliability, but it has traditionally required extensive manual annotation, which is labor-intensive and costly. In a paper presented at the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data, researchers from the University of Waterloo and New York University introduce an approach called active slice discovery that aims to automate this process with minimal human input. The method uses active learning to group errors that likely belong to the same slice, relying on limited access to an annotator to verify patterns, and could transform how we audit and refine AI systems in real-world applications.
The methodology behind active slice discovery is rooted in a formal problem definition that combines machine learning with human-in-the-loop annotation. The researchers consider a joint distribution over input text, labels, and slice memberships, and the goal is to output a slice membership function that can detect errors. They start with a trained classifier, a small annotated dataset, and a larger unlabeled dataset, along with a budget of active queries to an oracle, such as a human labeler. The algorithm iteratively selects unlabeled examples to query based on strategies like uncertainty sampling, which prioritizes data points where the model is least confident. For example, it might ask an annotator to confirm whether a specific prompt-response pair, like GPT-5 incorrectly answering that moving from Ulaanbaatar to Paris goes south, belongs to the same error slice as previous mistakes. The process is implemented as a flexible pipeline that supports different combinations of base models, representations, classifiers, and active learning strategies, with the source code made available to support future research.
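The query loop described above can be sketched in a few lines. This is a minimal illustration of the general pattern (uncertainty sampling against a budgeted oracle), not the paper's actual implementation; all function names are hypothetical, and a logistic regression stands in for the pipeline's pluggable slice classifier.

```python
# Sketch of a budgeted active-learning loop for slice discovery.
# Hypothetical names; the paper's pipeline supports other classifiers,
# representations, and query strategies.
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence_scores(clf, X):
    """Lower top-class probability = higher uncertainty = query first."""
    probs = clf.predict_proba(X)
    return 1.0 - probs.max(axis=1)

def active_slice_discovery(X_seed, y_seed, X_pool, oracle, budget, batch_size=10):
    """Iteratively query an annotator (oracle) for slice-membership labels.

    X_seed/y_seed: small annotated seed set; X_pool: unlabeled numpy array;
    oracle(i): returns the slice label for pool example i (e.g. a human).
    """
    X_lab, y_lab = list(X_seed), list(y_seed)
    pool_idx = list(range(len(X_pool)))
    clf = LogisticRegression(max_iter=1000)
    while budget > 0 and pool_idx:
        clf.fit(np.array(X_lab), np.array(y_lab))
        # Score the remaining pool and pick the most uncertain examples.
        scores = least_confidence_scores(clf, X_pool[pool_idx])
        k = min(batch_size, budget, len(pool_idx))
        picked = np.argsort(scores)[::-1][:k]
        for p in sorted(picked, reverse=True):  # pop from the back first
            i = pool_idx.pop(p)
            X_lab.append(X_pool[i])
            y_lab.append(oracle(i))  # annotator confirms slice membership
            budget -= 1
    clf.fit(np.array(X_lab), np.array(y_lab))
    return clf  # the learned slice-membership function
```

On synthetic data with two separable clusters, a few dozen oracle queries are enough for the returned classifier to recover the slice almost perfectly, which mirrors the label-efficiency argument the paper makes.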
In their experiments, the team evaluated active slice discovery on the Jigsaw Toxicity dataset, using Llama 3.1-8B as the base model and comparing two types of internal-state representations: raw embeddings from the penultimate layer and sparse activations from a sparse autoencoder (SAE). They tested several active learning query strategies, including uncertainty-based methods like Least Confidence and diversity-based approaches like Embedding K-Means, against a random sampling baseline. The results, detailed in Figures 2 and 3 of the paper, show that uncertainty-based strategies consistently outperformed the others, achieving high accuracy with far fewer labels. On the 'disagree' slice, for instance, SAE representations with an SVM classifier and Least Confidence queries reached 83.0% detection accuracy with only 1,000 labeled examples out of more than 12,000, whereas raw embeddings needed 3,500 labels for similar performance. This is a dramatic reduction in annotation requirements, and some slices, like 'female' and 'christian', were detectable with just a few hundred annotations thanks to clearer lexical cues.
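To make the Least Confidence + SVM combination concrete, here is a hedged sketch of how such a query ranking could work. For a margin classifier like an SVM, a common proxy for confidence is the absolute decision-function value (distance to the separating hyperplane); the feature vectors below are random stand-ins for SAE activations, and the setup is illustrative rather than the authors' code.

```python
# Least-Confidence-style ranking for an SVM: examples closest to the
# decision boundary are queried first. Synthetic features stand in for
# SAE activations; this is an illustration, not the paper's pipeline.
import numpy as np
from sklearn.svm import SVC

def least_confidence_rank(clf, X_unlabeled):
    """Rank pool examples from most to least uncertain.

    For a binary SVM, |decision_function| is the distance to the margin:
    small magnitude = low confidence = high query priority.
    """
    margins = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(margins)  # ascending margin = most uncertain first

rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(-1, 1, (20, 8)), rng.normal(1, 1, (20, 8))])
y_train = np.array([0] * 20 + [1] * 20)
X_pool = rng.normal(0, 1.5, (100, 8))  # unlabeled pool near the boundary

svm = SVC(kernel="linear").fit(X_train, y_train)
query_order = least_confidence_rank(svm, X_pool)
print(query_order[:10])  # the ten pool points to annotate first
```

Diversity-based strategies like Embedding K-Means would instead cluster the pool in representation space and query near cluster centers; the paper's finding is that the uncertainty-driven ordering above tends to need fewer annotations.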
The implications of this research are significant for AI ethics and model development. By reducing labeling needs by up to 98% relative to full supervision, active slice discovery makes it feasible to continuously monitor and improve LLMs in production environments, where error patterns can emerge dynamically. The approach not only enhances interpretability by uncovering systematic failures but also guides targeted data collection and model updates, potentially mitigating biases and improving fairness. For example, discovering that a model underperforms on toxicity detection for specific demographics could motivate more inclusive training data. Moreover, high-quality representations like SAE activations allow even simple models, such as SVMs, to remain competitive, lowering the barrier to practical adoption. This aligns with broader trends in mechanistic interpretability, where understanding a model's internal features is key to building trustworthy AI systems.
However, the study acknowledges several limitations that warrant further investigation. The performance of active slice discovery varies significantly across slice types: identity-based slices with consistent lexical patterns are easier to detect, while sentiment-based slices like 'disagree' and 'sad' require more annotations and show less improvement with limited data. This suggests the method may struggle with heterogeneous or subtle error patterns, which are common in real-world scenarios. In addition, the research focuses on a single dataset and model, leaving open questions about generalizability to other domains and larger-scale systems. The authors note that while uncertainty-based strategies are effective, they may not always capture the full complexity of error slices, and future work could explore hybrid approaches or more sophisticated query strategies. Despite these limitations, the paper lays a strong foundation for active slice discovery, offering a practical toolkit for researchers and practitioners to improve model reliability with minimal human effort.
Looking ahead, integrating active slice discovery into AI development pipelines could reshape how we approach model auditing and maintenance. As LLMs become more pervasive in critical applications, from content moderation to healthcare, the ability to efficiently identify and address systematic errors will be paramount. The research provides a methodological advance while underscoring the importance of human-AI collaboration in building robust systems. By leveraging active learning to pinpoint where models fail, we can move toward more transparent and accountable AI, ultimately fostering greater trust in these technologies. The findings from this study, published in the NeurIPS 2025 workshop proceedings, mark a significant step forward in the quest for reliable machine learning, highlighting the potential of smart annotation strategies to unlock deeper insights into model behavior.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.