
The Illusion of Interpretability: How Prototype-Based AI Explanations Can Mislead


AI Research
March 26, 2026
4 min read

In the high-stakes world of artificial intelligence, where models increasingly guide decisions in healthcare, finance, and autonomous systems, the demand for transparency has never been greater. A popular solution has emerged in the form of case-based reasoning networks, machine-learning models that make predictions by comparing inputs to prototypical parts of training samples, known as prototypes. These models are often hailed as 'interpretable by design' because they can explain each decision by pointing to the prototypes that contributed to the outcome. However, a groundbreaking study from researchers at Université Paris-Saclay and CEA reveals a critical flaw: these explanations can be profoundly misleading, undermining their reliability in safety-critical contexts. The paper, titled 'Formal Abductive Latent Explanations for Prototype-Based Networks,' demonstrates that multiple instances can lead to different predictions while receiving the same explanation, exposing a vulnerability that could have serious consequences in real-world applications.

To understand the problem, consider a prototype-based network like ProtoPNet, which is designed for image classification. During inference, an image is processed through an encoder to produce a latent representation, which is then compared to learned prototypes, each representing a key concept such as 'emperor beak' or 'royal yellow crests' in a penguin classification task. The model computes similarity scores between latent components and prototypes, aggregates these into an activation vector, and uses a decision layer to output a class prediction. The explanation typically consists of the top-k activated prototypes, ostensibly showing why the image was classified a certain way. For example, a royal penguin might be identified 'because it shows a royal beak.' Yet the researchers constructed a counter-example in which an image with latent representation z′ yields activation a(z′) = [6, 7, 1, 8, 2] and is classified as an emperor penguin, while the same explanation ('because it shows a royal beak') still applies. This inconsistency stems from the fact that less active but crucial prototypes are omitted, rendering the explanation optimistic or misleading: it fails to guarantee that all instances matching the explanation will receive the same prediction.
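The inference pipeline described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the dimensions, the log-similarity form, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 5 prototypes in a 4-dim latent space, 2 classes.
prototypes = rng.normal(size=(5, 4))     # learned prototype vectors
class_weights = rng.normal(size=(2, 5))  # decision layer (classes x prototypes)

def activation(z, prototypes):
    """ProtoPNet-style similarity of latent vector z to each prototype:
    large when the squared distance is small (monotonically decreasing)."""
    d2 = np.sum((prototypes - z) ** 2, axis=1)
    return np.log((d2 + 1.0) / (d2 + 1e-4))

def predict(z):
    a = activation(z, prototypes)  # activation vector
    logits = class_weights @ a     # decision layer
    return int(np.argmax(logits)), a

z = rng.normal(size=4)           # stand-in for the encoder's output
c, a = predict(z)
top_k = np.argsort(a)[::-1][:3]  # the usual "explanation": top-3 prototypes
```

The counter-example in the article exploits exactly this gap: two latent vectors can share the same top-k prototypes while the omitted, lower-ranked activations flip the decision layer's output.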

The core of the research lies in bridging the gap between prototype-based learning and formal explainable AI (FXAI), a rigorous framework that provides guarantees on explanation correctness. FXAI typically defines abductive explanations as subsets of input features sufficient to justify a model's decision, but it suffers from scalability issues due to expensive prover calls and operates at the pixel level, which may not align with human interpretability. The authors propose Abductive Latent Explanations (ALEs), a novel formalism that expresses sufficient conditions on the intermediate latent representation of an instance to imply the prediction. An ALE is a subset of latent features E that entails a precondition ϕ_E over the latent representation, ensuring correctness: ∀x ∈ F, ϕ_E(f(x), f(v)) ⇒ κ(x) = c, where f is the encoder, κ the end-to-end classifier, v the instance being explained, and c its predicted class. This approach combines the inherent interpretability of prototype-based models with the formal guarantees of FXAI, moving explanations from the pixel level to a more abstract, concept-driven space that humans can better understand.
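The correctness condition can be made concrete with a small sanity checker. The sketch below, under assumed toy dimensions and names, empirically tests whether fixing the activations in a candidate set E pins down the prediction. Note the asymmetry: the paper derives this guarantee formally, whereas random sampling can only falsify sufficiency, never prove it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear decision layer over 5 prototype activations, 2 classes.
W = np.array([[1.0, 0.5, -0.2, 0.1, 0.0],
              [0.2, -0.1, 0.8, 0.9, 0.3]])

def predict(a):
    return int(np.argmax(W @ a))

def looks_sufficient(E, a_ref, n_samples=2000, lo=0.0, hi=10.0):
    """Monte-Carlo check of the ALE condition: fix the activations indexed
    by E at their reference values, sample every other activation freely
    in [lo, hi], and see whether the prediction ever changes."""
    c = predict(a_ref)
    free = [i for i in range(len(a_ref)) if i not in E]
    for _ in range(n_samples):
        a = a_ref.copy()
        a[free] = rng.uniform(lo, hi, size=len(free))
        if predict(a) != c:
            return False  # found a counter-example: E is not sufficient
    return True           # no counter-example found (not a proof)

a_ref = np.array([8.0, 1.0, 1.0, 1.0, 1.0])
```

Fixing every activation is trivially sufficient, while fixing none of them typically is not, which is precisely why top-k explanations that omit low-ranked prototypes can mislead.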

To compute ALEs without relying on costly external solvers, the researchers developed three distinct paradigms. The first uses the triangular inequality property of distance functions to derive bounds on similarity scores between latent vectors and prototypes, propagating these into activation space constraints. The second employs a hypersphere intersection approximation, where latent vectors are seen as intersections of hyperspheres centered on prototypes, allowing for refined distance boundaries. The third paradigm mirrors the original ProtoPNet approach, iteratively adding the highest-activated prototypes until the explanation is verified. Algorithms were designed to generate subset-minimal ALEs, with a forward pass that builds candidate explanations and a backward pass that prunes unnecessary pairs, ensuring both sufficiency and minimality. Experimental studies on diverse datasets—including CIFAR-10, CIFAR-100, MNIST, Oxford Flowers, Oxford Pet, Stanford Cars, and CUB200—using ProtoPNet models with backbones like VGG, ResNet, and WideResNet, revealed that top-k explanations often require more than the standard 10 prototypes to guarantee decisions, indicating that most original ProtoPNet explanations are misleading. Moreover, incorrectly classified samples tend to have much larger ALEs, suggesting explanation size could serve as a proxy for model uncertainty or out-of-distribution detection.
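The first paradigm's bound propagation can be illustrated directly. Given only the distance from a latent vector z to one known prototype p_j, the triangle inequality brackets z's distance to every other prototype. A minimal sketch, with dimensions and names assumed for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
prototypes = rng.normal(size=(5, 4))  # hypothetical learned prototypes

def distance_bounds(j, d_zj, prototypes):
    """Triangle-inequality bounds on d(z, p_k) for every prototype k,
    given only the known distance d_zj from z to prototype j:
        |d(p_j, p_k) - d(z, p_j)| <= d(z, p_k) <= d(p_j, p_k) + d(z, p_j)
    """
    d_jk = np.linalg.norm(prototypes - prototypes[j], axis=1)
    return np.abs(d_jk - d_zj), d_jk + d_zj
```

Because the similarity function decreases monotonically with distance, these distance brackets translate directly into intervals on the activation vector; the second paradigm tightens them by intersecting the hyperspheres induced by several known latent-prototype pairs.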

The implications of this work are profound for the field of AI interpretability. By exposing the limitations of prototype-based explanations, the study challenges the notion that these models are inherently trustworthy, especially in high-risk domains where incorrect justifications could lead to catastrophic outcomes. The development of ALEs offers a path toward more reliable explanations, but the research also highlights a key limitation: the size of obtained explanations can be large, sometimes involving thousands of latent feature-prototype pairs, which may still be too complex for human consumption. This finding nuances previous claims of interpretability and underscores the need for further advances in training methods that yield more compact and meaningful latent spaces. Future directions include extending ALEs to contrastive explanations, applying them to other modalities, and using them to identify irrelevant components for network pruning, ultimately paving the way for AI systems that are not only powerful but also transparent and accountable.


About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
