AI Now Explains Its Decisions Faithfully

Artificial intelligence systems often operate as black boxes, which makes it hard to understand how they reach decisions. That is a serious problem in critical areas like healthcare. A new method called FaCT (Faithful Concept Traces) addresses this by building explanations into the neural network itself, so that they reflect the model's actual reasoning rather than approximations or external assumptions. The aim is to make AI systems easier to interpret and trust in real-world applications.

FaCT decomposes a neural network's decision into human-interpretable concepts, such as 'wheel' or 'yellow color' in image recognition, and traces these concepts faithfully across layers. For example, when the model identifies a school bus, the method shows that the yellow-color concept contributes 4.3% of the logit score, and that all concept contributions together sum to the final logit. The approach builds on B-cos networks and sparse autoencoders so that explanations are derived directly from the model's forward pass, avoiding the unfaithfulness common in post-hoc methods.
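
The arithmetic behind a figure like "4.3% of the logit" can be illustrated with a small sketch. The concept names other than 'wheel' and 'yellow color', and all of the numeric values, are invented for illustration and are not taken from the paper. The only property assumed is the one described above: the contributions add up to the logit.

```python
# Hypothetical per-concept contributions to the "school bus" logit.
# These numbers are invented for illustration; they are not from the paper.
contributions = {
    "yellow color": 0.86,
    "wheel": 2.40,
    "window row": 1.10,
    "other concepts": 15.64,
}

# With an additive decomposition, the logit is the sum of all contributions.
logit = sum(contributions.values())


def share(concept: str) -> float:
    """Return the fraction of the logit attributed to one concept."""
    return contributions[concept] / logit


for name in contributions:
    print(f"{name}: {share(name):.1%} of the logit")
```

Contributions can in principle be negative, for a concept that argues against the class, in which case its share is negative as well.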

The methodology involves integrating sparse autoencoders into the neural network architecture to extract concept activations at intermediate layers. These activations are then used to compute concept contributions to the output logits through dynamic-linear transformations. This design allows for faithful attribution of decisions to specific concepts and visualization of these concepts at the input level, such as highlighting relevant pixels in an image. The process ensures that explanations are model-inherent, meaning they are part of the network's operation rather than added afterward.
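
As a rough sketch of this idea, the toy model below encodes a feature vector into concept activations with a ReLU sparse-autoencoder encoder, reconstructs the features with a bias-free decoder, and applies a bias-free linear head. This is a simplification and not the paper's implementation: FaCT uses dynamic-linear (input-dependent) transformations from B-cos networks, whereas the head here is a fixed linear map, and every weight, size, and function name is a made-up assumption. The point it illustrates is that, without bias terms after the concept layer, the logit splits exactly into one term per concept.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Toy sizes: 3 features, 4 concepts, 2 classes. All weights are invented.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, -1.0]]
B_ENC = [0.0, -0.2, 0.0, 0.0]
# Decoder is 3 x 4: column k is the feature-space direction of concept k.
W_DEC = [[1.0, 0.0, 0.0, 0.3], [0.0, 1.0, 0.0, 0.3], [0.0, 0.0, 1.0, -0.5]]
W_HEAD = [[0.8, -0.1, 0.4], [-0.3, 0.9, 0.2]]  # 2 x 3, no bias


def encode(h):
    """Concept activations: ReLU(W_ENC h + B_ENC)."""
    pre = [p + b for p, b in zip(matvec(W_ENC, h), B_ENC)]
    return relu(pre)


def forward(h):
    """Run the toy model on the reconstructed features; return logits and activations."""
    a = encode(h)
    h_hat = matvec(W_DEC, a)  # bias-free reconstruction from concepts only
    return matvec(W_HEAD, h_hat), a


def concept_contributions(a, cls):
    """Contribution of concept k to class cls: a_k * (w_cls . d_k)."""
    contribs = []
    for k, a_k in enumerate(a):
        d_k = [row[k] for row in W_DEC]
        effective_weight = sum(w * d for w, d in zip(W_HEAD[cls], d_k))
        contribs.append(a_k * effective_weight)
    return contribs


h = [1.0, 0.5, 0.2]
logits, activations = forward(h)
contribs = concept_contributions(activations, cls=0)
print("logit for class 0:", logits[0])
print("sum of concept contributions:", sum(contribs))
```

Because the downstream computation runs on the reconstruction rather than on the original features, nothing bypasses the concept layer, which is what lets the per-concept terms account for the whole logit in this sketch.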

Results from the paper show that FaCT maintains competitive performance on ImageNet, with accuracy drops of less than 3%, while significantly improving concept consistency. In a user study with 38 participants, FaCT's visualizations raised interpretability scores by up to 0.5 points on a 5-point scale compared to baseline methods. The proposed C2-score metric, which evaluates concept consistency without human annotations, indicated that FaCT concepts are more consistent, with scores improving from 0.09 to 0.37 in some cases. Additionally, concept deletion experiments confirmed that removing the concepts FaCT ranks as most important leads to sharper drops in accuracy, indicating better faithfulness.
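
The logic of a concept deletion test can be sketched as follows. This is a simplified stand-in, not the paper's protocol: the paper measures accuracy over a dataset, while the sketch tracks the logit of a single hypothetical sample under a purely additive model. The contribution values and the function name `deletion_curve` are invented for illustration.

```python
def deletion_curve(contributions, order):
    """Logit after cumulatively removing concepts in the given order."""
    remaining = sum(contributions)
    curve = [remaining]
    for k in order:
        remaining -= contributions[k]
        curve.append(remaining)
    return curve


# Invented contributions of five concepts to one class logit.
contributions = [0.2, 3.1, -0.4, 1.5, 0.6]

# Remove the concepts with the largest contributions first.
by_importance = sorted(
    range(len(contributions)), key=lambda k: contributions[k], reverse=True
)
# Baseline: remove concepts in an arbitrary (index) order.
arbitrary = list(range(len(contributions)))

curve_ranked = deletion_curve(contributions, by_importance)
curve_arbitrary = deletion_curve(contributions, arbitrary)

print("ranked deletion:   ", [round(v, 2) for v in curve_ranked])
print("arbitrary deletion:", [round(v, 2) for v in curve_arbitrary])
```

If the importance scores are faithful, the ranked curve falls faster than the arbitrary one. In a real network, removing a concept also changes downstream computation, so the curve has to be measured by rerunning the model rather than by subtraction.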

In practical terms, this means AI systems can give transparent explanations that help users understand why a decision was made, in settings such as medical diagnostics or autonomous driving. For instance, if a model misclassifies a basketball image as a volleyball, FaCT can identify shared concepts like 'ball' or 'jersey' that contribute to the confusion, offering insight into the model's errors. This transparency is crucial for building trust and for checking that AI decisions align with human expectations, potentially reducing risks in high-stakes environments.

One limitation is that some concepts may remain uninterpretable, since not all automatically discovered features align with human understanding. The paper notes that future work could focus on training models from scratch to encourage more interpretable concepts, or on exploring applications in other modalities such as language. Despite these challenges, FaCT represents a significant step toward making AI explanations both faithful and accessible, paving the way for safer and more accountable intelligent systems.

About the Author

Guilherme A.