AI Interpretability Tools Can Mislead Researchers

TL;DR

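Causal interventions can push a neural network's internal representations outside their natural distribution, which undermines the explanations built on them. The study adapts a regularization loss from prior work to keep intervened representations realistic.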

Understanding how artificial intelligence systems make decisions is crucial for building trust and safety in AI, but a new study reveals that common techniques used to probe these systems can create misleading results. Researchers have found that causal interventions, which manipulate internal neural network states to test hypotheses about AI behavior, often produce representations that diverge significantly from the model's natural data distribution. This divergence can lead to incorrect conclusions about how AI models work, potentially undermining efforts to interpret and control complex AI systems.

The key finding is that causal interventions, such as activation patching and Distributed Alignment Search (DAS), frequently shift neural network representations out-of-distribution. As shown in Figure 2, methods like mean-difference vector patching, sparse autoencoder projections, and DAS result in intervened representations (light points in scatter plots) that differ from natural ones (dark points), quantified by increased Earth Mover's Distance (EMD). This means that when researchers alter AI internal states to study mechanisms, the resulting states may not reflect how the AI normally operates, casting doubt on interpretability claims.
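To make the measurement concrete, here is a minimal sketch of how such a divergence could be quantified. It is not the study's code: the 2-D data are synthetic, "mean-difference patching" is assumed here to mean adding the difference of class means to each representation, and EMD is computed along a single dimension only, which is a simplification.

```python
import random


def emd_1d(xs, ys):
    """Earth Mover's Distance between two equal-size 1-D samples.

    For equal-size samples the optimal transport plan pairs sorted values,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this simplified version needs equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_vector(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


random.seed(0)
N = 500

# Hypothetical 2-D "representations" (an assumption for illustration):
# the classes differ in mean AND in spread along dimension 1.
class_a = [(random.gauss(-1.0, 0.3), random.gauss(0.0, 0.1)) for _ in range(N)]
class_b = [(random.gauss(1.0, 0.3), random.gauss(2.0, 1.0)) for _ in range(N)]
class_b_fresh = [(random.gauss(1.0, 0.3), random.gauss(2.0, 1.0)) for _ in range(N)]

# Mean-difference patching (assumed form): shift each class-A point
# by the vector (mean of B - mean of A).
mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
shift = tuple(b - a for a, b in zip(mu_a, mu_b))
patched = [tuple(x + s for x, s in zip(p, shift)) for p in class_a]

dim = 1  # compare distributions along the second dimension
emd_patched = emd_1d([p[dim] for p in patched], [p[dim] for p in class_b])
emd_baseline = emd_1d([p[dim] for p in class_b_fresh], [p[dim] for p in class_b])

print(f"EMD(patched A, natural B) = {emd_patched:.3f}")
print(f"EMD(fresh B,   natural B) = {emd_baseline:.3f}")
```

The patched points have the right mean but the wrong spread, so their distance to natural class-B representations is larger than the distance between two independent natural samples. The baseline matters: without it, there is no reference for how much EMD sampling noise alone produces.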

Methodologically, the study ran empirical tests on transformer models, measuring divergence as the EMD between natural and intervened representations. The researchers also provided a theoretical framework that categorizes divergences as 'harmless' or 'pernicious'. Harmless divergences occur within the network's decision boundaries or null-spaces, where changes do not affect output behavior, as illustrated in Figure 1(b) top panel. In contrast, pernicious divergences either activate hidden pathways or cause behavioral changes that stay dormant, leading to misleading confirmations or undetected output flips, as shown in Figure 1(b) bottom panel and detailed in Section 4.2.
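The harmless case can be illustrated with a toy linear readout. This is a sketch under stated assumptions, not the study's setup: the weights, the data points, and the size of the perturbation are all hypothetical. Moving a representation along a direction in the readout's null-space takes it far from the natural data, yet leaves the output untouched.

```python
def readout(h, w=(1.0, -1.0), b=0.0):
    """Hypothetical linear readout: logit = w . h + b."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def predict(h):
    """Binary decision from the sign of the logit."""
    return 1 if readout(h) > 0 else 0


# Hypothetical natural representations (illustrative values only).
natural = [(0.5, 0.1), (0.8, -0.2), (-0.4, 0.3), (-0.9, 0.0)]

# (1, 1) lies in the null-space of w = (1, -1): w . (1, 1) = 0.
null_dir = (1.0, 1.0)
scale = 50.0  # a large move, far outside the natural range

intervened = [
    tuple(hi + scale * ni for hi, ni in zip(h, null_dir)) for h in natural
]

for h_nat, h_int in zip(natural, intervened):
    # The logit changes only by floating-point rounding, if at all.
    print(h_nat, "->", h_int,
          "| logit", round(readout(h_nat), 6), "->", round(readout(h_int), 6),
          "| prediction", predict(h_nat), "->", predict(h_int))
```

In a real network the downstream computation is nonlinear, so whether a direction is truly invisible to every later layer is much harder to establish than in this two-weight example.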

The results indicate that pernicious divergences can flip AI decisions by activating units that are silent under natural conditions, as demonstrated in a simulated two-layer circuit where mean-difference patching altered predictions. To mitigate this issue, the study adapted the Counterfactual Latent (CL) loss from prior work. By regularizing interventions to stay closer to natural distributions, the CL loss reduced EMD along causal dimensions and improved out-of-distribution performance in synthetic tasks, as seen in Figure 3(d). This approach helps ensure that intervened representations remain realistic, preserving the interpretive power of causal methods.
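The mechanism behind such a flip can be sketched with a hand-built toy circuit. Everything here is hypothetical (the weights, the gate threshold `GATE`, and the six data points) and is not the circuit simulated in the study; mean-difference patching is again assumed to mean adding the difference of class means. One unit stays at zero on every natural input, and patching pushes one representation far enough to switch it on.

```python
def relu(x):
    return max(0.0, x)


GATE = 3.2  # hypothetical threshold; natural h[1] values never exceed 3.0


def dormant_unit(h):
    """Second-layer unit that is silent on all natural inputs."""
    return relu(h[1] - GATE)


def circuit(h):
    """Toy two-layer circuit: ReLU units followed by a linear output."""
    pos = relu(h[0])
    neg = relu(-h[0])
    return pos - neg - 10.0 * dormant_unit(h)


def predict(h):
    return "B" if circuit(h) > 0 else "A"


def mean_vector(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


# Hypothetical natural first-layer representations for two classes.
class_a = [(-1.0, 0.0), (-1.0, 1.0), (-1.0, 2.0)]
class_b = [(1.0, 2.0), (1.0, 2.5), (1.0, 3.0)]

# Mean-difference patching (assumed form): add (mean of B - mean of A).
mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
shift = tuple(b - a for a, b in zip(mu_a, mu_b))
patched = [tuple(x + s for x, s in zip(h, shift)) for h in class_a]

natural_preds = [predict(h) for h in class_a + class_b]
patched_preds = [predict(h) for h in patched]

print("natural predictions:", natural_preds)
print("patched points:     ", patched)
print("patched predictions:", patched_preds)
print("dormant unit active on:", [h for h in patched if dormant_unit(h) > 0])
```

The third patched point lands at a second coordinate of 3.5, beyond anything seen in natural data. That wakes the dormant unit, and the circuit outputs "A" where a natural class-B representation would give "B". An analysis that only checked the first two points would conclude the patch works cleanly.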

In real-world contexts, these findings matter because unreliable interpretability could lead to flawed AI safety assessments or misguided regulations. For instance, if interventions in self-driving car AI or medical diagnosis systems produce unrealistic states, researchers might incorrectly identify critical mechanisms, risking real-world failures. The study highlights that while some divergence is acceptable for certain claims, pernicious types undermine the goal of fully understanding AI decision-making.

Limitations of the work include its confinement to narrow, synthetic settings and the inability to reliably classify all representations as harmless or pernicious. As noted in the discussion, future research should explore self-supervised methods to better detect and mitigate divergence, ensuring that interpretability techniques can scale to complex, real-world AI systems without introducing artifacts.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
