
AI Research
March 26, 2026
3 min read
Beyond Sufficiency: How AI Models Really Learn from Human Explanations

In the quest to make artificial intelligence more transparent and trustworthy, researchers have long relied on human explanations—known as rationales—to understand whether models are learning for the right reasons or simply exploiting dataset shortcuts. A new study from computational linguists at Vrije Universiteit Amsterdam and the University of Göttingen reveals that our standard metrics for evaluating these explanations may mask more complexity than they reveal. The paper, titled "Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies," systematically investigates how rationales actually influence model behavior across different tasks and architectures, challenging conventional wisdom about what makes explanations useful.

At the heart of this investigation is the metric of sufficiency, commonly used to measure how informative rationales are by comparing model confidence when using full inputs versus isolated rationales. The researchers reframe this concept as "contextual impact" (CI), where low CI suggests rationales alone are sufficient for predictions, while high CI indicates context words also play a significant role. Through experiments with four transformer models—BERT, Pythia, ModernBERT, and GPT-Neo—across six diverse datasets including sentiment analysis, hate speech detection, and argument mining, the team discovered that CI values vary dramatically between models even on the same data. For instance, BERT consistently showed higher CI across tasks, suggesting it relies more on context information, while ModernBERT exhibited lower CI, indicating greater reliance on rationale content.
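
To make the idea concrete, here is a minimal Python sketch of a sufficiency-style contextual impact score, assuming the common formulation in which CI is the drop in predicted-class confidence when the input is reduced to its rationale tokens. The model name, helper functions, and example input below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; in practice this would be a model fine-tuned on the task.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def predicted_prob(text: str) -> tuple[int, float]:
    """Return the predicted class and its probability for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    label = int(probs.argmax())
    return label, float(probs[label])


def contextual_impact(full_text: str, rationale_words: list[str]) -> float:
    """CI = confidence on the full input minus confidence on the rationale alone.

    Low CI: the rationale by itself recovers the prediction (it is 'sufficient').
    High CI: the surrounding context contributes substantially to the prediction.
    """
    label_full, p_full = predicted_prob(full_text)
    rationale_only = " ".join(rationale_words)
    inputs = tokenizer(rationale_only, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    p_rationale = float(probs[label_full])  # probability of the same class as before
    return p_full - p_rationale


# Hypothetical sentiment input with human-annotated rationale words.
print(contextual_impact("The plot dragged but the acting was brilliant.",
                        ["acting", "brilliant"]))
```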

The methodology involved two distinct learning paradigms to test how models utilize rationale information. First, token classification measured models' ability to distinguish rationale tokens from context tokens. Second, attention regularization examined whether incorporating rationale information during training could improve classification performance. The results revealed no one-size-fits-all pattern: attention regularization showed positive effects in only 6-8 out of 14 task-model combinations, with BERT benefiting most consistently. Surprisingly, the researchers found that highly informative rationales (with low CI) did not necessarily consist of easily identifiable tokens, nor did good sufficiency scores reliably predict that rationales would improve classification performance.
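
The sketch below illustrates one common way attention regularization is set up, assuming the usual recipe of adding a divergence penalty that pulls the model's attention towards the human rationale mask. The function name, the choice of KL divergence, and the weighting term `lam` are assumptions for illustration; the paper's exact regulariser may differ.

```python
import torch
import torch.nn.functional as F


def attention_regularised_loss(logits: torch.Tensor,
                               labels: torch.Tensor,
                               cls_attention: torch.Tensor,   # (batch, seq_len), sums to 1 per row
                               rationale_mask: torch.Tensor,  # (batch, seq_len), binary annotations
                               lam: float = 1.0) -> torch.Tensor:
    # Standard classification objective on the task labels.
    ce_loss = F.cross_entropy(logits, labels)

    # Turn the binary rationale annotations into a target distribution over tokens.
    target = rationale_mask.float()
    target = target / target.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    # KL divergence between the attention the model actually pays and the
    # distribution implied by the human rationale.
    attn = cls_attention.clamp(min=1e-8)
    kl = F.kl_div(attn.log(), target, reduction="batchmean")

    # The auxiliary term nudges attention towards rationale tokens during training.
    return ce_loss + lam * kl
```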

Perhaps the most significant finding emerged from cross-domain argument mining experiments, where attention regularization using rationales dramatically reduced the performance gap between in-domain and cross-domain settings for BERT models. While in-domain performance saw minimal improvement (AR = 1.01), cross-domain performance jumped substantially (AR = 1.14), suggesting that explicit rationale guidance can help models generalize better to unfamiliar domains. The study also examined how different rationale aggregation strategies affect learning, comparing union (including disputed rationale words) versus intersection (only unanimously agreed words) approaches in hate speech detection. The results showed that incorporating multiple human perspectives through union aggregation generally produced lower CI and better model performance than strict intersection approaches.
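
The two aggregation strategies are simple to express in code. The sketch below assumes binary token-level rationale masks per annotator; the function name and example masks are hypothetical.

```python
def aggregate_rationales(annotator_masks: list[list[int]], mode: str = "union") -> list[int]:
    """Combine per-annotator rationale masks into a single token-level mask."""
    if mode == "union":
        # A token counts as rationale if any annotator selected it.
        return [int(any(tok)) for tok in zip(*annotator_masks)]
    elif mode == "intersection":
        # A token counts as rationale only if all annotators agreed on it.
        return [int(all(tok)) for tok in zip(*annotator_masks)]
    raise ValueError(f"unknown mode: {mode}")


masks = [
    [0, 1, 1, 0, 1],  # annotator A
    [0, 1, 0, 0, 1],  # annotator B
    [0, 1, 1, 0, 0],  # annotator C
]
print(aggregate_rationales(masks, "union"))         # [0, 1, 1, 0, 1]
print(aggregate_rationales(masks, "intersection"))  # [0, 1, 0, 0, 0]
```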

The research reveals fundamental limitations in how we currently evaluate explanation quality. Instance-level analysis showed that high CI actually correlates with better predictions—contrary to the initial hypothesis that low CI (high rationale informativeness) would lead to superior performance. This suggests that CI captures the complex interaction between rationales and their contexts rather than absolute rationale strength. The authors conclude that sufficiency/CI primarily measures how context words interfere with rationale information within the same input, and that the informativeness of rationales depends heavily on both the specific task and the model architecture processing them. These findings underscore the need for more nuanced metrics that can systematically capture the multifaceted relationship between explanations, context, and model learning processes.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn