AI Models Can Sabotage Each Other's Decisions

When artificial intelligence systems combine multiple data streams like text, audio, and video, they can become vulnerable to internal sabotage where one component overrides others with misleading information. Researchers from Harvard University and MIT Media Lab have developed a simple diagnostic tool that exposes these hidden conflicts, revealing how AI models can actively mislead their own decision-making processes in critical applications like emotion recognition.

The key finding is that individual components within a multimodal system can sabotage the overall decision by supplying high-confidence but incorrect information that overrides more accurate inputs from other sources. This phenomenon, termed "modality sabotage," occurs when one data stream, such as audio analysis, dominates the final output despite conflicting evidence from text or visual analysis that might point to the correct answer.

To uncover these dynamics, researchers created a lightweight evaluation framework that treats each data modality as an independent agent. For emotion recognition tasks, they separately analyzed text transcripts from speech, audio characteristics like pitch and tempo, and visual cues from facial expressions and body language. Each modality agent generated its own classification candidates with confidence scores, which were then fused into a final decision. This approach made the contribution of each component visible without requiring architectural changes to existing AI models.
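The agent-and-fusion idea can be illustrated with a small Python sketch. It is not the researchers' implementation: the article does not specify the fusion rule, so summing confidences per label is an assumption, and the function names (`fuse`, `find_saboteurs`), the flagging criterion, and the example scores are all hypothetical.

```python
from collections import defaultdict


def fuse(agent_outputs):
    """Fuse per-modality candidates by summing confidence per label.

    agent_outputs maps a modality name to a list of (label, confidence)
    pairs. Summation is an assumed fusion rule, chosen for simplicity.
    Returns labels ranked from highest to lowest fused score.
    """
    totals = defaultdict(float)
    for candidates in agent_outputs.values():
        for label, confidence in candidates:
            totals[label] += confidence
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def top1(candidates):
    """Return the label an agent is most confident about."""
    return max(candidates, key=lambda c: c[1])[0]


def find_saboteurs(agent_outputs, true_label):
    """Flag modalities whose top choice matches a wrong fused decision.

    Assumed criterion: the fused label is wrong, at least one agent had
    the right top-1 answer, and the flagged agent backed the wrong label.
    """
    fused_label = fuse(agent_outputs)[0][0]
    if fused_label == true_label:
        return []
    if not any(top1(c) == true_label for c in agent_outputs.values()):
        return []  # nobody was right, so nothing was overridden
    return [m for m, c in agent_outputs.items() if top1(c) == fused_label]


# Made-up scores: audio is confidently wrong, text is right.
example = {
    "text": [("sad", 0.5), ("neutral", 0.3)],
    "audio": [("angry", 0.95), ("sad", 0.05)],
    "video": [("neutral", 0.4), ("sad", 0.3)],
}

print(fuse(example)[0][0])             # fused decision
print(find_saboteurs(example, "sad"))  # modalities flagged as saboteurs
```

Because each agent's candidates are kept separate until the final step, the same records that produce the decision also show which modality pushed it toward the wrong label.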

The results revealed systematic patterns of sabotage across different datasets and AI backbones. In tests using GPT-5-nano and GPT-4-mini models on emotion recognition benchmarks (MER, MELD, and IEMOCAP), audio emerged as the primary saboteur, frequently overriding other modalities with high-confidence but incorrect predictions. The diagnostic method showed that while top-1 accuracy could drop significantly due to sabotage (from 0.36 to 0.30 in some cases), the AI systems often preserved the correct answer among their top-5 hypotheses, indicating recoverable uncertainty. This suggests the models internally maintain multiple possibilities even when sabotage affects their primary output.
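The gap between top-1 and top-5 accuracy can be made concrete with a short Python sketch. The function name `top_k_accuracy` and the toy predictions are illustrative assumptions, not data from the study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k guesses.

    ranked_predictions holds one list of labels per sample, ordered from
    most to least confident.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy data: the top guess is often wrong, but the right label is nearby.
predictions = [
    ["angry", "sad", "neutral"],
    ["happy", "sad"],
    ["sad", "angry"],
    ["neutral", "happy", "sad"],
]
truths = ["sad", "happy", "angry", "sad"]

print(top_k_accuracy(predictions, truths, k=1))
print(top_k_accuracy(predictions, truths, k=3))
```

A large difference between the two numbers is the signature the article describes: the first-ranked answer has been displaced, but the correct one is still in the candidate list and could in principle be recovered.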

This research matters because it addresses a critical blind spot in increasingly complex AI systems used for applications ranging from mental health assessment to customer service. When AI components can sabotage each other without detection, it undermines reliability in sensitive domains where accurate emotion recognition is crucial. The diagnostic method provides a practical tool for developers to identify which components are most likely to mislead systems and implement safeguards like confidence weighting or human review for high-stakes decisions.

The study acknowledges limitations in establishing strict causality for sabotage events, since multiple agents may coincidentally support the same incorrect label. Additionally, the self-reported quality scores from AI components showed only weak alignment with actual correctness, suggesting current AI systems struggle to accurately assess their own reliability. The research focused specifically on emotion recognition tasks, leaving open questions about how sabotage manifests in other multimodal applications.
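One simple way to check how well self-reported scores track correctness is to correlate the two. The Python sketch below computes a plain Pearson correlation between scores and a 0/1 correctness flag. The article does not say which alignment measure the researchers used, so this choice, the function name `score_correctness_correlation`, and the sample values are assumptions for illustration only.

```python
import math


def score_correctness_correlation(scores, correct):
    """Pearson correlation between self-reported scores and correctness.

    scores: floats reported by an agent, one per sample.
    correct: 1 if the agent's answer was right for that sample, else 0.
    Raises ValueError when either input has no variation, because the
    correlation is undefined in that case.
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(correct) / n
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, correct))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_c = sum((c - mean_c) ** 2 for c in correct)
    if var_s == 0 or var_c == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_s * var_c)


# Hypothetical values: high scores do not reliably mean correct answers.
reported = [0.9, 0.8, 0.85, 0.4, 0.7, 0.95]
was_correct = [1, 0, 0, 1, 1, 0]

print(score_correctness_correlation(reported, was_correct))
```

A value near 1 would mean the scores are a trustworthy signal of correctness; values near zero or below would match the weak alignment the study reports.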

About the Author

Guilherme A.