Multimodal AI systems that process both images and text tend to lean on the text and under-use the image, which can limit how well they understand visual content. The finding points to a weakness in how current AI models integrate different types of information, with implications for applications ranging from medical imaging analysis to autonomous vehicle navigation.
Researchers from the University of Science and Technology of China and Shanghai Jiao Tong University have found that multimodal large language models (MLLMs) systematically prioritize text over visual information. According to the researchers, the bias arises because the models' internal representation spaces for images and text remain largely separated, so the system under-utilizes visual evidence even when the image contains critical information.
To examine how MLLMs process information, the research team studied two popular open-source models, LLaVA-1.5-7B and Qwen2.5-VL-7B, across multiple benchmarks including MMMU and MMBench-CN, which test AI performance on science, technology, engineering, mathematics, humanities, and real-world scenarios. The researchers extracted and analyzed the "key vectors" that these models use to determine which information to focus on during processing, and compared the key vectors produced for visual inputs with those produced for textual inputs.
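To make the idea of comparing key vectors by modality more concrete, here is a minimal sketch in plain Python. It is not the researchers' code: it assumes a single toy attention head, random hidden states, and a hypothetical key projection matrix `w_k`, and the helper names `project_keys` and `split_by_modality` are invented for illustration. A real analysis would read hidden states and projection weights from the model itself.

```python
import random

def project_keys(hidden_states, w_k):
    """Multiply each token's hidden state by the key projection matrix."""
    d_head = len(w_k[0])
    return [
        [sum(h[i] * w_k[i][j] for i in range(len(h))) for j in range(d_head)]
        for h in hidden_states
    ]

def split_by_modality(keys, is_image):
    """Separate key vectors into image-token and text-token groups."""
    image_keys = [k for k, flag in zip(keys, is_image) if flag]
    text_keys = [k for k, flag in zip(keys, is_image) if not flag]
    return image_keys, text_keys

# Hypothetical toy sizes: 6 image tokens followed by 4 text tokens.
random.seed(0)
n_image, n_text, d_model, d_head = 6, 4, 16, 8
hidden = [[random.gauss(0, 1) for _ in range(d_model)]
          for _ in range(n_image + n_text)]
w_k = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
is_image = [True] * n_image + [False] * n_text

image_keys, text_keys = split_by_modality(project_keys(hidden, w_k), is_image)
print(len(image_keys), len(text_keys), len(image_keys[0]))  # prints: 6 4 8
```

Once the keys are split this way, the two groups can be compared with any distribution-level metric.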
The results show a clear pattern of modality bias. Visualizations revealed that visual tokens form compact clusters, separate from the more diffuse textual manifolds, throughout the models' processing layers. Quantitative analysis using Jensen-Shannon divergence and Maximum Mean Discrepancy showed that the difference between how a model represents images and how it represents text (cross-modality divergence) far exceeds the variation within either modality alone. For LLaVA-1.5-7B, the image-text divergence reached 0.408, compared with only 0.012 for text-text comparisons.
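For readers unfamiliar with the two metrics, the following is a rough sketch of how each can be computed, using only the Python standard library. It runs on synthetic 2-D points, not on real model activations, and the kernel choice (Gaussian), the `gamma` value, and the cluster parameters are all assumptions made for illustration; the study's exact estimators and settings are not described here.

```python
import math
import random

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(xs, ys, gamma=0.5):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, gamma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sample_cluster(center, spread, n):
    """Draw n synthetic 2-D points around a center (stand-in for key vectors)."""
    return [[random.gauss(c, spread) for c in center] for _ in range(n)]

random.seed(0)
# Assumed toy geometry: a compact "image" cluster and two diffuse "text" samples.
image_pts = sample_cluster([3.0, 3.0], 0.2, 50)
text_a = sample_cluster([0.0, 0.0], 1.0, 50)
text_b = sample_cluster([0.0, 0.0], 1.0, 50)

cross_modality = mmd_squared(image_pts, text_a)
within_text = mmd_squared(text_a, text_b)
print(cross_modality > within_text)

# JS divergence is 0 for identical distributions and 1 (in bits) for disjoint ones.
js_same = js_divergence([0.5, 0.5], [0.5, 0.5])
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

In this toy setup the cross-modality MMD is much larger than the within-text MMD, which is the same qualitative pattern the article reports: a gap between modalities that dwarfs the variation inside one modality.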
This bias has practical consequences. When an AI system processing medical images or analyzing security footage systematically favors textual descriptions over visual evidence, it may miss critical patterns or make incorrect judgments. The research indicates that this is not merely a data imbalance issue but stems from the way these models integrate different types of information at the architectural level. The simpler model with a linear adapter (Qwen2.5-VL-7B) showed the stronger bias of the two, with peak divergence values reaching 1.054, which suggests that architectural choices significantly affect how severely the bias manifests.
The study's limitations include its focus on specific model architectures and benchmarks, leaving open questions about how this bias might vary across different AI systems and applications. The researchers note that while their analysis provides strong evidence of the bias mechanism, developing effective mitigation strategies will require additional investigation into how to better align the representation spaces for different modalities within AI systems.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.