
AI Reveals How Models Balance Text and Images

A new method uses information theory to measure exactly how much each modality contributes to AI decisions, uncovering hidden biases in popular models.

AI Research
March 26, 2026
4 min read

Understanding how artificial intelligence systems process information from multiple sources, like text and images, has long been a challenge for researchers. Multimodal models, which combine vision and language, often show an imbalance, over-relying on one modality while underutilizing the other, as noted in prior studies. This imbalance can lead to unreliable performance in tasks requiring joint reasoning, such as visual question answering or image captioning. A new framework developed by researchers at PES University addresses this by quantifying modality contributions in a principled way, offering clearer insights than previous accuracy-based metrics that conflate individual and interactive effects.

At the core of the work is a metric based on Partial Information Decomposition (PID) that disentangles how much information each modality provides independently versus through interactions with others. This approach decomposes the predictive information in internal embeddings into four components: unique information from text, unique information from image, redundant information shared by both, and synergistic information that arises only when both are considered together. By isolating unique contributions, the framework produces a normalized score, such as text contributing 55% and image 45% in balanced tasks, revealing the true influence of each modality without being skewed by model performance or cross-modal dependencies.
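The four-way decomposition can be illustrated on a toy discrete example. The sketch below uses the classic Williams–Beer I_min redundancy measure, one standard PID estimator; the paper computes its decomposition differently (via IPFP over model embeddings), so this is illustrative only, and all function names here are ours, not the paper's.

```python
import numpy as np

def mutual_info(p_xy):
    """I(X;Y) in bits from a joint distribution table p_xy[x, y]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

def specific_info(p_xy, y):
    """Williams-Beer specific information I(Y=y; X) in bits."""
    py = p_xy.sum(axis=0)[y]
    total = 0.0
    for x in range(p_xy.shape[0]):
        px = p_xy[x].sum()
        if p_xy[x, y] > 0:
            total += (p_xy[x, y] / py) * np.log2((p_xy[x, y] / px) / py)
    return total

def pid_min(p_tiy):
    """Two-source PID (Williams-Beer I_min) from a joint table p[t, i, y].

    Returns (unique_text, unique_image, redundancy, synergy), which sum
    to the joint mutual information I(T,I; Y).
    """
    p_ty = p_tiy.sum(axis=1)            # marginal joint of text and label
    p_iy = p_tiy.sum(axis=0)            # marginal joint of image and label
    p_y = p_tiy.sum(axis=(0, 1))
    redundancy = sum(
        p_y[y] * min(specific_info(p_ty, y), specific_info(p_iy, y))
        for y in range(len(p_y)) if p_y[y] > 0
    )
    joint = p_tiy.reshape(-1, p_tiy.shape[2])  # treat (t, i) as one variable
    unique_t = mutual_info(p_ty) - redundancy
    unique_i = mutual_info(p_iy) - redundancy
    synergy = mutual_info(joint) - unique_t - unique_i - redundancy
    return unique_t, unique_i, redundancy, synergy
```

On an XOR label (Y = T xor I with uniform inputs), neither modality alone carries any information, yet together they determine Y exactly, so the decomposition assigns everything to synergy: a simple sanity check that the four components behave as described.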

The methodology leverages an algorithm based on the Iterative Proportional Fitting Procedure (IPFP) to compute these contributions scalably, without retraining models. The process involves extracting intermediate embeddings from pretrained models, discretizing them via clustering, and solving a convex optimization problem to estimate the PID components. This inference-only analysis is parameter-free and computationally efficient, making it applicable to diverse models like BLIP, LLaVA, PaliGemma, and SmolVLM. The framework operates directly on internal representations, avoiding the pitfalls of perturbation-based approaches that rely on masking inputs or gradient-based explanations, which can be unstable or miss higher-order interactions.
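The article names IPFP only at a high level. The sketch below shows the classic iterative proportional fitting step on a two-dimensional table: repeatedly rescaling rows and columns so the table matches target marginals while preserving the seed's interaction structure. This is the generic primitive, not the paper's full convex-optimization solver, and the names are ours.

```python
import numpy as np

def ipfp(seed, row_marginal, col_marginal, iters=200):
    """Iterative Proportional Fitting: rescale a strictly positive `seed`
    table so its row and column sums match the target marginals, while
    preserving the seed's odds ratios (its interaction structure)."""
    q = seed.astype(float).copy()
    row_marginal = np.asarray(row_marginal, dtype=float)
    col_marginal = np.asarray(col_marginal, dtype=float)
    for _ in range(iters):
        q *= (row_marginal / q.sum(axis=1))[:, None]  # match row sums
        q *= (col_marginal / q.sum(axis=0))[None, :]  # match column sums
    return q
```

Starting from a uniform seed, the fitted table converges to the product of the target marginals; starting from a correlated seed, the correlations survive the fitting, which is what makes IPFP useful for estimating information-theoretic quantities under marginal constraints.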

Experiments across multiple benchmarks show consistent patterns. On balanced tasks like VQA and GQA, models exhibited near-equal contributions, with text around 55% and image 45%, as detailed in Table 1. For instance, BLIP scored 55.45% text and 44.55% image on VQA, while SmolVLM showed a higher text bias at 60.90%. In visually grounded tasks like NLVR2, image contribution increased, but SmolVLM remained text-oriented at 65.18%, indicating limited visual grounding due to its smaller capacity. Synthetic validation, as shown in Figure 4, confirmed the metric's accuracy, with additive fusion yielding equal contributions and weighted fusion shifting toward the dominant modality. Ablation studies in Table 2 revealed that fusion strategies affect balance, with text-to-image cross-attention overemphasizing text at 61.5%, while bidirectional fusion achieved a more equitable 54.7% image and 45.3% text.
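Percentages like 55.45% text versus 44.55% image can be read as the two unique-information terms normalized to sum to 100%. The one-liner below shows that reading; it is our assumption about the normalization, as the paper may fold redundancy and synergy in differently.

```python
def contribution_scores(unique_text, unique_image):
    """Normalize two unique-information estimates (in bits) into
    percentage contribution scores summing to 100%."""
    total = unique_text + unique_image
    return 100 * unique_text / total, 100 * unique_image / total
```

For example, unique-information estimates of 0.11 and 0.09 bits would yield contribution scores of 55% text and 45% image.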

The implications of this research are significant for developing more reliable and interpretable AI systems. By providing a performance-agnostic view of modality contributions, it helps identify biases, such as text dominance in smaller models, which can inform better model design and training strategies. For real-world applications, this could lead to AI that more accurately integrates multimodal data in areas like healthcare diagnostics or autonomous vehicles, where balanced understanding is critical. The framework's ability to capture cross-modal interactions also advances transparency, allowing researchers to debug models and ensure they leverage all available information effectively, rather than relying on spurious correlations.

However, the approach has limitations. The PID estimates are sensitive to noise in embedding spaces, and small variations in feature distributions can influence the inferred information components. As with other representation-level analyses, the method may not be fully immune to spurious correlations in data or model representations, potentially affecting attribution scores. Additionally, it assumes access to intermediate embeddings, which may not be available for closed-source or API-restricted systems. Addressing these limitations, particularly improving robustness to noise and mitigating spurious correlations, remains an important direction for future work, as noted in the paper's limitations section.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn