
Why AI Fails at Understanding Images

AI often misinterprets images because of a fundamental flaw in how it learns, mixing up objects and their relationships. Here's how solving that flaw could lead to more trustworthy AI for visual tasks.

AI Research
November 14, 2025
3 min read

Artificial intelligence systems that link images and text, like CLIP, have transformed how machines interpret visual data, enabling applications from automated captioning to content moderation. Yet, these models often stumble on basic reasoning tasks, confusing objects, attributes, and relationships in ways that limit their reliability. A new study reveals that this weakness stems from a fundamental flaw in how these systems learn, offering a path to more robust AI.

The researchers found that CLIP-based models, despite excelling at matching images and text, fail to grasp compositional structures—such as distinguishing 'a white cat' from 'a cat on white grass.' This occurs because the training process cannot differentiate between optimal and pseudo-optimal solutions. As shown in Theorems 7–9, pseudo-optimal encoders achieve the same alignment scores as true ones during pre-training but are insensitive to operations like swapping, replacing, or adding tokens in text descriptions. For instance, a model might treat 'bulb on grass' and 'grass on bulb' as equivalent, even though they describe different scenes.
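To see what this insensitivity looks like in practice, the snippet below scores a single image against a caption and its token-swapped counterpart using a public CLIP checkpoint from Hugging Face. This is a minimal sketch, not the paper's evaluation protocol: the image file name and captions are illustrative, and the point is simply that a compositionally blind encoder assigns the two captions nearly identical scores.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a standard public CLIP checkpoint (any CLIP-style model would do).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "scene.jpg" is a placeholder for a local photo of, say, a bulb lying on grass.
image = Image.open("scene.jpg").convert("RGB")
captions = [
    "a bulb on grass",   # correct description
    "grass on a bulb",   # SWAP: same tokens, different relationship
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each caption;
# near-equal probabilities indicate the swap went unnoticed.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```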

To uncover this, the team developed a token-aware causal representation framework, building on structural causal models (SCMs) that treat text as sequences of tokens rather than single vectors. This approach, detailed in Theorem 5, extends block identifiability to the token level, proving that contrastive learning can recover modal-invariant variables. The methodology involved analyzing how CLIP's training objective, which minimizes alignment loss between image and text features, leads to non-identifiability—where multiple encoder solutions perform equally well on training data but differ in compositional reasoning.
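For readers unfamiliar with that objective, here is a minimal sketch of the symmetric contrastive loss that CLIP-style pre-training minimizes; the function name and temperature value are illustrative choices, not taken from the paper. The key point is that the loss only sees the matrix of image-text similarities, so any "pseudo-optimal" encoder pair that reproduces those similarities is rewarded exactly as much as the true one.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired image/text features."""
    # Normalize so the dot product is cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarities: logits[i, j] = sim(image_i, text_j) / temperature.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched pairs sit on the diagonal; the loss pulls them together and
    # pushes apart every other pairing in the batch, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```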

Results from experiments on benchmarks like ARO and VALSE, referenced in Figures 3 and 4, show that up to 80% of hard negatives in these tests fall into the SWAP, REPLACE, or ADD categories defined by the theorems, confirming that the theoretical non-identifiability translates into real-world failures. For example, in Table 2, models trained with iterative applications of these operations saw accuracy improvements, such as a 1.84-point gain on CC3M data. The study also links these text-side issues to visual-side modality gaps, where image features fail to align with nuanced language structures.
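As an illustration of how such hard negatives can be generated, the sketch below applies simple token-level SWAP, REPLACE, and ADD operations to a caption, optionally repeating them to produce harder examples. These toy functions stand in for the operations named in the theorems and are not the authors' implementation; the vocabulary and example caption are made up.

```python
import random

def swap_tokens(tokens: list[str]) -> list[str]:
    """SWAP: exchange two tokens, changing the described relationship."""
    out = tokens[:]
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def replace_token(tokens: list[str], vocab: list[str]) -> list[str]:
    """REPLACE: substitute one token with a distractor word."""
    out = tokens[:]
    out[random.randrange(len(out))] = random.choice(vocab)
    return out

def add_token(tokens: list[str], vocab: list[str]) -> list[str]:
    """ADD: insert a distractor token at a random position."""
    out = tokens[:]
    out.insert(random.randrange(len(out) + 1), random.choice(vocab))
    return out

def hard_negatives(caption: str, vocab: list[str], rounds: int = 2) -> list[str]:
    """Produce one negative per operation, applying it `rounds` times for harder examples."""
    operations = [
        lambda toks: swap_tokens(toks),
        lambda toks: replace_token(toks, vocab),
        lambda toks: add_token(toks, vocab),
    ]
    negatives = []
    for op in operations:
        tokens = caption.split()
        for _ in range(rounds):
            tokens = op(tokens)
        negatives.append(" ".join(tokens))
    return negatives

# Example: generate distractor captions for a scene description.
print(hard_negatives("a bulb on green grass", vocab=["cat", "red", "under"]))
```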

This research matters because it explains why AI systems sometimes misinterpret everyday scenes, affecting technologies like autonomous vehicles or medical imaging that rely on accurate vision-language integration. By identifying the root cause, the work suggests that improving hard-negative mining strategies—such as generating more complex examples through repeated operations—could enhance model robustness without requiring entirely new architectures.

Limitations include the focus on textual non-identifiability, with visual aspects less rigorously addressed. The paper notes that combining multiple operations may have diminishing returns, and the framework assumes ideal conditions like infinite data, which may not hold in practice. Future work could explore how these findings apply to other multimodal systems and real-time applications.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
