
AI Fails to Understand Videos from Different Views

AI systems struggle to interpret the same events consistently when viewed from different camera angles, raising concerns about their reliability in security monitoring and autonomous navigation.

AI Research
November 14, 2025
3 min read

Artificial intelligence systems struggle to consistently interpret videos when the camera angle changes, a new study reveals. This limitation affects applications from security monitoring to autonomous navigation, where reliable video understanding is crucial. Researchers have developed a benchmark and a novel method to address this gap, showing that current AI models often misinterpret the same events when viewed from different perspectives.

The researchers found that Video Large Language Models (Video-LLMs) perform significantly worse when analyzing synchronized egocentric (first-person) and exocentric (third-person) video pairs than when analyzing a single view. On tasks like temporal verification (determining whether an event occurs in a video) and temporal grounding (identifying when it happens), models reached only about half of their single-view accuracy once their answers had to agree across views. In the EgoExo-Con benchmark, for example, open-source models showed roughly a 50% drop in consistency metrics, indicating they fail to maintain a stable understanding of the same event across viewpoints.
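To make the verification setting concrete, the sketch below shows one way a cross-view consistency score of this kind could be computed: a model is only credited when it answers the same yes/no query correctly on both the ego and exo recordings of an event. The field names and scoring rule are illustrative assumptions for this article, not the benchmark's actual schema.

```python
# Minimal sketch of a cross-view consistency score for temporal verification.
# Each item is assumed to hold a ground-truth label plus the model's yes/no
# answers on the ego and exo views of the same event (hypothetical schema).

def verification_scores(items):
    single_correct = 0   # correct answers counted per view
    both_correct = 0     # items answered correctly on *both* views
    for item in items:
        ego_ok = item["ego_answer"] == item["label"]
        exo_ok = item["exo_answer"] == item["label"]
        single_correct += ego_ok + exo_ok
        both_correct += ego_ok and exo_ok
    n = len(items)
    return {
        "single_view_acc": single_correct / (2 * n),  # pooled per-view accuracy
        "cross_view_consistency": both_correct / n,   # must be right on both views
    }

example = [
    {"label": True, "ego_answer": True, "exo_answer": False},
    {"label": True, "ego_answer": True, "exo_answer": True},
]
print(verification_scores(example))
# {'single_view_acc': 0.75, 'cross_view_consistency': 0.5}
```

Because consistency requires both views to be answered correctly, it is always at most the single-view accuracy, which is why the gap between the two numbers is the interesting quantity.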

To conduct this analysis, the team created the EgoExo-Con dataset, comprising 491 pairs of synchronized videos from sources like CharadesEgo, LEMMA, and Ego-Exo4D, with 3,178 human-refined queries. They evaluated multiple Video-LLMs, including general-purpose and time-aware models, by having each model process the same events from different camera views and checking whether its outputs agreed. Evaluation used standard accuracy metrics, including temporal Intersection over Union (IoU) for grounding, to measure how well models aligned their predictions across views.
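For readers unfamiliar with temporal IoU, the following sketch shows the standard way it scores a grounding prediction against a ground-truth interval. The 0.5 threshold and the cross-view check are assumptions made for illustration, not necessarily the exact protocol used in the paper.

```python
# Illustrative temporal IoU between a predicted (start, end) interval and a
# ground-truth interval, in seconds.

def temporal_iou(pred, gt):
    """Overlap divided by the combined extent of the two intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A grounding prediction is typically counted as correct when IoU clears a
# threshold (e.g. 0.5); a cross-view pair can then be called consistent only
# if both the ego and exo predictions clear the threshold for the same query.
ego_pred, exo_pred, gt = (3.0, 9.0), (4.5, 12.0), (4.0, 10.0)
iou_ego, iou_exo = temporal_iou(ego_pred, gt), temporal_iou(exo_pred, gt)
consistent = iou_ego >= 0.5 and iou_exo >= 0.5
print(round(iou_ego, 2), round(iou_exo, 2), consistent)  # 0.71 0.69 True
```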

Results from the study, detailed in the paper's Table 1 and Figure 4, show that models trained on single viewpoints do not generalize well to multi-view scenarios. Even when fine-tuned with synchronized data, performance improvements were marginal, and in some cases models underperformed those trained on a single view. For instance, supervised fine-tuning on both ego and exo views led to only slight gains, and one model showed an 8.1% performance gap when trained on exocentric data alone. The researchers also introduced View-GRPO, a reinforcement learning framework that strengthens view-specific reasoning while encouraging cross-view alignment, and it outperformed standard methods on consistency metrics.
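To give a feel for how a GRPO-style framework optimizes such a model, the sketch below shows the group-relative advantage step that this family of methods uses in place of a learned value function, paired with a hypothetical reward mixing per-view correctness and a cross-view agreement bonus. This illustrates the general mechanism only; it is not the paper's View-GRPO objective or reward design.

```python
# Rough sketch of GRPO-style group-relative advantages with a *hypothetical*
# reward combining per-view answer correctness and cross-view agreement.
import statistics

def reward(sample, agreement_weight=0.5):
    # sample: correctness of the sampled answer on its own view, plus whether
    # it agrees with the answer produced for the paired view (assumed fields).
    return sample["correct"] + agreement_weight * sample["agrees_with_other_view"]

def group_relative_advantages(group):
    """Score each response sampled for the same query relative to the group
    mean and standard deviation, as GRPO does instead of using a critic."""
    rewards = [reward(s) for s in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

group = [
    {"correct": 1.0, "agrees_with_other_view": 1.0},
    {"correct": 1.0, "agrees_with_other_view": 0.0},
    {"correct": 0.0, "agrees_with_other_view": 0.0},
]
print([round(a, 2) for a in group_relative_advantages(group)])
# [1.07, 0.27, -1.34]: responses that are both correct and view-consistent
# receive the largest advantage and are reinforced most strongly.
```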

This work matters because robust video understanding is essential for real-world technologies like surveillance systems, where cameras capture events from multiple angles, or in robotics, where agents must interpret environments from varying perspectives. Inconsistent AI responses could lead to errors in critical decisions, such as misidentifying actions in healthcare monitoring or navigation aids. The findings highlight a fundamental challenge in AI development: ensuring models grasp invariant event structures despite visual changes.

Limitations noted in the paper include the dataset's focus on specific domains like daily activities and skilled tasks, which may not cover all real-world scenarios. Additionally, the study points out that current models rely on view-specific biases rather than learning shared abstractions, and the effectiveness of methods like View-GRPO depends on the quality of reward functions, which can introduce uncertainties. Future research is needed to expand the benchmark and refine approaches for broader applicability.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
