Artificial intelligence systems that can anticipate human actions could transform fields from assistive robotics to autonomous driving. A new approach developed by researchers at the University of Hong Kong enables AI to learn from human eye movements during training, allowing it to better understand current activities and predict future actions without requiring eye-tracking data during actual use.
The researchers created Gaze-VLM, a framework that enhances vision-language models by incorporating human gaze patterns as a training signal. Unlike previous methods that required eye-tracking data during both training and deployment, this system only uses gaze information during training, making it practical for real-world applications where eye-tracking equipment might be unavailable.
The method works by converting raw gaze coordinates into heatmaps that represent where people look during activities. These heatmaps are then aggregated over short time windows and filtered to remove unreliable data points, creating robust supervision signals. During training, the AI model's attention mechanism is guided to align with these human gaze patterns using the Kullback-Leibler divergence, a measure of how much one probability distribution differs from another. Here it penalizes the model when its attention distribution departs from the human gaze distribution.
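The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the function names, the Gaussian shape of the heatmap, the `sigma` value, the rule for discarding unreliable samples, and the direction of the divergence (human gaze as the reference distribution) are all hypothetical choices made for the example.

```python
import math

def gaze_heatmap(x, y, width, height, sigma=1.5):
    """Gaussian heatmap centered on one gaze point, normalized to sum to 1."""
    grid = [[math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
             for c in range(width)] for r in range(height)]
    total = sum(map(sum, grid))
    return [[v / total for v in row] for row in grid]

def aggregate_window(gaze_points, width, height, sigma=1.5):
    """Average per-sample heatmaps over a short time window."""
    # Assumed filter: drop missing samples and points outside the frame.
    valid = [(x, y) for (x, y) in gaze_points
             if x is not None and y is not None
             and 0 <= x < width and 0 <= y < height]
    if not valid:
        return None  # no reliable gaze in this window, so no supervision
    maps = [gaze_heatmap(x, y, width, height, sigma) for x, y in valid]
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(width)]
            for r in range(height)]

def kl_divergence(human, model, eps=1e-8):
    """KL(human || model) for two same-shaped maps that each sum to 1."""
    return sum(h * math.log((h + eps) / (m + eps))
               for h_row, m_row in zip(human, model)
               for h, m in zip(h_row, m_row))

# Toy window: two usable samples, one missing, one outside an 8x8 grid.
window = [(2.0, 3.0), (2.5, 3.2), (None, None), (40.0, 1.0)]
target = aggregate_window(window, width=8, height=8)

# Stand-in for a model whose attention is spread evenly over the frame.
uniform = [[1 / 64] * 8 for _ in range(8)]

print(kl_divergence(target, target))   # zero: the distributions are identical
print(kl_divergence(target, uniform))  # positive: attention ignores the gaze
```

In a training loop, a term like this would presumably be added to the model's usual loss with some weighting, so that attention is nudged toward the gaze target rather than forced to match it. Because the gaze target only appears in the loss, nothing in this sketch is needed once training is finished, which is consistent with the article's point that deployment requires no eye-tracking data.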
Experimental results show improvements across multiple AI architectures. The system boosted performance by up to 11% for predicting future activities and around 7% for understanding current actions compared to baseline models trained without gaze regularization. All five vision-language model architectures tested, including OpenFlamingo, LaViLa, InternVL, and OpenLLaVA, showed consistent improvements, suggesting the method applies broadly.
The approach also reduced visual hallucinations, in which AI systems generate descriptions containing objects or actions not present in the visual input. In human evaluation, the gaze-regularized model's hallucination rate fell from 20.5% to 14.0%, making its outputs more reliable. Quantitative analysis confirmed that the trained models achieved 68% overlap with human gaze patterns, compared to just 42% for models without gaze regularization.
For real-world applications, this technology could enable assistive robots to better anticipate user needs in dynamic environments. The system's ability to operate without gaze data during deployment makes it suitable for wearable AI systems and human-machine collaboration scenarios where eye-tracking might be impractical. The modular design allows integration with various transformer-based vision-language models without requiring architectural changes.
The research does have limitations. The method depends on the quality of gaze data during training, and misaligned eye-tracking could lead to suboptimal performance. Additionally, the current implementation processes videos at one frame per second, which may limit sensitivity to rapid actions. The framework also shows reduced effectiveness for single-image understanding compared to video sequences, suggesting its full potential emerges when analyzing temporal patterns.
Future work could explore how different types of gaze behavior, such as searching versus question-answering patterns, might be optimized for specific tasks. The researchers have released their code and dataset to support further development of gaze-enhanced AI systems that more closely align with human visual attention and intention.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.