
AI Decodes Eye Movements to Predict Human Actions

A new method combines gaze and body motion to generate detailed behavioral narrations, improving AI's ability to anticipate and summarize human activities.

AI Research
March 26, 2026
4 min read

Artificial intelligence systems that understand human behavior have long focused on body movements, but a new approach reveals that eye gaze holds crucial, untapped information. Researchers from The Hong Kong University of Science and Technology have developed GazeInterpreter, a system that parses raw eye gaze data and integrates it with body motion to generate comprehensive narrations of human activities. This work addresses a significant gap in human-aware AI, where prior methods typically neglected the synergy between gaze and body motion, leading to incomplete interpretations. By bridging this gap, the system offers a more robust foundation for applications such as robotics, virtual reality, and assistive technologies, where predicting and understanding human intentions is key.

The core finding of the research is that eye gaze, when combined with body motion, provides a superior basis for interpreting human behavior compared to body motion alone. The researchers validated this by testing their eye-body-coordinated narrations on text-driven motion generation using the large-scale Nymeria benchmark, which includes 300 hours of daily activities. When these narrations were used as input to the MotionGPT model, they significantly improved performance: for example, on low-level activities like walking, the Fréchet Inception Distance (a measure of realism, where lower is better) dropped from 7.458 to 6.801, and Top-1 accuracy (measuring text-motion alignment) increased from 0.052 to 0.102. These results demonstrate that the integrated narrations capture finer details and richer context, enabling more accurate motion synthesis.
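To make the realism metric concrete, the sketch below shows how a Fréchet-distance score of this kind is computed between feature vectors of real and generated samples. The feature dimensions, random data, and function name are illustrative assumptions, not the benchmark's actual evaluation pipeline.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two sets of feature vectors.

    feats_real, feats_gen -- arrays of shape (num_samples, feature_dim),
    e.g. features of real vs. generated motions. The feature extractor used
    in the Nymeria/MotionGPT evaluation is not reproduced here.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real

    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Toy usage with random features; the real evaluation uses motion-model features.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 16)), rng.normal(0.1, 1.0, size=(256, 16))))
```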

The methodology behind GazeInterpreter involves a three-phase hierarchical process. First, a symbolic gaze parser converts raw gaze signals, composed of yaw and pitch coordinates, into symbolic events such as fixations, saccades, and smooth pursuits using the Identification-by-Velocity-Threshold (I-VT) algorithm. This step abstracts noisy data into a machine-readable format. Next, a large language model (specifically Gemini-2.5-Flash) translates these events into textual gaze narrations, which are then integrated with body motion narrations within a sliding observation window to produce eye-body-coordinated narrations. The integration considers historical context to ensure temporal coherence, for example linking a gaze shift with a walking motion to infer actions like scanning for obstacles. Finally, a self-correcting loop iteratively refines the narrations along quality dimensions such as continuity, modality match, temporal coherence, and completeness, using a threshold-governed mechanism to filter out noise and hallucinations.
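For readers unfamiliar with velocity-threshold gaze parsing, here is a minimal sketch of how raw yaw/pitch samples can be segmented into fixation and saccade events by thresholding angular velocity. The threshold value, function name, and event format are illustrative assumptions; the paper's parser also detects smooth pursuits and is not reproduced here.

```python
import numpy as np

def classify_gaze_events(yaw, pitch, timestamps, saccade_threshold_deg_s=30.0):
    """Toy I-VT-style classifier: label gaze samples as fixation or saccade.

    yaw, pitch   -- gaze angles in degrees, shape (N,)
    timestamps   -- sample times in seconds, shape (N,)
    saccade_threshold_deg_s -- illustrative velocity threshold (deg/s);
                               the paper's actual threshold is not given here.
    """
    dt = np.diff(timestamps)
    # Angular displacement between consecutive samples (small-angle approximation).
    d_angle = np.sqrt(np.diff(yaw) ** 2 + np.diff(pitch) ** 2)
    velocity = d_angle / np.maximum(dt, 1e-6)  # deg/s per sample pair

    labels = np.where(velocity >= saccade_threshold_deg_s, "saccade", "fixation")

    # Merge runs of identical labels into symbolic events with start/end times.
    events, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            events.append({
                "type": labels[start],
                "t_start": float(timestamps[start]),
                "t_end": float(timestamps[i]),
            })
            start = i
    return events

# Example: a steady fixation followed by a rapid 20-degree gaze sweep.
if __name__ == "__main__":
    t = np.arange(0, 0.5, 0.01)  # 100 Hz samples for 0.5 s
    yaw = np.concatenate([np.zeros(25), np.linspace(0, 20, 25)])
    pitch = np.zeros_like(yaw)
    for event in classify_gaze_events(yaw, pitch, t):
        print(event)
```

In the full system, the resulting symbolic events (rather than raw coordinates) are what the language model translates into textual gaze narrations.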

Analysis of downstream tasks shows broad applicability beyond motion generation. Across these tasks, the eye-body-coordinated narrations led to significant improvements. For action anticipation, where the goal is to predict the next action from the current context, the narrations boosted cosine similarity from 0.459 to 0.506 and Action F1 score from 0.226 to 0.248 across all activity types. In behavior summarization, which requires generating high-level summaries from motion sequences, the narrations increased cosine similarity from 0.480 to 0.537 and ROUGE-L from 0.150 to 0.229. Qualitative examples in the paper's Figure 3 illustrate how the narrations add contextual depth: for instance, a raw narration like "The human is walking on the path" becomes "The human, walking on the path, carefully inspected it... ensuring a safe and even footing," highlighting the enhanced descriptive fidelity.
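As a reference point for the text-similarity numbers above, the following sketch computes a cosine similarity between a predicted and a reference narration. The TF-IDF embedding is a stand-in assumption; the actual evaluation most likely uses a learned sentence encoder, and the example strings are adapted from the qualitative comparison above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def narration_similarity(predicted: str, reference: str) -> float:
    """Cosine similarity between two narrations in a shared TF-IDF space.

    Illustrative stand-in only: the evaluation described above likely embeds
    text with a learned sentence encoder rather than TF-IDF.
    """
    vectors = TfidfVectorizer().fit_transform([predicted, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(narration_similarity(
    "The human is walking on the path",
    "The human, walking on the path, carefully inspected it for even footing",
))
```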

The implications of this research are substantial for real-world AI systems. By providing a more holistic view of human behavior, GazeInterpreter can enhance technologies in areas such as proactive robotics, where anticipating human intentions is critical, or in virtual environments that demand more natural interactions. The system's ability to generate detailed, intent-rich narrations from multi-modal data opens new directions for human behavior understanding, potentially improving safety and efficiency in collaborative settings. However, the study acknowledges limitations, such as its reliance on the Nymeria dataset, which, while large, may not cover all real-world scenarios. Additionally, the self-correcting loop, though effective, carries computational costs and may not eliminate all inaccuracies in complex, high-level activities. Future work could explore scaling to more diverse datasets and refining the integration process for even broader applications.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn