Artificial intelligence systems have struggled to generate precise descriptions of human movements in videos, often producing vague or incorrect captions that limit their use in fields like healthcare, sports analysis, and robotics. A new study introduces a method that significantly improves this capability, enabling AI to describe fine-grained motions such as facial expressions and hand gestures with high accuracy, which could enhance applications from virtual assistants to automated video indexing.
The researchers developed the Motion-Augmented Caption Model (M-ACM), which reduces errors in motion captioning through a motion-aware decoding approach. The framework processes video inputs through two pathways: a standard visual pathway and a specialized motion pathway that uses ViTPose-based sampling and SMPL-based mesh recovery to highlight dynamic human movements. By comparing the outputs of both pathways, the model corrects inaccuracies such as misidentified body parts. For example, it can recognize that a hand, not a foot, is manipulating a basketball, as shown in Figure 1 of the paper.
To achieve this, the team employed a synergy-based decoding mechanism that calculates scores from both visual and motion representations and prunes tokens that show low agreement, which minimizes hallucinations. The method was implemented on foundation models like Qwen2 and fine-tuned with a two-layer MLP projector, keeping the vision encoder frozen to maintain reasoning capabilities. The experiments trained the model on the Insight Dataset and evaluated it with metrics such as BLEU and CIDEr; training ran on A100 GPUs over 35 hours.
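To make the pruning idea concrete, here is a minimal sketch of a single decoding step. It is not the paper's implementation: the agreement score (a geometric mean of the two pathways' token probabilities), the keep_ratio threshold, the function name synergy_step, and the toy three-word vocabulary are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def synergy_step(visual_logits, motion_logits, keep_ratio=0.2):
    """One hypothetical decoding step that prunes low-agreement tokens.

    Assumption: agreement is the geometric mean of the two pathways'
    probabilities; the actual scoring rule in the paper may differ.
    """
    p_visual = softmax(visual_logits)
    p_motion = softmax(motion_logits)
    agreement = [math.sqrt(pv * pm) for pv, pm in zip(p_visual, p_motion)]

    # Keep only tokens whose agreement is within keep_ratio of the best one.
    threshold = keep_ratio * max(agreement)
    kept = [i for i, a in enumerate(agreement) if a >= threshold]

    # Renormalize the surviving tokens and pick the most likely one.
    total = sum(agreement[i] for i in kept)
    probs = {i: agreement[i] / total for i in kept}
    best = max(probs, key=probs.get)
    return best, probs

# Toy example: the visual pathway leans toward "foot",
# the motion pathway strongly prefers "hand".
vocab = ["hand", "foot", "ball"]
visual_logits = [1.0, 2.0, 0.5]
motion_logits = [3.0, -2.0, 0.5]

token_id, probs = synergy_step(visual_logits, motion_logits)
print("chosen token:", vocab[token_id])
print("surviving tokens:", [vocab[i] for i in sorted(probs)])
```

In this toy case the visual pathway alone would pick "foot", but the motion pathway assigns it almost no probability, so it falls below the agreement threshold and is pruned before the final choice is made.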
Results from the paper demonstrate that M-ACM outperforms existing models, with a 3.7-fold improvement in BLEU-4 scores and a 1.5-fold gain in CIDEr compared to the best baseline. On the HMI-Bench benchmark, which assesses motion-focused captioning, the model showed a 40% improvement in understanding motivations and a 260% boost in detecting micro-expressions and emotions. Ablation studies confirmed that the motion-aware components contributed to a 38.6% increase in detail accuracy and a 35% rise in judgment precision, as detailed in Table 5 and Figure 2.
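The results above mix "fold" figures with percentage gains, which can be hard to compare at a glance. The short sketch below converts between the two, under the assumption that an "X-fold improvement" means the new score is X times the baseline and that an "N% boost" means the new score is the baseline plus N% of it; the article does not state which convention the paper uses, and the helper names are hypothetical.

```python
def percent_gain_to_fold(percent_gain):
    """A gain of N% over a baseline equals (1 + N/100) times the baseline."""
    return 1.0 + percent_gain / 100.0

def fold_to_percent_gain(fold):
    """A score that is X times the baseline is a gain of (X - 1) * 100 percent."""
    return (fold - 1.0) * 100.0

# Figures quoted in the article, converted under the stated assumption.
for label, pct in [("motivations", 40), ("micro-expressions and emotions", 260)]:
    print(f"{label}: +{pct}% is {percent_gain_to_fold(pct):.1f}x the baseline")

for label, fold in [("BLEU-4", 3.7), ("CIDEr", 1.5)]:
    print(f"{label}: {fold}x the baseline is a gain of {fold_to_percent_gain(fold):.0f}%")
```

Under this reading, the 260% boost on HMI-Bench and the 3.7-fold BLEU-4 improvement are gains of similar size (3.6 times and 3.7 times the baseline, respectively).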
This advancement matters because accurate motion captioning can improve real-world technologies, such as enhancing video search engines, aiding in physical therapy monitoring, and supporting safer human-robot interactions. By providing detailed descriptions of movements, the model bridges a gap in AI's ability to interpret nuanced human behaviors, making it more reliable for everyday applications.
Limitations noted in the study include the model's reliance on the Insight Dataset, which, though comprehensive, may not cover all real-world scenarios, and the computational overhead from dual-path processing, which increases inference time. Future work will focus on expanding dataset annotations in SMPL format to train the motion encoder further and refining the synergy mechanisms for broader motion types.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.