AIResearch
Hardware

AI Fails to Read Body Language in Real-World Tests

New benchmark reveals machines struggle with everyday nonverbal cues, lagging far behind humans.

AI Research
November 22, 2025
3 min read

Artificial intelligence systems are increasingly embedded in our daily lives, from virtual assistants to social robots, yet their ability to interpret the subtle nonverbal cues that underpin human social interaction remains critically underdeveloped. A new study introduces MOTION2MIND, a comprehensive framework for evaluating Theory of Mind (ToM) in AI, focusing specifically on the interpretation of nonverbal cues (NVCs) such as gestures, facial expressions, and vocal tones. The research highlights a significant performance gap between current AI models and humans, raising concerns about the readiness of these systems for meaningful human-AI interaction in real-world scenarios where understanding emotions, intentions, and social dynamics is paramount.

To address the limitations of existing ToM benchmarks, which primarily center on false-belief tasks and text-based reasoning, the researchers developed a three-stage framework comprising Detection, Knowledge, and Explanation. Detection involves identifying and labeling nonverbal cues from raw multimodal signals, such as video frames and audio, while Knowledge maps these cues to psychological meanings using an expert-curated body-language dictionary with 407 cues and 397 mind states. Explanation then combines contextual information to infer the underlying mental states. The MOTION2MIND dataset was built from 497 hours of diverse YouTube videos, including sitcoms, films, and reality shows, with clips sampled as 4-second segments annotated through a hybrid approach of automatic pipelines and manual human inspection to ensure accuracy and relevance.
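
The three-stage pipeline can be sketched in code. This is a minimal, hypothetical illustration: the cue names, mind states, and dictionary entries below are invented for demonstration and are not taken from the paper's actual 407-cue body-language dictionary, and the Detection stage is stubbed where a real system would run a vision-language model.

```python
# Hypothetical sketch of the Detection -> Knowledge -> Explanation stages.
# All cue names and mind states below are illustrative placeholders.

# Knowledge stage: an expert-curated mapping from nonverbal cues to
# candidate psychological meanings (the real dictionary has 407 cues
# and 397 mind states; these three entries are made up).
BODY_LANGUAGE_DICT = {
    "crossed_arms": ["defensiveness", "discomfort"],
    "raised_eyebrows": ["surprise", "skepticism"],
    "leaning_forward": ["interest", "engagement"],
}

def detect_cues(clip):
    """Detection stage: identify nonverbal cues in a 4-second clip.
    Stubbed here; a real system would run a vision-language model."""
    return clip.get("cues", [])

def explain(clip, context):
    """Explanation stage: combine detected cues with context to infer
    a mental state (a naive first-match heuristic for illustration)."""
    for cue in detect_cues(clip):
        meanings = BODY_LANGUAGE_DICT.get(cue)
        if meanings:  # 'invalid' cues map to no psychological meaning
            return {"cue": cue, "state": meanings[0], "context": context}
    return {"cue": None, "state": None, "context": context}

clip = {"cues": ["crossed_arms"]}
print(explain(clip, context="job interview")["state"])  # -> defensiveness
```

Note that cues absent from the dictionary yield no inferred state; the study found that real models often fail exactly here, assigning meaning to such 'invalid' cues.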

The evaluation of state-of-the-art vision-language models (VLMs) like GPT-4o, Qwen2.5-VL, and InternVL revealed stark deficiencies in NVC interpretation. In Detection tasks, where models had to identify nonverbal cues from video clips, even the best-performing models achieved only around 65% accuracy in multiple-choice formats, compared to over 80% for human experts. Explanation tasks, which required inferring psychological states from observed cues, showed an even wider gap, with models scoring as low as 45% accuracy versus 89% for humans. Notably, models exhibited a tendency toward overinterpretation, frequently assigning meaning to 'invalid' cues that lacked psychological significance, with smaller models showing higher rates of false positives and errors in contextual reasoning.
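
The accuracy figures above come from multiple-choice scoring, which can be expressed in a few lines. This is a generic sketch of that metric, assuming standard exact-match scoring; the answer data below is invented for illustration, not drawn from the benchmark.

```python
# Minimal multiple-choice accuracy scorer, as used to compare model
# predictions against gold answers. The example data is made up.

def accuracy(predictions, gold):
    """Fraction of items where the prediction matches the gold option."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

model_preds = ["A", "C", "B", "D", "A"]   # hypothetical model outputs
gold_answers = ["A", "B", "B", "D", "C"]  # hypothetical gold labels
print(f"{accuracy(model_preds, gold_answers):.0%}")  # -> 60%
```

Under this metric, a ~65% model score versus ~80% for human experts on Detection, and ~45% versus 89% on Explanation, is the gap the study reports.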

These findings have profound implications for the deployment of AI in socially intelligent applications, such as healthcare, education, and customer service, where misinterpreting body language could lead to misunderstandings or ethical breaches. The study underscores that while AI excels in structured knowledge tasks, it falters in the nuanced, context-dependent reasoning required for real-world social interactions. This gap suggests that current models may not be suitable for roles demanding high emotional intelligence, potentially limiting their effectiveness in collaborative environments and highlighting the need for more robust, human-like cognitive architectures in future AI development.

Despite its contributions, the research has limitations, including reliance on a single, Western-centric body-language dictionary that may not capture cultural variations in nonverbal cues. Additionally, the dataset, though diverse, sources videos from public platforms like YouTube, which could introduce biases in gesture interpretation across different societies. The authors caution that these factors might affect the generalizability of the findings and call for future work to incorporate multicultural perspectives and address privacy concerns, as advanced NVC interpretation technologies could be misused for surveillance or manipulation without proper safeguards.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn