Vision-language AI models can now identify and describe human activities in healthcare settings with surprising accuracy, even though they weren't specifically trained for this task. This breakthrough could transform remote patient monitoring by allowing a single AI system to both recognize activities and communicate naturally with healthcare providers.
Researchers discovered that general-purpose vision-language models (VLMs) achieve performance comparable to specialized activity recognition systems. In tests using the Toyota Smarthome dataset, which contains videos of daily activities like cooking, cleaning, and using phones, three VLMs demonstrated they could accurately describe what people were doing without being explicitly trained for activity recognition.
The team developed a systematic method to evaluate how well these AI models understand human activities. They created descriptive captions for 16,115 videos showing 18 elderly participants performing daily tasks, using GPT-4o to generate captions that closely matched the actual activities. The researchers then tested four different evaluation methods to measure how accurately the AI models could recognize and describe these activities.
Results showed that InternVL2-5 achieved 83.8% accuracy under cosine-similarity evaluation, while DeepSeek-VL2 reached 78.6%, both outperforming several specialized activity recognition models. Even Llama3.2-Vision, despite never being trained for activity recognition, achieved 67.4% accuracy in cross-subject evaluation, surpassing specialized models like AssembleNet++, LTN, and VPN++.
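The cosine-similarity evaluation can be sketched roughly like this: embed the model's generated caption and a reference sentence for each activity class, then count a prediction as correct when the most similar class matches the ground truth. The sketch below is illustrative only; the bag-of-words "embedding," the label texts, and the sample captions are stand-ins invented for this example, not the study's actual pipeline, which would use learned sentence embeddings.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector (word -> count); a stand-in for the
    # learned sentence embeddings a real evaluation would use.
    return Counter(text.lower().split())

def cosine(a, b):
    # Standard cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(caption, label_texts):
    # Assign the caption to the activity whose reference sentence
    # it is most similar to.
    return max(label_texts, key=lambda lab: cosine(embed(caption), embed(label_texts[lab])))

# Hypothetical reference sentences for three activity classes.
labels = {
    "cooking": "a person is cooking food in the kitchen",
    "cleaning": "a person is cleaning the room",
    "phone": "a person is using a phone",
}

# Hypothetical VLM captions paired with ground-truth activities.
predictions = {
    "a person is cooking a meal at the stove in the kitchen": "cooking",
    "an elderly person is using a mobile phone": "phone",
}

correct = sum(classify(cap, labels) == gt for cap, gt in predictions.items())
accuracy = correct / len(predictions)
print(accuracy)  # → 1.0 on this tiny toy set
```

Reported accuracies in the study are this fraction of captions whose closest class matches the ground-truth activity, computed over the full test split rather than a two-example toy set.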
This matters because current healthcare monitoring systems typically require separate AI components for activity recognition and natural language interaction. Using a single VLM for both functions could reduce computational demands and create more efficient remote monitoring systems. For elderly patients or those with chronic conditions, this means more natural interactions with monitoring systems while maintaining privacy since the AI doesn't need to store raw video data.
The study did identify limitations. Some evaluation methods proved unreliable: BERTScore sometimes rated incorrect descriptions as accurate, and VLM-as-Judge evaluation scored lower than expected. The models also struggled in cross-view evaluations, where training and testing used different camera angles; traditional models like π-ViT performed better in these scenarios. Additionally, some VLMs generated overly detailed descriptions that lowered their scores under semantic-similarity evaluation.
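The verbosity penalty is easy to see with a toy similarity measure: a caption padded with extra scene detail shares proportionally fewer tokens with a concise reference sentence, so its similarity score drops even when it is factually correct. Everything below (the captions and the bag-of-words "embedding") is an illustrative sketch, not the study's actual metric.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a stand-in for real sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = "a person is drinking from a cup"
concise = "a person is drinking from a cup"
verbose = ("an elderly man wearing a blue sweater slowly lifts a white "
           "ceramic cup from the wooden table and drinks while watching television")

score_concise = cosine(embed(reference), embed(concise))
score_verbose = cosine(embed(reference), embed(verbose))
print(round(score_concise, 2), round(score_verbose, 2))  # → 1.0 0.41
```

The verbose caption describes the same activity in more detail, yet its similarity to the reference drops sharply, which mirrors why overly detailed VLM outputs hurt accuracy under semantic-similarity scoring.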
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.