AI Agents Still Struggle to Work With Humans

TL;DR

A new dataset shows where AI assistants fail in real tasks, revealing key gaps in timing and coordination that limit real-world use.

Artificial intelligence systems are increasingly designed to assist people in physical tasks, from assembling gadgets to mixing drinks, but getting them to collaborate seamlessly with people remains a major hurdle. A new study introduces the SIGMACOLLAB dataset, which captures 85 sessions in which untrained users interacted with a mixed-reality AI agent to perform real-world activities. The resource highlights persistent challenges in human-AI teamwork, such as miscommunication and timing issues, which could slow the adoption of AI helpers in everyday settings.

The researchers found that the AI agent frequently struggled with real-time coordination during collaborative tasks. In the study, participants wore a headset running the SIGMA system, which provided step-by-step guidance through activities like building a notebook or preparing a mocktail. The AI relied on speech recognition and large language models to respond to user queries, but errors in detecting self-talk (when users talk to themselves rather than to the system) and delays in processing often disrupted the flow. For instance, the system misclassified 20.2% of user utterances on average, leading to inappropriate or missed responses that hindered task progress.
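To make the "on average" in the misclassification figure concrete, here is a minimal sketch of one way such a number could be computed: a per-session error rate, then a mean across sessions. The article does not say how the paper aggregates its figure, so the averaging scheme, the label names ("system" and "self"), the function names, and the toy data below are all assumptions for illustration, not the authors' method or the dataset's format.

```python
from statistics import mean

def session_error_rate(labels):
    """Fraction of utterances in one session whose predicted label is wrong.

    `labels` is a list of (predicted, actual) pairs. The labels used here,
    "system" (addressed to the agent) and "self" (self-talk), are hypothetical.
    Returns None for a session with no utterances.
    """
    if not labels:
        return None
    wrong = sum(1 for predicted, actual in labels if predicted != actual)
    return wrong / len(labels)

def mean_error_rate(sessions):
    """Average the per-session error rates, skipping empty sessions."""
    rates = [r for r in (session_error_rate(s) for s in sessions) if r is not None]
    return mean(rates) if rates else None

# Made-up toy data: two sessions, not taken from SIGMACOLLAB.
sessions = [
    [("system", "system"), ("self", "system"), ("self", "self"), ("system", "system")],
    [("system", "self"), ("system", "system")],
]
print(mean_error_rate(sessions))  # 0.375 (mean of 1/4 and 1/2)
```

Averaging per session weights every session equally, so a short session with a few errors counts as much as a long one. Pooling all utterances before dividing would give a different number, which is why the aggregation choice matters when comparing systems.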

To gather data, the team configured SIGMA to capture multimodal streams, including egocentric video, audio, and tracking of head and hand movements. Participants attempted eight different tasks, such as replacing a computer hard drive or making a pin-back button, while the system recorded their interactions. The setup involved HoloLens headsets and grayscale cameras to better monitor hand movements, with data streams processed in real time on a desktop server. This approach ensured that the dataset reflects realistic scenarios where AI must interpret ambiguous commands and dynamic environments.

Analysis of the results shows significant variability in task success and user behavior. Out of 85 sessions, only 75% were completed correctly, with failures often due to system errors or user abandonment. The dataset includes over 3,200 user utterances and detailed gaze tracking, revealing that participants frequently tested the AI's limits, for example by asking fragmented questions like 'Is it this one?' instead of full sentences. These patterns underscore the difficulty AI faces in understanding context and intent without explicit cues, a challenge not fully addressed in non-interactive datasets.

The implications extend to applications in homes, factories, and healthcare, where reliable AI assistance could improve efficiency and safety. However, the inability of current systems to handle spontaneous dialogue or coordinate actions in real time limits their practicality. The dataset's release aims to spur research into better models for proactive intervention and cognitive state detection, which are crucial for fluid collaboration.

Limitations noted in the paper include the dataset's small size (approximately 14 hours of data) and its focus on controlled lab settings, which may not capture all real-world complexities. Additionally, the AI's performance varied across different model deployments, with latency issues affecting responsiveness. Future work will use SIGMACOLLAB to establish benchmarks for improving interaction-related competencies in AI systems.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
