Imagine watching a video clip and accurately guessing what happens next: a skill that comes naturally to humans but has long eluded artificial intelligence. Researchers from the University of North Carolina at Chapel Hill have developed a system that tackles this challenge, predicting future events in videos by combining visual cues, dialogue, and commonsense knowledge. This work, detailed in their paper 'What is More Likely to Happen Next? Video-and-Language Future Event Prediction,' introduces the Video-and-Language Event Prediction (VLEP) dataset and a transformer-based model, marking a step toward AI that can reason about everyday scenarios the way humans do. For non-technical readers, this matters because it could make technologies such as autonomous systems, video analysis tools, and interactive AI assistants more intuitive and responsive to real-world dynamics.
The key finding is that AI can now predict which of two possible future events is more likely to occur after a video clip, though it still falls short of human performance. The researchers discovered that incorporating multiple types of information—specifically video content, dialogue, and commonsense knowledge—significantly improves the AI's accuracy. For instance, when the model uses only future event descriptions without any context, it achieves an accuracy of 58.09%, barely above random chance. Adding dialogue boosts this to 66.63%, and combining video with dialogue and future events reaches 67.46%. In comparison, humans achieve 90.50% accuracy when they have access to both video and dialogue, highlighting a substantial gap that underscores the complexity of this task.
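To make these numbers concrete: the task is a two-way multiple choice, so a model scores both candidate futures for each clip and picks the higher-scoring one, and accuracy is simply the fraction of clips where that pick matches the human label. Here is a minimal Python sketch of that evaluation loop; the function names and scores are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the two-choice evaluation protocol.
# Scores, names, and labels are illustrative, not the authors' actual code.

def choose_more_likely(score_a: float, score_b: float) -> int:
    """Return 0 if candidate future A scores higher, else 1."""
    return 0 if score_a >= score_b else 1

# Toy model scores for three clips, each with two candidate futures.
model_scores = [(0.81, 0.32), (0.20, 0.64), (0.55, 0.58)]
human_labels = [0, 1, 0]  # index of the event annotators judged more likely

predictions = [choose_more_likely(a, b) for a, b in model_scores]
accuracy = sum(p == y for p, y in zip(predictions, human_labels)) / len(human_labels)
print(f"accuracy: {accuracy:.2%}")  # 66.67% on this toy set
```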
To accomplish this, the team built a new dataset and a machine learning model. The VLEP dataset includes 28,726 prediction examples drawn from 10,234 video clips sourced from TV shows and YouTube lifestyle vlogs. These clips, averaging 15.2 seconds in length, cover diverse scenarios such as sitcoms, medical dramas, crime series, travel vlogs, and family daily life. Each example consists of a premise (a short video clip with dialogue), a summary of the premise event, and two potential future events, one more likely and one less likely to happen, along with rationales written by human annotators. To keep the dataset challenging and minimize biases, the researchers used a human-and-model-in-the-loop procedure: adversarial collection, where annotators tried to fool an AI model by writing less obvious future events, and adversarial matching, which paired events from different clips to create hard negatives. For example, in one clip a detective might be shown with evidence, and the AI must infer that handing it over is more likely than an unrelated action.
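For readers who like to see the structure, here is a hedged sketch of what a single VLEP example might look like as a Python record; the field names and sample values are invented for illustration and are not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one VLEP prediction example.
# Field names and values are hypothetical, not the released schema.
@dataclass
class VLEPExample:
    clip_id: str          # identifier of the premise video clip
    dialogue: str         # subtitle/dialogue text accompanying the clip
    premise_summary: str  # annotator-written summary of the premise event
    event_a: str          # first candidate future event
    event_b: str          # second candidate future event
    label: int            # 0 if event_a is more likely, 1 if event_b
    rationale: str        # annotator's reasoning for the choice

example = VLEPExample(
    clip_id="crime_show_clip_0042",
    dialogue="Detective: I found this at the scene.",
    premise_summary="A detective shows a colleague a piece of evidence.",
    event_a="The detective hands the evidence to her colleague.",
    event_b="The detective throws the evidence out the window.",
    label=0,
    rationale="Handing over evidence is the natural next step in an investigation.",
)
```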
The model itself is a transformer-based system that encodes video, dialogue, and commonsense knowledge. Video encoding captures appearance with a pre-trained ResNet-152 and motion with a pre-trained ResNeXt-101, while text encoding processes dialogue and candidate future events with RoBERTa, a language model further fine-tuned on commonsense data from ATOMIC, a knowledge base of everyday if-then inferences. These encodings are combined in a multimodal transformer to make predictions. Results show that this approach outperforms simpler variants; for instance, adding commonsense knowledge improved accuracy from 66.96% to 67.46%. The analysis also revealed that the adversarial collection procedure reduced biases, such as the overuse of negation words in less-likely events, making the dataset more robust. In qualitative examples, the model correctly predicted events requiring an understanding of intentions or reactions, such as a character taking a phone based on dialogue cues, but it struggled in cases demanding deeper commonsense reasoning, such as inferring that food isn't ready yet in a cooking scene.
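To give a feel for how such a fusion model might be wired together, here is a hedged PyTorch sketch: video features and text embeddings are projected into a shared space, fused by a transformer encoder, and pooled into a single plausibility score. The layer sizes, pooling strategy, and scoring head are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalFuturePredictor(nn.Module):
    """Hedged sketch: fuse video and text features with a transformer encoder,
    then produce one plausibility score per (clip, candidate-future) pair."""

    def __init__(self, video_dim=2048, text_dim=768, hidden=768):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)  # project ResNet/ResNeXt features
        self.text_proj = nn.Linear(text_dim, hidden)    # project RoBERTa token embeddings
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(hidden, 1)              # single score per candidate

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, n_frames, video_dim); text_feats: (batch, n_tokens, text_dim)
        tokens = torch.cat([self.video_proj(video_feats), self.text_proj(text_feats)], dim=1)
        fused = self.fusion(tokens)                         # joint video+text representation
        return self.scorer(fused.mean(dim=1)).squeeze(-1)   # mean-pool, then score

model = MultimodalFuturePredictor()
video = torch.randn(2, 10, 2048)  # 2 clips x 10 frames of appearance/motion features
text = torch.randn(2, 30, 768)    # encoded dialogue plus one candidate future event
scores = model(video, text)       # run once per candidate; pick the higher score
```

In the two-choice setting, the model would be run once per candidate future, with each candidate appended to the dialogue before encoding, and the higher-scoring event chosen; training would then push the score of the more likely event above that of the less likely one.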
In terms of real-world context, this research has implications for applications where anticipating events is crucial, such as in security surveillance, autonomous driving, or personalized content recommendations. By improving AI's ability to reason about future actions, it could lead to systems that better understand human behavior in videos, enhancing safety and user experience. For example, an AI in a smart home might predict a person's next move to prevent accidents, or a video analysis tool could flag potential incidents in real-time. The use of diverse video sources, including unscripted vlogs, makes the findings relevant to everyday situations, moving beyond controlled environments.
However, the study acknowledges limitations, primarily the performance gap between AI and humans. The model's 67.46% accuracy compared to humans' 90.50% indicates that significant challenges remain, particularly in handling complex commonsense reasoning and subtle contextual cues. The paper notes that the dataset, while large and varied, may not cover all possible scenarios, and the model's reliance on pre-trained components could introduce biases from their training data. Future work is needed to address these issues, such as developing better ways to integrate video and dialogue or expanding the commonsense knowledge base to close the human-AI gap.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn