AI Video Generators Struggle with Long Stories

As artificial intelligence transforms content creation, a new benchmark reveals that AI models excel at producing short video clips but falter when generating longer, more complex narratives. This gap highlights a critical challenge for applications in entertainment, education, and communication, where coherent, extended videos are essential.

Researchers from the Harbin Institute of Technology and the University of Hong Kong developed LoCoT2V-Bench, a comprehensive evaluation tool for long-form text-to-video generation. They found that while current AI models achieve high scores in basic visual quality and short-term consistency, they perform poorly in maintaining long-range coherence and adhering to high-level narrative elements like emotional flow and character development. For instance, models such as MEVG and VGoT showed strengths in static quality but scored low in event-level alignment and transition smoothness, as detailed in the paper's results.

The methodology involved collecting 240 real-world videos from YouTube, spanning themes like nature exploration and virtual entertainment, to create realistic prompts averaging over 200 words. These prompts included complex elements such as scene transitions and dynamic actions. The team then used multimodal large language models (MLLMs) to assess generated videos across multiple dimensions, including text-video alignment, temporal quality, content clarity, and a novel metric called Human Expectation Realization Degree (HERD), which evaluates abstract attributes like narrative flow and emotional impact.
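To make the multi-dimensional scoring concrete, here is a minimal sketch of how per-video scores along the four dimensions named above might be averaged per model. It is an illustration only, not the benchmark's actual code: the `VideoScores` record, the `summarize` helper, the dimension keys, and the 0-to-1 score scale are all assumptions made for this example, and the paper may aggregate its results differently.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names follow the article; the key spellings and the
# 0-to-1 score scale are assumptions made for this sketch.
DIMENSIONS = (
    "text_video_alignment",
    "temporal_quality",
    "content_clarity",
    "herd",
)


@dataclass
class VideoScores:
    """Hypothetical record: one generated video's score on each dimension."""
    model: str
    prompt_id: str
    scores: dict[str, float]


def summarize(results: list[VideoScores]) -> dict[str, dict[str, float]]:
    """Return the mean score per dimension for each model."""
    by_model: dict[str, list[VideoScores]] = {}
    for record in results:
        # Refuse records that skip a dimension, so averages stay comparable.
        missing = set(DIMENSIONS) - set(record.scores)
        if missing:
            raise ValueError(f"{record.prompt_id}: missing {sorted(missing)}")
        by_model.setdefault(record.model, []).append(record)
    return {
        model: {dim: mean(r.scores[dim] for r in rows) for dim in DIMENSIONS}
        for model, rows in by_model.items()
    }
```

Reporting each dimension separately, rather than collapsing everything into one number, keeps a strong visual-quality score from hiding a weak narrative score, which is the kind of gap the benchmark is designed to expose.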

Analysis of the results, presented in tables and figures within the paper, indicates that models struggle most with more complex prompts, particularly in semantic and structural aspects. For example, as prompt complexity increased, performance in text-video alignment dropped significantly, suggesting that AI systems have difficulty interpreting intricate instructions. The study also noted that static video quality, including aesthetic and technical aspects, showed no strong bias in evaluations, yet it did not compensate for deficiencies in coherence and adherence to human expectations.

This research matters because it underscores the limitations of current AI in real-world scenarios where long videos are needed, such as filmmaking, educational content, or virtual simulations. Improved models could enhance how stories are told and information is conveyed, making AI-generated videos more reliable and engaging for broad audiences. For now, the paper cautions, existing methods still struggle with long-term consistency and high-level narrative fidelity, which points to areas for future innovation.

Limitations of the study, as outlined by the researchers, include the reliance on specific benchmark datasets and the current inability of models to fully capture human-like storytelling nuances. The entanglement between event-level alignment and temporal consistency also remains an open issue, indicating that more work is needed to disentangle these factors for accurate assessment.

About the Author

Guilherme A.