
AI Video Models Struggle with Real Reasoning

A new benchmark reveals that today's top video generation systems often fail at basic logic, physics, and spatial tasks, despite their impressive visual quality.

AI Research
March 27, 2026
4 min read

Recent video generation models like Sora-2, Veo-3.1, and Kling-2.5 have dazzled with their ability to create realistic scenes, but a new study shows they often lack genuine reasoning skills. Researchers have developed V-ReasonBench, a comprehensive benchmark that systematically tests these models across four core reasoning dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The results reveal significant gaps between visual fidelity and true cognitive understanding, highlighting that these systems can produce beautiful videos while failing at basic logical tasks. This disconnect raises important questions about the path toward AI that can truly reason about the world, not just depict it.

V-ReasonBench evaluates models using a paradigm called Chain-of-Frame, which treats video generation as a sequence of reasoning steps. In this approach, a model receives an initial image and a prompt, then produces a series of frames where the final frame represents its answer. The researchers assess only this last frame, making evaluation scalable and unambiguous. They built the benchmark from 326 reasoning instances, represented by 652 image pairs, covering tasks like solving Sudoku, predicting physical motions, completing visual patterns, and understanding spatial relationships. Each model generates five videos per prompt, and performance is measured with a pass@5 metric that counts an instance as solved if any of the five final frames correctly solves the task; across all models, 9,780 videos were analyzed.
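To make the metric concrete, here is a minimal sketch of a pass@5 check in Python. The `solves_task` verifier is a hypothetical stand-in for the benchmark's task-specific scorers, not its actual API:

```python
def pass_at_5(final_frames, solves_task):
    """Return True if any of the five sampled videos ends in a correct frame.

    final_frames: the last frame from each of the 5 generated videos.
    solves_task:  a task-specific verifier (mask-, grid-, or VLM-based
                  in the paper); any callable works for this sketch.
    """
    return any(solves_task(frame) for frame in final_frames)

# Toy usage: the instance passes because one of the five attempts is correct.
result = pass_at_5(
    final_frames=["wrong", "wrong", "right", "wrong", "wrong"],
    solves_task=lambda frame: frame == "right",
)
print(result)  # True
```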

The results show clear strengths and weaknesses across models. Sora-2 leads in structured problem-solving with a score of 72.00, spatial cognition at 36.76, and pattern-based inference at 40.00, but drops to 26.67 in physical dynamics. In contrast, Hailuo-02 and Vidu-Q2 achieve the highest scores in physical dynamics at 36.67 each, suggesting that some models prioritize visual coherence over underlying physical principles while others capture dynamics better. Overall, average scores range from 10.68 for Seedance-1.0-Lite to 43.86 for Sora-2, demonstrating that no model excels uniformly across all reasoning types. These dimension-wise differences suggest that current video models capture different aspects of reasoning rather than integrating them holistically.
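The reported overall scores are consistent with a simple unweighted mean of the four dimension scores. A quick check of Sora-2's figures (the equal weighting is an assumption, but it matches the numbers above):

```python
# Per-dimension pass@5 scores reported for Sora-2 (percent).
sora2 = {
    "structured_problem_solving": 72.00,
    "spatial_cognition": 36.76,
    "pattern_based_inference": 40.00,
    "physical_dynamics": 26.67,
}

# An unweighted mean over the four dimensions reproduces the
# reported overall score.
overall = sum(sora2.values()) / len(sora2)
print(f"{overall:.2f}")  # 43.86
```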

Beyond overall scores, the study uncovers specific failure modes. Some models, like Seedance-1.0-Lite, tend to add decorative patterns or alter scenes in tasks requiring strict structural accuracy, such as visual symmetry or Tic-Tac-Toe, which reduces their pass rates. This creative bias likely stems from training on open-domain video data that values visual richness over precision. Additionally, increasing video duration does not consistently improve reasoning; longer sequences often introduce irrelevant content or hallucinations, where intermediate frames show unrealistic transitions even if the final frame is correct. For example, in maze-solving tasks, models might depict a mouse passing through walls to reach cheese, violating causal consistency.

The benchmark also compares video models with image-based systems, revealing that temporal modeling offers advantages in physical and procedural reasoning. Video models like Veo-3.1 use the Chain-of-Frame process to simulate intermediate states, improving performance on tasks like block sliding or code execution. However, they still struggle with hallucinations, where correct endpoints mask flawed reasoning processes. Human-alignment validation shows the evaluation pipeline achieves 97.09% agreement with human judgments, supporting its reliability. These insights emphasize the need for models that balance visual generation with robust reasoning, moving beyond aesthetic completion to structure-preserving accuracy.
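The 97.09% figure is a simple agreement rate. A sketch of how such a check might be computed, assuming boolean pass/fail verdicts from both the pipeline and human raters:

```python
def agreement_rate(auto_verdicts, human_verdicts):
    """Percentage of instances where the automated pass/fail verdict
    matches the human judgment (the study reports 97.09%)."""
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return 100.0 * matches / len(auto_verdicts)

# Toy example with four instances, one disagreement.
print(agreement_rate([True, True, False, True],
                     [True, False, False, True]))  # 75.0
```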

V-ReasonBench has limitations, including its focus on last-frame evaluation, which may miss errors in intermediate reasoning steps. The benchmark relies on hybrid scoring strategies (mask-based, grid-based, and VLM-based evaluation, sketched below), but VLMs themselves can struggle with fine-grained visual details, as shown in failure cases like sequence-completion tasks. Future work could explore more nuanced assessment of reasoning processes and expand task diversity. Despite these constraints, V-ReasonBench provides a reproducible foundation for advancing video reasoning, highlighting that current models are far from human-aligned cognitive capabilities. As AI video generation progresses, benchmarks like this will be crucial for steering development toward true understanding, not just impressive visuals.
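For readers curious how the three scoring routes might fit together, here is a minimal, self-contained sketch of a per-task dispatcher. Every name below is an illustrative placeholder, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str         # "mask", "grid", or "vlm"
    expected: object  # ground-truth answer for the task

# Illustrative scorers operating on a parsed final frame (a dict here).
def mask_based_score(task, frame):
    # Compare a pixel-region mask of the final frame to the target mask.
    return frame.get("mask") == task.expected

def grid_based_score(task, frame):
    # Parse a board state (e.g. Sudoku, Tic-Tac-Toe) and compare cells.
    return frame.get("grid") == task.expected

def vlm_based_score(task, frame):
    # Delegate open-ended judgments to a vision-language model
    # (stubbed here as a precomputed verdict).
    return frame.get("vlm_verdict") == task.expected

SCORERS = {
    "mask": mask_based_score,
    "grid": grid_based_score,
    "vlm": vlm_based_score,
}

def score_final_frame(task, frame):
    """Route each task to the scoring strategy matching its kind."""
    return SCORERS[task.kind](task, frame)

# Toy usage: a grid task whose final frame matches the expected board.
task = Task(kind="grid", expected=[[1, 2], [2, 1]])
print(score_final_frame(task, {"grid": [[1, 2], [2, 1]]}))  # True
```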

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn