Imagine trying to learn how to tie a Windsor knot from a text description alone. The words might be accurate, but they often fall short in conveying the precise motions and spatial arrangements needed. Now, researchers have developed an AI system that can watch a video of your current progress and generate a custom video showing exactly what to do next. This breakthrough, called Video-Next-Event Prediction (VNEP), moves beyond telling to showing, offering a more intuitive way to guide users through procedural tasks or predict future events in dynamic scenes.
The key finding from this research is that AI can now generate video answers that are both semantically accurate and visually coherent. The system, named VANS, takes an input video and a question, such as "How to proceed with making my windmill?" or "What will she most likely do next?", and produces a short video demonstrating the predicted next event. For example, in one case shown in the paper, the AI watches a video of someone wrapping a samosa and generates a clip showing the next step: pinching the edges to seal it. This approach outperforms existing methods that rely on text-only answers or simple video continuation, as demonstrated in Figure 2, where a video answer for tying a Windsor knot provides clearer guidance than a text description.
The methodology behind VANS involves a clever integration of two specialized AI models: a Vision-Language Model (VLM) for reasoning and a Video Diffusion Model (VDM) for generation. The VLM analyzes the input video and question to produce a textual caption describing the next event, while the VDM uses this caption along with visual cues from the input to generate the corresponding video. However, simply chaining these models together often leads to misalignment: the caption might be linguistically correct but visually unrealistic. To solve this, the researchers introduced Joint-GRPO, a reinforcement learning strategy that coordinates both models using a shared reward. This two-stage process first tunes the VLM to produce captions that are friendly for visualization, then adapts the VDM to generate videos faithful to those captions and the input context, as detailed in Figure 5.
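To make the flow concrete, here is a minimal sketch of the VLM-to-VDM handoff described above. The class and function names (NextEventCaptioner, NextEventVideoGenerator, answer_with_video) are hypothetical stand-ins rather than the authors' actual API; they only illustrate the shape of a pipeline in which a next-event caption is produced first and then conditions video generation.

```python
# Minimal sketch of a VANS-style inference flow (hypothetical interfaces).
from dataclasses import dataclass
from typing import List

@dataclass
class VideoClip:
    frames: List[str]  # placeholder: frame identifiers or file paths

class NextEventCaptioner:
    """Stands in for the VLM that reasons over the input video and question."""
    def caption_next_event(self, clip: VideoClip, question: str) -> str:
        # A real system would run a vision-language model here.
        return "Pinch the edges of the samosa to seal the filling."

class NextEventVideoGenerator:
    """Stands in for the VDM conditioned on the caption and visual context."""
    def generate(self, clip: VideoClip, caption: str) -> VideoClip:
        # A real system would run a video diffusion model here.
        return VideoClip(frames=[f"generated_{i}.png" for i in range(16)])

def answer_with_video(clip: VideoClip, question: str) -> VideoClip:
    vlm = NextEventCaptioner()
    vdm = NextEventVideoGenerator()
    caption = vlm.caption_next_event(clip, question)  # textual next-event description
    return vdm.generate(clip, caption)                # video grounded in caption + input frames

print(answer_with_video(VideoClip(frames=["input_0.png"]),
                        "What is the next step?").frames[:3])
```

Keeping the caption as an explicit intermediate output is what lets Joint-GRPO reward the VLM for producing descriptions that are both correct and easy to visualize before the diffusion model is tuned to follow them faithfully.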
Experiments on procedural and predictive benchmarks show that VANS achieves state-of-the-art performance. According to Table 1, VANS with Joint-GRPO scores a ROUGE-L of 0.3631 on procedural tasks, significantly higher than baseline models like Gemini-FilmWeaver at 0.2802. Visual metrics also improve, with CLIP-Video Score rising to 0.8021, indicating better visual quality and alignment. Qualitative comparisons in Figure 6 highlight how VANS avoids errors common in other methods, such as misinterpreting events or producing visually inconsistent videos. For instance, in a predictive scenario, VANS correctly generates a video of a man retaliating after being slapped, while baselines like Omni-Video misinterpret the action or alter character appearances.
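For readers unfamiliar with the caption metric, the sketch below shows the standard LCS-based ROUGE-L F-score on which numbers like 0.3631 are computed. This is the textbook formulation, not the paper's exact evaluation code, and the beta weighting is an assumption.

```python
# Hedged sketch of ROUGE-L: F-score built on the longest common subsequence.

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score; beta > 1 weights recall more heavily (assumed value)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0 or not cand or not ref:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("pinch the edges to seal the samosa",
              "pinch the samosa edges to seal it"))
```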
The implications of this technology are substantial for everyday applications. By providing video-based instructions, it could enhance learning platforms, assist in DIY projects, or even support creative storytelling. The ability to generate customized demonstrations based on a user's current state, like showing the next step in a recipe tailored to how the food looks, makes it more practical than generic tutorials. Moreover, the model's flexibility allows for multi-future prediction, as shown in Figure 11, where it can generate different plausible outcomes from the same input video depending on the question, such as predicting a realistic cough or a dramatic smoke effect in a movie scenario.
Despite its advancements, the research acknowledges limitations. The model relies on a curated dataset, VANS-Data-100K, which includes 100,000 video-question-answer triplets but may not cover all real-world scenarios. As noted in the paper, existing datasets were unsuitable due to suboptimal video quality or lack of diverse questions, so the team had to create their own through a multi-stage curation process involving shot splitting and quality filtering. Additionally, while Joint-GRPO improves alignment, challenges remain in handling extremely complex or ambiguous events, and the inference time (around 4 seconds for caption generation and 35 seconds for video generation) may need optimization for real-time use. Future work could focus on expanding the dataset and refining the model's efficiency to broaden its applicability.
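A minimal sketch of what such a multi-stage curation pass could look like, assuming hypothetical helpers split_into_shots and passes_quality_check; the real VANS-Data-100K pipeline is only summarized in the paper, so this illustrates the shot-splitting and quality-filtering idea rather than reproducing it.

```python
# Illustrative curation filter: split source videos into shots, then keep
# only shots that clear a simple quality check. Helper names are hypothetical.
from typing import List, Tuple

def split_into_shots(video_path: str) -> List[Tuple[float, float]]:
    # Stand-in for a shot-boundary detector; returns (start, end) timestamps.
    return [(0.0, 4.0), (4.0, 9.5)]

def passes_quality_check(video_path: str, shot: Tuple[float, float]) -> bool:
    # Stand-in for filters such as resolution, blur, or aesthetic scoring.
    start, end = shot
    return (end - start) >= 2.0  # e.g., drop very short shots

def curate(video_paths: List[str]) -> List[Tuple[str, Tuple[float, float]]]:
    kept = []
    for path in video_paths:
        for shot in split_into_shots(path):
            if passes_quality_check(path, shot):
                kept.append((path, shot))
    return kept

print(curate(["cooking_demo.mp4"]))
```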
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn