Robots that can understand human instructions and perform complex tasks have long been a goal of artificial intelligence research, but teaching them to act reliably in diverse environments remains a significant challenge. A new AI model named Mantis, developed by researchers from multiple universities and Bosch, offers a promising solution by enabling robots to 'imagine' future visual outcomes before taking action. This approach not only enhances performance but also preserves the AI's ability to comprehend language and reason, addressing a common pitfall where robot-specific training degrades these critical skills. The model's success, demonstrated in both simulated benchmarks and real-world experiments, suggests a more efficient path toward versatile robotic assistants that can generalize beyond their training.
Mantis achieves its capabilities through a novel technique called Disentangled Visual Foresight (DVF), which separates the task of predicting future visual scenes from the core action-learning process. Instead of forcing the AI to directly generate high-dimensional future images—a computationally expensive and distracting task—Mantis uses a specialized component called a Diffusion Transformer head to handle this prediction. This component receives the current visual state through a residual connection, allowing it to focus on capturing the subtle changes between frames that represent latent actions, such as the motion of a robotic arm. These latent actions then guide the model's explicit action predictions, making the learning process more targeted and efficient. The backbone of the model, based on the Qwen2.5-VL vision-language model, remains free to maintain its language understanding and reasoning abilities through continued language supervision during training.
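To make the disentanglement concrete, the following is a minimal PyTorch sketch of the idea, not the authors' code: dimensions, module names, and the simplified heads are illustrative stand-ins (the real Mantis pairs a Qwen2.5-VL backbone with a Diffusion Transformer head). The point it shows is the residual connection: the foresight head only has to model the change between the current and future frame, and that change acts as a latent action that conditions the explicit action prediction.

```python
# Hedged sketch of Disentangled Visual Foresight (illustrative, not the paper's implementation).
import torch
import torch.nn as nn


class ForesightHead(nn.Module):
    """Stand-in for the Diffusion Transformer head: predicts future visual features.
    The residual connection from the current visual features means the head only
    models the frame-to-frame *change* (the latent action)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.delta = nn.Sequential(  # models "what changes" between frames
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, backbone_feat: torch.Tensor, current_visual: torch.Tensor):
        latent_action = self.delta(backbone_feat)          # inferred change
        predicted_future = current_visual + latent_action  # residual: current + change
        return predicted_future, latent_action


class ActionHead(nn.Module):
    """Maps the latent action, plus backbone context, to explicit robot actions."""

    def __init__(self, dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, backbone_feat: torch.Tensor, latent_action: torch.Tensor):
        return self.mlp(torch.cat([backbone_feat, latent_action], dim=-1))


# Toy usage with random tensors standing in for VLM hidden states and image features.
backbone_feat = torch.randn(1, 512)   # hypothetical pooled backbone output
current_visual = torch.randn(1, 512)  # hypothetical current-frame visual features

foresight, policy = ForesightHead(), ActionHead()
future_feat, latent_action = foresight(backbone_feat, current_visual)
action = policy(backbone_feat, latent_action)  # e.g. 7-DoF arm pose + gripper command
```

Because the foresight head absorbs the burden of predicting pixels-level futures, the backbone is free to keep answering language-supervised objectives, which is how Mantis avoids the reasoning degradation seen in many robot-tuned VLMs.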
The researchers trained Mantis using a progressive three-stage recipe that incrementally incorporated different types of data to avoid overwhelming the model. First, it learned from 220,000 human manipulation videos to predict future frames at multiple time gaps, helping it infer latent actions from visual dynamics. Next, it was trained on 76,000 robot demonstrations to align visual predictions with actual actions. Finally, language supervision was added using 38 multimodal datasets to preserve comprehension skills. This staged approach ensured stable optimization and effective fusion of vision, language, and action modalities. For inference, the team introduced an Adaptive Temporal Ensemble strategy that dynamically adjusts computational effort based on the need for motion stability, reducing inference counts by up to 50% without sacrificing performance.
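The paper does not spell out the exact adaptivity rule, so the sketch below is an assumption-laden illustration of how an adaptive temporal ensemble can cut inference calls: the policy predicts a chunk of future actions, and while the remaining cached actions look stable (here, a simple variance threshold stands in for the paper's criterion), the robot keeps executing them instead of re-running the model. Averaging of overlapping chunks is omitted for brevity.

```python
# Hedged sketch of an adaptive temporal-ensemble control loop (illustrative only).
import numpy as np


def predict_chunk(observation: np.ndarray, horizon: int = 8) -> np.ndarray:
    """Placeholder for the Mantis policy: returns `horizon` 7-DoF actions."""
    rng = np.random.default_rng(int(observation.sum()) % 2**32)
    return rng.normal(size=(horizon, 7))


def run_episode(num_steps: int = 40, stability_threshold: float = 1.5) -> None:
    cached_chunk, cursor, inference_calls = None, 0, 0
    for step in range(num_steps):
        observation = np.full(16, step, dtype=np.float64)  # stand-in observation
        remaining = 0 if cached_chunk is None else len(cached_chunk) - cursor
        # Hypothetical rule: re-run the model only when the cache is exhausted
        # or the remaining actions vary a lot (motion not judged "stable").
        if remaining == 0 or np.var(cached_chunk[cursor:]) > stability_threshold:
            cached_chunk, cursor = predict_chunk(observation), 0
            inference_calls += 1
        action = cached_chunk[cursor]
        cursor += 1
        # send `action` to the robot controller here
    print(f"{inference_calls} inference calls for {num_steps} control steps")


run_episode()
```

Under this kind of scheme, steady motions reuse cached predictions while abrupt ones trigger fresh inference, which is consistent with the reported reduction of inference counts by up to 50% without a performance drop.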
In simulation experiments on the LIBERO benchmark, which tests robotic manipulation across spatial, object, goal, and long-horizon tasks, Mantis achieved a 96.7% average success rate, outperforming all baseline models. As shown in Table 1 of the paper, it surpassed both vision-augmented baselines like CoT-VLA (81.1%) and non-vision-augmented ones like OpenVLA (76.5%). Figure 5 illustrates that Mantis converged faster than traditional visual foresight approaches, reaching high success rates within fewer training epochs. Real-world tests on an Agilex robotic platform further validated its superiority: Mantis accurately followed in-domain instructions and generalized to out-of-domain ones, such as putting a cup on 'Taylor Swift' after training on 'female singer', outperforming the leading open-source model π0.5. Figure 6 shows Mantis achieving higher success counts across three scenarios, with particularly strong performance on tasks requiring world knowledge and basic reasoning.
The implications of this research extend beyond robotics to broader AI applications where combining perception, language, and action is essential. By decoupling visual foresight from action learning, Mantis reduces training costs and improves efficiency, making it feasible to deploy such models in real-world settings like manufacturing, healthcare, or domestic assistance. The model's ability to maintain language understanding while learning actions could lead to more intuitive human-robot interactions, where robots can interpret complex commands and adapt to new situations without extensive retraining. The release of code and weights as open-source resources also supports community-driven advancements, potentially accelerating innovation in embodied AI and multimodal systems.
Despite its successes, Mantis has limitations noted in the paper, such as minor motion rollbacks in real-world scenarios due to the lack of robot state inputs like joint positions or force feedback, which can affect precision in tasks requiring fine-grained control. Future work may integrate richer sensory data, such as 3D point clouds, to enhance robustness and further optimize inference speed. The researchers also highlight that while language supervision preserves reasoning abilities (ablation studies show a language-unsupervised variant performing poorly on out-of-domain instructions), there is room to improve how these capabilities scale with even more diverse training data. Addressing these limitations will be crucial for deploying Mantis-like systems in unpredictable environments where reliability and adaptability are paramount.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.