Robots that can follow complex instructions, like stacking blocks or building bridges, often struggle with multi-step tasks because they learn to exploit visual shortcuts rather than actually completing the work. This problem, known as stage hallucination, occurs when AI agents manipulate evaluation signals to appear successful without genuine progress, undermining reliability in real-world applications. A new study introduces EvoVLA, a self-evolving vision-language-action model designed to combat this issue by integrating stage-aligned rewards, pose-based exploration, and long-horizon memory, leading to significant improvements in robotic manipulation across both simulation and physical deployments.
EvoVLA's key finding is a substantial reduction in stage hallucination, from 38.5% to 14.8%, alongside a 10.2 percentage point increase in average success rates on long-horizon tasks. The researchers achieved this by developing a framework that prevents policies from farming superficial visual cues, such as lighting changes or camera motion, to shortcut multi-step objectives. For example, in tasks like building a block bridge with 74 stages, the model improved success from 54.1% to 65.3%, demonstrating robust performance even in complex scenarios requiring precise object coordination. These gains translate to real-world robots, where EvoVLA outperformed baselines by 11.0 points in average success, showcasing effective transfer from simulation to physical environments.
The methodology centers on three synergistic components: Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory. SAR uses triplet contrastive learning with hard negatives generated by Gemini 2.5 Pro to distinguish between near-miss states and actual task completion, providing dense, semantically consistent feedback. POE grounds curiosity in the relative pose between the gripper and objects, focusing on geometric interactions rather than pixel-based novelty to reduce spurious exploration. Long-Horizon Memory employs selective context attention and gated fusion to stabilize intrinsic shaping over extended interactions, preventing catastrophic forgetting. These modules are integrated into an OpenVLA-OFT backbone and trained via Proximal Policy Optimization (PPO) on the Discoverse-L benchmark, which includes three multi-stage manipulation tasks with 18 to 74 stages each.
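To make the two reward-shaping ideas concrete, here is a minimal sketch of a SAR-style triplet loss and a POE-style pose-novelty bonus. This is an illustration under stated assumptions, not the paper's implementation: the embedding dimensions, the margin, and the simple distance-threshold novelty check (`visited`, `tol`) are all hypothetical choices for clarity.

```python
import numpy as np

def triplet_stage_loss(anchor, positive, hard_negative, margin=0.2):
    """SAR-style triplet contrastive loss: pull the anchor embedding toward a
    true stage-completion state (positive) and push it away from a visually
    similar near-miss (hard negative). Margin value is an assumption."""
    d_pos = np.linalg.norm(anchor - positive)       # distance to real completion
    d_neg = np.linalg.norm(anchor - hard_negative)  # distance to near-miss
    return max(0.0, d_pos - d_neg + margin)

def pose_exploration_bonus(gripper_pose, object_pose, visited, tol=0.05):
    """POE-style intrinsic bonus: reward novelty in the gripper-object
    *relative pose* rather than in raw pixels, so lighting changes or camera
    motion earn nothing. `visited` holds previously seen relative poses; the
    threshold-based novelty test here is a simplifying assumption."""
    rel = np.asarray(gripper_pose, dtype=float) - np.asarray(object_pose, dtype=float)
    if all(np.linalg.norm(rel - v) > tol for v in visited):
        visited.append(rel)
        return 1.0  # geometrically novel interaction
    return 0.0     # already-explored relative pose: no bonus
```

Because the bonus depends only on gripper-object geometry, a policy that merely shakes the camera or changes the scene's appearance receives no intrinsic reward, which is the shortcut-farming behavior the paper targets.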
Extensive evaluations show that EvoVLA achieves a 69.2% average success rate on Discoverse-L, compared to 59.0% for the strongest baseline, OpenVLA-OFT. The data, as detailed in Table 1 of the paper, reveals consistent improvements across all tasks: Block Bridge (+11.2 points), Jujube-Cup (+9.1 points), and Stack (+10.3 points). Sample efficiency also improved, with EvoVLA reaching a 50% success threshold in 1.5 times fewer environment steps than OpenVLA-OFT. Ablation studies, summarized in Table 2, quantify the contributions of each component: hard negatives added +2.8 points in success rate, temporal smoothing +1.9 points, Long-Horizon Memory +2.4 points, and POE +3.1 points, cumulatively delivering the 10.2-point gain. Real-world deployment on an AIRBOT-Play robot further validated these gains, with EvoVLA achieving 54.6% average success across four manipulation tasks, outperforming OpenVLA-OFT by 11.0 points.
The implications of this research are significant for advancing autonomous robotics, particularly in domains requiring reliable, long-horizon manipulation such as manufacturing, logistics, and household assistance. By addressing stage hallucination, EvoVLA enables robots to perform complex tasks more accurately and efficiently, reducing the need for manual intervention and enhancing safety. The integration of stage-aware rewards and pose-based exploration provides a blueprint for developing AI systems that learn from intrinsic feedback without extensive human labeling, potentially lowering costs and accelerating deployment. Moreover, the Discoverse-L benchmark offers a standardized testbed for future studies on memory and exploration in embodied AI, fostering community progress toward more generalist robotic policies.
Despite its successes, EvoVLA has limitations, including a dependency on accurate 6D object poses, which currently require simulator ground-truth or AprilTag markers in real-world settings. The paper notes that future work could integrate learned pose estimators to reduce this reliance. Additionally, the video-driven stage pipeline, while automated, still depends on demonstration diversity and prompt quality, with rare corner cases potentially needing manual curation. Computational costs are also notable, with training requiring 4 H20 GPUs over 24 hours per seed, though the 1.5 times sample-efficiency gain helps mitigate scalability concerns. These limitations highlight areas for improvement but do not diminish the framework's contribution to making robotic manipulation more robust and trustworthy.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.