Robots powered by artificial intelligence are becoming increasingly capable, but their ability to process continuous visual streams in real time remains a bottleneck. Vision-Language-Action (VLA) models, which enable robots to understand language instructions and execute actions based on visual input, carry heavy computational costs that hinder deployment in dynamic settings. A new approach, called VLA-Pruner, addresses this by intelligently pruning redundant visual tokens—the data representations of image patches—without sacrificing the robot's ability to perform complex tasks. This aligns with the dual nature of VLA models, which must balance high-level semantic understanding with precise low-level action execution, offering a practical solution for efficient robotic inference.
The researchers found that existing token pruning methods, designed for Vision-Language Models (VLMs), are ill-suited for VLA applications because they rely solely on semantic salience metrics, such as prefill attention scores. This bias causes them to discard visual information critical for action generation, leading to significant performance drops, especially at high pruning ratios. For instance, in experiments on the LIBERO benchmark, methods like FastV and SparseVLM degraded success rates as pruning increased, while VLA-Pruner maintained up to 88.9% relative accuracy even when retaining only 12.5% of visual tokens. At a 50% prune ratio, VLA-Pruner even improved performance in some cases, such as on the LIBERO-Long suite, by filtering out noise, thereby stabilizing policy execution.
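To make the contrast concrete, here is a minimal sketch of the semantic-salience-only baseline the paper argues against: rank visual tokens by their prefill attention from text tokens and keep only the top fraction. The function name, shapes, and the mean-over-text-tokens aggregation are illustrative assumptions, not the exact formulation used by FastV or SparseVLM.

```python
import numpy as np

def semantic_only_prune(prefill_attn, keep_ratio):
    """Illustrative VLM-style baseline (assumed shapes/names):
    rank visual tokens purely by prefill attention received from
    text tokens, then keep the top `keep_ratio` fraction.

    prefill_attn: (num_text_tokens, num_visual_tokens) attention weights.
    Returns sorted indices of the retained visual tokens.
    """
    # Semantic salience: average attention each visual token receives
    salience = prefill_attn.mean(axis=0)
    k = max(1, int(keep_ratio * salience.size))
    keep = np.argsort(salience)[-k:]  # indices of the k most salient tokens
    return np.sort(keep)
```

Because this criterion ignores what the action decoder attends to, tokens that matter for low-level control (e.g. the gripper's immediate surroundings) can be dropped whenever they are not semantically salient, which is precisely the failure mode VLA-Pruner targets.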
VLA-Pruner employs a dual-level importance criterion to select which visual tokens to retain. It uses vision-language prefill attention scores to gauge semantic-level relevance and action decode attention scores to assess action-level importance. Since action decode attention is unavailable during the prefill stage, VLA-Pruner leverages temporal continuity in robot manipulation, estimating the current action attention from recent timesteps via a decaying window average. This smooths action attention scores over a short window, with a decay rate of 0.8 chosen from empirical analysis. The token selection strategy then combines max-relevance pooling, which takes the union of tokens salient at either the semantic or the action level, with min-redundancy filtering to reduce feature overlap, yielding a compact yet informative set of tokens within a given compute budget.
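The pipeline above can be sketched in a few lines. This is a simplified reconstruction under stated assumptions: the exact weighting scheme, the top-k pooling rule, and the cosine-similarity redundancy measure are my illustrative choices, not necessarily the paper's formulation; only the decay rate of 0.8 and the overall structure (temporal estimate, union pooling, redundancy filtering) come from the description.

```python
import numpy as np

def estimate_action_attention(history, decay=0.8):
    """Decaying window average over recent timesteps' action decode
    attention. history[0] is the most recent timestep; older entries
    are down-weighted by powers of `decay` (0.8 per the paper)."""
    weights = np.array([decay ** i for i in range(len(history))])
    weights /= weights.sum()
    return sum(w * h for w, h in zip(weights, history))

def select_tokens(semantic, action_est, features, budget):
    """Max-relevance pooling followed by min-redundancy filtering
    (illustrative sketch). `semantic` and `action_est` are per-token
    salience vectors; `features` holds each token's embedding."""
    # Max-relevance pooling: union of tokens salient at either level
    top_sem = set(np.argsort(semantic)[-budget:])
    top_act = set(np.argsort(action_est)[-budget:])
    pool = sorted(top_sem | top_act)
    # Min-redundancy filtering: greedily drop the token whose feature
    # is most similar to another retained token until we fit the budget
    while len(pool) > budget:
        F = features[pool]
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
        sim = F @ F.T
        np.fill_diagonal(sim, -1.0)
        drop = int(np.argmax(sim.max(axis=1)))  # most redundant token
        pool.pop(drop)
    return pool
```

Note the design trade-off this exposes: union pooling can select up to twice the budget, so the redundancy filter is what enforces the final compute budget while preserving coverage of both salience signals.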
Experiments demonstrate VLA-Pruner's effectiveness across multiple VLA architectures and robotic tasks. On the LIBERO benchmark, using models like OpenVLA and OpenVLA-OFT, VLA-Pruner achieved state-of-the-art performance among pruning methods, with success rates of 85.4% on Spatial tasks and 51.8% on Long tasks at a 25% token retention ratio, surpassing all baselines. It also delivered substantial efficiency gains, reducing FLOPs to about 40% of the original model at 25% retention and achieving up to 1.8× speedup in inference latency. In real-world tests with a 6-DoF xArm6 robot, VLA-Pruner maintained high task success rates, such as 97.5% on Can Stack under a 75% prune ratio, showcasing its practicality for on-robot deployment without retraining or architectural changes.
The implications of this research are significant for advancing embodied AI and robotic applications. By reducing computational overhead, VLA-Pruner enables more efficient real-time operation of robots in resource-constrained environments, from industrial automation to household assistance. Its plug-and-play nature allows it to be integrated into various state-of-the-art VLA architectures, including autoregressive policies like OpenVLA and diffusion-head models like π0, enhancing generalizability. However, the method's reliance on temporal smoothing may face challenges in highly dynamic scenarios with rapid viewpoint shifts, where attention patterns change abruptly. Future work could explore adaptive prediction modules to better handle such cases, further refining the balance between efficiency and performance across diverse robotic settings.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.