
AI Drives Like Humans by Focusing on What Matters

A new AI method for autonomous vehicles mimics human attention, pruning 85% of visual data while maintaining 94% accuracy, making complex driving decisions feasible on standard hardware.

AI Research
March 31, 2026
4 min read

Autonomous driving systems face a critical bottleneck: processing vast amounts of visual data from multiple cameras over time without overwhelming computational resources. This bottleneck has limited the deployment of advanced AI models in real-world vehicles, where efficiency is as crucial as accuracy. A new approach, ETA-VLA, addresses it by mimicking how human drivers allocate attention, dynamically focusing on task-relevant information while discarding redundant details. This innovation could accelerate the adoption of sophisticated vision-language-action models in everyday cars, balancing safety with practicality.

The researchers developed ETA-VLA to tackle the 'token bloat' problem in vision-language-action models, where processing historical multi-view frames leads to a quadratic explosion in computational demands. Their key finding is that by pruning up to 85% of visual tokens—the data units representing image patches—the model reduces inference FLOPs by 61% while retaining 94% of the original accuracy on the NAVSIM v2 benchmark. Specifically, on the Navtest benchmark, ETA-VLA achieved an Extended Predictive Driver Model Score (EPDMS) of 85.0, closely approaching human expert performance of 90.3, and outperformed state-of-the-art baselines like DiffusionDrive (84.3 EPDMS) and GTRS w/ SimScale (84.6 EPDMS). This demonstrates that aggressive sparsification, when guided by semantic relevance, does not compromise driving quality.
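Why does dropping 85% of visual tokens cut FLOPs by 61% rather than 85%? Self-attention cost grows quadratically with sequence length, but other costs (feed-forward layers, unpruned text tokens, layers before pruning kicks in) scale linearly or not at all. The toy estimate below illustrates the effect; the token counts, model width, and cost formulas are illustrative assumptions, not figures from the paper.

```python
def attention_flops(n_tokens, d_model):
    # QK^T and the attention-weighted sum over V each cost ~n^2 * d
    # multiply-adds, so this term is quadratic in sequence length.
    return 2 * n_tokens**2 * d_model

def mlp_flops(n_tokens, d_model, expansion=4):
    # The feed-forward block is linear in sequence length.
    return 2 * n_tokens * d_model * (expansion * d_model)

def layer_flops(n_tokens, d_model=1024):
    return attention_flops(n_tokens, d_model) + mlp_flops(n_tokens, d_model)

# Hypothetical budget: prune 85% of 2,000 visual tokens while keeping
# 300 instruction/text tokens untouched.
dense = layer_flops(2000 + 300)
pruned = layer_flops(int(2000 * 0.15) + 300)
print(f"Fraction of FLOPs remaining per layer: {pruned / dense:.0%}")
```

The exact saving in the paper (39% of FLOPs remaining) also depends on which layer pruning is applied at, since earlier layers still process the full token set.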

The methodology combines two components: a Temporal Fusion Module (TFM) and an Intra-LLM Sparse Aggregator (ILSA). The TFM compresses historical frames from multiple cameras into a concise representation using learnable time weights, curbing the otherwise linear growth of tokens over time. For example, it aggregates features from past frames via a transformer-based encoder, producing a single visual feature map that preserves critical motion cues. The ILSA, integrated into the Large Language Model backbone, then performs fine-grained token selection. It uses a RoPE-free semantic scoring mechanism to evaluate visual tokens based on their relevance to driving instructions, without positional bias, and employs a diversity-preserving recycling strategy to retain spatially unique tokens, ensuring comprehensive scene awareness. This mimics human attention allocation, as visualized in Figure 1, where the model prioritizes front-view regions during straight driving and balances multiple views during turns.
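The two stages can be sketched in a few lines of numpy. This is a minimal illustration of the idea, not the authors' implementation: the function names, shapes, softmax-weighted fusion, dot-product scoring, and fixed recycling budget are all simplifying assumptions.

```python
import numpy as np

def temporal_fusion(frames, time_logits):
    # frames: (T, N, D) visual features from T historical frames.
    # Learnable per-frame weights (here a softmax over time_logits)
    # collapse the time axis into one feature map, so the token count
    # stays constant no matter how long the history is.
    w = np.exp(time_logits) / np.exp(time_logits).sum()
    return np.einsum("t,tnd->nd", w, frames)          # (N, D)

def sparse_aggregate(tokens, instruction, keep_ratio=0.15, recycle=8):
    # tokens: (N, D) fused visual tokens; instruction: (D,) text embedding.
    # 1) Position-free semantic scoring: rank tokens by relevance to the
    #    driving instruction, with no positional bias in the score.
    scores = tokens @ instruction
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep = set(np.argsort(scores)[-n_keep:])
    # 2) Diversity-preserving recycling: rescue the pruned tokens least
    #    similar to anything kept, so spatially unique regions survive.
    pruned = [i for i in range(len(tokens)) if i not in keep]
    kept_mat = tokens[list(keep)]
    novelty = [-np.max(kept_mat @ tokens[i]) for i in pruned]
    for i in np.argsort(novelty)[-recycle:]:
        keep.add(pruned[i])
    return tokens[sorted(keep)]
```

In the real model the recycling criterion and scoring operate inside the LLM's attention layers; the sketch only shows why instruction-aware selection plus a diversity pass can keep a small token budget while still covering every camera view.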

Extensive experiments on the NAVSIM v2 benchmark show that ETA-VLA not only matches but exceeds dense baselines in efficiency and performance. As detailed in Table III, applying ILSA at Layer 4 with a 35% pruning rate reduces GFLOPs from 9,105 to 6,190—a 32% reduction—while increasing EPDMS to 85.0, compared to 84.8 for the dense baseline. On the more challenging Navhard benchmark, ETA-VLA achieved a state-of-the-art EPDMS of 48.0, outperforming GTRS w/ SimScale (47.2) and demonstrating robustness in pseudo closed-loop scenarios. Ablation studies in Table IV confirm the necessity of both components: disabling temporal fusion dropped EPDMS to 77.2, and compared to alternatives like SparseVLM (83.3 EPDMS), ETA-VLA's pruning strategy proved superior due to its dynamic, instruction-aware selection.

The implications of this work are significant for the automotive industry, as it enables high-reasoning AI models to run on resource-constrained vehicle hardware without sacrificing safety or accuracy. By reducing computational demands, ETA-VLA makes it feasible to deploy advanced autonomous driving systems that can process complex spatiotemporal data in real time, potentially lowering costs and energy consumption. This approach generalizes beyond driving, offering a blueprint for efficient multimodal AI in other domains like robotics or surveillance, where balancing detail with efficiency is critical. The human-like attention mechanism also enhances interpretability, as the model's focus aligns with intuitive driving priorities.

Despite its advancements, the study acknowledges limitations. The model's performance degrades with excessive pruning, such as applying ILSA at multiple layers, which disrupts feature continuity and leads to significant drops in EPDMS, as seen with scores falling to 78.9 when pruning at both Layers 2 and 4. Additionally, the experiments are confined to the NAVSIM v2 benchmark, and real-world validation in diverse driving conditions remains untested. The paper notes that pruning at deeper layers (e.g., Layer 6 or 8) causes severe performance loss, indicating that critical planning primitives in later stages are vulnerable to information loss. Future work could explore adaptive pruning rates or integration with other efficiency techniques to further optimize for edge devices.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn