AIResearch

AI Speeds Up Video Analysis by Pruning Unnecessary Visual Data

A new method called ReDiPrune reduces computational costs in AI models by selecting only the most relevant visual tokens, improving efficiency without sacrificing accuracy—and sometimes even boosting performance.

AI Research
March 29, 2026
3 min read

Multimodal large language models, which combine text with images and videos, have become powerful tools for tasks like video understanding and image analysis. However, their efficiency is often hampered by the need to process thousands of visual tokens, leading to high computational costs and slow inference times. A new approach called ReDiPrune addresses this issue by pruning unnecessary visual tokens before they are processed, offering a plug-and-play solution that works without retraining the model. This could make AI systems faster and more accessible for real-world applications, from automated video analysis to interactive assistants.

ReDiPrune operates by selecting a subset of visual tokens directly from the vision encoder's output, before these tokens are projected into the language model's embedding space. The selection is guided by a lightweight scoring system that balances two factors: text relevance, which ensures tokens are aligned with the user's query, and visual diversity, which prevents redundancy by choosing distinct patches. For example, in a video, it might focus on frames where an action occurs, ignoring static or background scenes. This pre-projection strategy preserves fine-grained semantic details that can be lost in other methods, as shown in Figure 1 of the paper, where ReDiPrune maintains accuracy while reducing computation.
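The selection logic described above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the cosine-similarity scores, the `alpha` trade-off weight, and the greedy max-similarity redundancy penalty are all assumptions standing in for ReDiPrune's actual scoring.

```python
import numpy as np

def prune_tokens(visual_tokens, text_embedding, keep_ratio=0.15, alpha=0.5):
    """Keep a subset of visual tokens by balancing text relevance
    against visual diversity (hypothetical sketch of the idea).

    visual_tokens: (N, D) array of vision-encoder outputs (pre-projection)
    text_embedding: (D,) embedding of the user's query
    """
    n_keep = max(1, int(len(visual_tokens) * keep_ratio))

    # Normalize so dot products become cosine similarities.
    tok = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)

    relevance = tok @ txt  # text-relevance score for each token

    # Greedy selection: seed with the most query-relevant token, then
    # repeatedly add the token that best trades off relevance against
    # similarity to already-kept tokens (a diversity penalty).
    kept = [int(np.argmax(relevance))]
    candidates = set(range(len(tok))) - set(kept)
    while len(kept) < n_keep:
        kept_mat = tok[kept]
        best, best_score = None, -np.inf
        for i in candidates:
            redundancy = float(np.max(kept_mat @ tok[i]))
            score = alpha * relevance[i] - (1 - alpha) * redundancy
            if score > best_score:
                best, best_score = i, score
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)
```

With `keep_ratio=0.15` this retains roughly the 15% token budget reported in the paper's video experiments; the language model then only sees the surviving tokens, which is where the compute savings come from.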

The researchers tested ReDiPrune on multiple benchmarks using models like Video-LLaVA-7B and LLaVA-NeXT-Video-7B. On video tasks such as EgoSchema and NextQA, it retained only 15% of visual tokens yet improved accuracy by up to 2.0% while cutting computation by more than 6 times in terms of TFLOPs. For instance, with LLaVA-NeXT-Video-7B, it increased accuracy on EgoSchema from 43.6% to 45.6% and boosted WUPS on NextQA from 26.33 to 26.42, all while reducing TFLOPs from 29.92 to 4.43. On image benchmarks like GQA and ScienceQA-IMG, ReDiPrune also performed competitively, often matching or exceeding other pruning methods under strict token budgets, as detailed in Table 2 of the paper.

This efficiency gain has practical implications for deploying AI in resource-constrained environments. By reducing latency and memory usage, ReDiPrune could enable faster video analysis for applications like surveillance, content moderation, or educational tools. The paper notes that on ActivityNet-QA, it lowered end-to-end latency from 0.447 seconds to 0.146 seconds for Video-LLaVA-7B, as shown in Table 3. Moreover, ReDiPrune's training-free nature means it can be easily integrated into existing systems without costly fine-tuning, making it a versatile tool for improving multimodal AI performance across diverse tasks.

Despite its advantages, ReDiPrune has limitations. The experiments focused primarily on LLaVA-style architectures, and its frame-wise selection approach may not fully optimize for spatio-temporal contexts in more complex videos. The paper acknowledges that extending to other multimodal backbones and developing adaptive token budgeting strategies are areas for future work. Additionally, while it improves efficiency, the small overhead from text-guided scoring means it is slightly slower than text-agnostic methods like DivPrune, though it compensates with better accuracy, as highlighted in the ablation studies.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn