As videos grow longer and more complex, artificial intelligence systems struggle to process them efficiently because of the sheer number of visual tokens involved. This bottleneck limits applications in video analysis, from content moderation to automated summarization. Researchers have introduced FOCUS, a training-free method that selects key frames based on their relevance to the query, improving question-answering accuracy while using fewer than 2% of frames on benchmarks such as LongVideoBench and Video-MME.
The key finding is that FOCUS identifies the most informative frames in long videos without exhaustive computation. By treating video clips as arms in a multi-armed bandit problem, it applies an optimistic upper-bound selection strategy to prioritize regions with high relevance or uncertainty. This approach ensures that AI models focus on critical moments, such as specific actions or objects mentioned in a query, rather than sampling frames uniformly and missing essential content.
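For readers who prefer symbols, the optimism principle can be written as an upper-confidence index for each clip. The expression below is the textbook empirical Bernstein form from the bandit literature, shown as an illustration; the exact constants used by FOCUS may differ. The notation is introduced here and is not taken from the paper: \hat{\mu}_a and \hat{\sigma}_a^2 are the empirical mean and variance of the relevance scores sampled so far from clip a, n_a is the number of frames scored in that clip, b is the range of the scores, and \delta is the allowed failure probability.

```latex
\mathrm{UCB}_a \;=\; \hat{\mu}_a
  \;+\; \sqrt{\frac{2\,\hat{\sigma}_a^{2}\,\ln(3/\delta)}{n_a}}
  \;+\; \frac{3\,b\,\ln(3/\delta)}{n_a}
```

The first term favors clips that already look relevant, while the two remaining terms shrink as n_a grows, so a clip that has barely been sampled keeps a high index until it has been examined.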
Methodologically, FOCUS partitions videos into fixed-length clips and formulates frame selection as a combinatorial pure-exploration task. It uses empirical Bernstein confidence bounds to estimate frame-query relevance scores, guiding a two-stage process: coarse exploration to filter out irrelevant areas, followed by fine exploitation to select the top-scoring frames within promising segments. Because the technique is model-agnostic, it plugs into existing multimodal large language models (MLLMs) such as GPT-4o and Qwen2-VL without retraining and with minimal overhead.
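The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters (clip_len, probe_budget, keep_clips, frames_per_clip), and the choice to rank clips by empirical mean at the end are all assumptions made for illustration, and score_fn stands in for whatever frame-query relevance scorer is used, assumed here to return values in [0, 1].

```python
import math
import random


def bernstein_radius(samples, delta=0.05, value_range=1.0):
    """Textbook empirical Bernstein confidence radius for bounded scores."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * value_range * log_term / n


def select_frames(num_frames, score_fn, clip_len=32, probe_budget=64,
                  keep_clips=2, frames_per_clip=4, seed=0):
    """Hypothetical coarse-to-fine key-frame selection over fixed-length clips."""
    rng = random.Random(seed)
    clips = [list(range(start, min(start + clip_len, num_frames)))
             for start in range(0, num_frames, clip_len)]
    scored = [dict() for _ in clips]  # per clip: frame index -> relevance score

    def probe(c):
        # Score one not-yet-seen frame drawn at random from clip c, if any remain.
        unseen = [f for f in clips[c] if f not in scored[c]]
        if unseen:
            f = rng.choice(unseen)
            scored[c][f] = score_fn(f)

    def mean(c):
        vals = list(scored[c].values())
        return sum(vals) / len(vals)

    def ucb(c):
        # Optimistic index: empirical mean plus confidence radius.
        return mean(c) + bernstein_radius(list(scored[c].values()))

    # Initialization: up to two probes per clip so a variance can be estimated.
    for c in range(len(clips)):
        probe(c)
        probe(c)

    # Stage 1 (coarse exploration): spend the budget on the most optimistic clip.
    for _ in range(probe_budget):
        candidates = [c for c in range(len(clips)) if len(scored[c]) < len(clips[c])]
        if not candidates:
            break
        probe(max(candidates, key=ucb))

    # Keep the clips with the highest empirical mean (an assumption of this sketch).
    top = sorted(range(len(clips)), key=mean, reverse=True)[:keep_clips]

    # Stage 2 (fine exploitation): score every frame in the kept clips and
    # return the highest-scoring frames from each.
    selected = []
    for c in top:
        for f in clips[c]:
            if f not in scored[c]:
                scored[c][f] = score_fn(f)
        best = sorted(clips[c], key=lambda f: scored[c][f], reverse=True)
        selected.extend(best[:frames_per_clip])
    return sorted(selected)
```

As a usage example, calling select_frames(320, score_fn, keep_clips=1) with a toy scorer that returns high values only for frames 96 to 127 yields four frames from that range, and most frames outside the kept clip are never scored. In practice score_fn would wrap a vision-language similarity model, which this sketch leaves unspecified.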
Results from the paper show substantial gains in video question-answering accuracy. On LongVideoBench, FOCUS achieved an 11.9% improvement over baselines for videos longer than 30 minutes, with consistent benefits across short, medium, and long video categories. For instance, it boosted GPT-4o's performance by 3.2% and that of smaller models like Qwen2-VL-7B by 6.7%, using only 1.6% of frames on average. Visualizations in Figure 3 of the paper illustrate how FOCUS concentrates frames around query-relevant events, unlike uniform sampling, which spreads them thinly across the timeline.
In practical terms, this advancement makes long-video analysis more accessible and efficient, supporting real-world uses in surveillance, education, and media production, where processing hour-long content is common. FOCUS also reduces computational demands: in some cases it cut GPU hours from over 250 to just 5.5, a reduction of more than 45-fold, enabling faster, more cost-effective deployments without sacrificing accuracy.
Limitations noted in the paper include the assumption that frame-query scores are independent and identically distributed within clips, which may not hold for videos with strong temporal dependencies. Future work could explore extensions to Lipschitz or contextual bandit settings to address correlations between segments, potentially refining performance in more dynamic video environments.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.