AIResearch

AI Agents Learn to Watch Videos Like Humans

A new AI framework enables models to actively search for evidence in long videos, reducing errors and improving accuracy by mimicking human viewing habits.

AI Research
March 27, 2026
3 min read

A new approach to artificial intelligence is teaching machines to watch videos more like humans do, by actively searching for key moments instead of passively processing every frame. Researchers have developed LongVT, a framework that enables large multimodal models (LMMs) to reason over hours-long videos with greater reliability, addressing a critical weakness in current AI systems. The approach tackles hallucinations, where models generate incorrect or fabricated information, a problem that becomes especially acute in long-form content where evidence is sparse and spread across time. By incentivizing what the team calls "Thinking with Long Videos," the system mimics human cognitive strategies, such as skimming globally and then zooming in on relevant segments, to improve accuracy in tasks like video question answering and temporal grounding.

The core idea is that AI models can be trained to use a native video cropping tool to inspect specific parts of a video dynamically, leading to more grounded and correct answers. In experiments, LongVT consistently outperformed strong existing baselines across four challenging long-video understanding benchmarks: VideoMME, VideoMMMU, LVBench, and a newly curated benchmark called VideoSIAH-Eval. For example, on VideoSIAH-Eval, which involves open-ended questions requiring retrieval of fine-grained evidence from videos averaging about 1,688 seconds, LongVT achieved a score of 42.0, outperforming the second-best model by 6 points. This improvement narrows the gap with proprietary models like GPT-4o, bringing open-source AI closer to state-of-the-art performance in long-video reasoning.
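To make the "skim globally, then zoom in" loop concrete, here is a minimal Python sketch of an agentic crop-and-inspect cycle. All names here (`Step`, `crop_frames`, `answer_with_tool`, the model callable) are illustrative assumptions for this article, not LongVT's actual API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One model turn: either a crop request or a final answer."""
    kind: str            # "crop" or "answer"
    text: str = ""
    start_s: float = 0.0
    end_s: float = 0.0

def crop_frames(video_frames, fps, start_s, end_s, max_frames=32):
    """Densely resample frames inside a proposed time window."""
    lo = max(0, int(start_s * fps))
    hi = min(len(video_frames), int(end_s * fps))
    window = video_frames[lo:hi]
    if len(window) <= max_frames:
        return window
    stride = len(window) / max_frames
    return [window[int(i * stride)] for i in range(max_frames)]

def answer_with_tool(model, video_frames, fps, question, max_rounds=3):
    """Start from a sparse global skim; let the model request dense
    crops of time windows it proposes before committing to an answer."""
    context = video_frames[:: max(1, len(video_frames) // 16)]  # global skim
    for _ in range(max_rounds):
        step = model(question, context)
        if step.kind == "answer":
            return step.text
        # The model proposed a window: zoom in with densely sampled frames.
        context = context + crop_frames(video_frames, fps, step.start_s, step.end_s)
    return "no answer within tool-call budget"
```

In the soccer example from the paper, a loop like this is what lets the model fetch the frames around the goal itself rather than guessing from a sparse skim.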

The methodology involves a three-stage training strategy that teaches models to interleave reasoning with on-demand temporal retrieval. First, a cold-start supervised fine-tuning stage equips the base model with fundamental capabilities: proposing precise time windows for relevant events, reasoning over densely resampled frames within those windows, and self-correcting when a window is suboptimal. This stage uses a dataset of 247.9K samples for tool-integrated training. Second, an agentic reinforcement learning stage optimizes the model's decisions using a joint reward function that considers answer accuracy, format compliance, and temporal grounding precision, encouraging exploratory rollouts with improved localization. Third, an agentic reinforcement fine-tuning stage consolidates these behaviors by training on high-quality rollout traces from the RL phase, further stabilizing and enhancing performance.
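The joint reward in the RL stage can be sketched as a weighted sum of the three signals the article names. The weights and helper names below are assumptions for illustration; the paper's exact formulation and coefficients may differ:

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start_s, end_s) time windows,
    a common proxy for temporal grounding precision."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def joint_reward(answer_correct, format_ok, pred_window, gold_window,
                 w_acc=1.0, w_fmt=0.2, w_loc=0.5):
    """Combine answer accuracy, format compliance, and grounding
    precision into a single scalar reward for a rollout."""
    reward = w_acc * float(answer_correct) + w_fmt * float(format_ok)
    reward += w_loc * temporal_iou(pred_window, gold_window)
    return reward
```

A reward shaped this way pays the model not just for the right answer but for pointing at the right stretch of video, which is what pushes rollouts toward better localization.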

The results show significant gains in both accuracy and efficiency. Under dense frame sampling settings, LongVT variants outperformed existing open-source models by large margins, with LongVT-7B-RFT reaching scores of 67.0 on VideoMME and 43.7 on VideoMMMU. The framework's ability to reduce hallucinations is evident in qualitative examples: when asked which foot a French player used to score a volley in a soccer match, LongVT correctly identified the right foot by cropping and inspecting the relevant video segments, while a text-based reasoning approach failed for lack of visual evidence. Ablation studies confirmed that each training stage is crucial, with removal of any one leading to performance drops, and that the designed reward functions effectively promote better temporal localization without requiring additional tool rewards.

The implications of this research extend to real-world applications where reliable video analysis is essential, such as sports analytics, film understanding, and surveillance. By enabling AI to actively seek evidence rather than rely on passive processing, LongVT improves transparency and trustworthiness in automated systems. However, the study notes limitations, including the model's dependence on a curated dataset like VideoSIAH, which comprises 1,280 QA pairs with human validation, and potential memory-footprint challenges in recursive reasoning over ultra-long videos. Future work may explore multi-agent architectures to overcome context window constraints and further scale the approach to infinite video streams.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn