
AI System Detects Complex Events in Videos with Near-Instant Speed

New graph-based method identifies activities like falls and traffic patterns in milliseconds, enabling real-time video analysis for smart cities and safety monitoring.

AI Research
November 14, 2025
4 min read

In an era where billions of video streams are generated daily from security cameras, smartphones, and IoT devices, the ability to automatically detect complex events—like a person falling or traffic congestion—has remained a significant challenge. Current systems struggle with the unstructured nature of video data and the computational demands of real-time analysis. Researchers have now developed a method that not only accurately identifies these events but does so with remarkable speed, achieving detection in as little as 4 milliseconds. This breakthrough could transform applications from urban management to healthcare monitoring by providing immediate insights from live video feeds.

The key finding is that the researchers' Video Event Knowledge Graph (VEKG) approach can detect complex spatiotemporal patterns in video streams with high accuracy and sub-second latency. By converting video content into a structured graph representation, the system identifies objects, their attributes, and their interactions over time and space. This allows it to recognize events such as falls, handshakes, or high traffic volume by applying predefined rules to the graph structure. The method achieved F-Scores ranging from 0.44 to 0.90 across different event types, with particularly strong performance in fall detection (F-Score of 0.87 on the Le2i dataset) and high-volume traffic detection (F-Score of 0.90 on the DETRAC dataset).
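To make the idea concrete, here is a minimal sketch of what "applying a predefined rule to the graph structure" can look like. The node and edge schema, the "near" relation, and the distance threshold are all illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    obj_id: int
    label: str    # e.g. "person", "car" (from an object detector)
    bbox: tuple   # (x, y, w, h) in pixels

@dataclass
class FrameGraph:
    frame: int
    nodes: list
    edges: list = field(default_factory=list)  # (id_a, relation, id_b)

def spatial_edges(g: FrameGraph, max_dist: float = 50.0) -> FrameGraph:
    """Add a 'near' edge between objects whose bounding-box centres are close."""
    def centre(b):
        return (b[0] + b[2] / 2, b[1] + b[3] / 2)
    for a in g.nodes:
        for b in g.nodes:
            if a.obj_id < b.obj_id:
                (ax, ay), (bx, by) = centre(a.bbox), centre(b.bbox)
                if ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist:
                    g.edges.append((a.obj_id, "near", b.obj_id))
    return g

def match_rule(g: FrameGraph, label_a: str, relation: str, label_b: str):
    """Return object-id pairs satisfying a (label, relation, label) pattern."""
    by_id = {n.obj_id: n for n in g.nodes}
    return [(a, b) for a, rel, b in g.edges
            if rel == relation
            and {by_id[a].label, by_id[b].label} == {label_a, label_b}]
```

An event such as "person near car" then reduces to `match_rule(g, "person", "near", "car")` over each frame's graph, which is what makes the matching step so much cheaper than processing raw pixels.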

The methodology involves a hybrid approach that combines deep learning models with logical reasoning. First, object detectors like YOLOv3 identify items such as people or cars in each video frame, while trackers like DeepSORT follow these objects across frames. Attributes like color or type are extracted using classifiers. This information is then structured into a knowledge graph where nodes represent objects and edges capture their spatial and temporal relationships. For efficiency, the researchers introduced an optimized version called VEKG-Time Aggregated Graph (VEKG-TAG), which summarizes the graph over time windows, reducing redundant data. This graph-based representation acts as an intermediate bridge between low-level video pixels and high-level human-understandable events, enabling the system to reason about complex patterns without processing raw video data directly.
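The time-aggregation step can be sketched as follows: every per-frame observation of an object inside a window collapses into a single summary node, so rule matching scans far fewer nodes. The field names and summary statistics here are illustrative assumptions in the spirit of VEKG-TAG, not the paper's exact schema:

```python
from collections import defaultdict

def aggregate_window(observations):
    """Collapse per-frame detections into one summary node per object.

    observations: list of (frame, obj_id, label, bbox) tuples for one
    time window, e.g. as produced by a detector plus tracker pipeline.
    """
    per_obj = defaultdict(list)
    for frame, obj_id, label, bbox in observations:
        per_obj[obj_id].append((frame, label, bbox))

    summary = {}
    for obj_id, obs in per_obj.items():
        obs.sort()  # chronological order by frame number
        summary[obj_id] = {
            "label": obs[0][1],
            "first_frame": obs[0][0],
            "last_frame": obs[-1][0],
            "first_bbox": obs[0][2],   # where the object entered the window
            "last_bbox": obs[-1][2],   # where it ended up
            "n_detections": len(obs),
        }
    return summary
```

With, say, 30 frames per second and a 10-second window, a single tracked car contributes one summary node instead of up to 300 per-frame nodes, which is the kind of reduction behind the node and edge savings reported below.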

The results demonstrate significant improvements in both accuracy and speed. In experiments on 801 video clips from datasets including HMDB, UCF-101, and DETRAC, the VEKG-TAG optimization reduced the number of nodes by up to 99% and edges by 93%, leading to a 5.19 times faster search time compared to the standard VEKG. The median event pattern latency ranged from 4 to 20 milliseconds, making it suitable for real-time applications. For instance, in fall detection, the system identified abrupt changes in a person's aspect ratio and lack of motion, correctly detecting falls in scenarios where individuals collapsed multiple times. The construction time for VEKG-TAG increased by only 7.4% over VEKG, but the search time dropped dramatically, especially for longer video windows.
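The fall-detection logic described above — an abrupt change in aspect ratio followed by lack of motion — can be sketched as a simple rule over a tracked person's bounding boxes. The thresholds are illustrative, not the paper's tuned values:

```python
def detect_fall(bboxes, ratio_thresh=1.2, motion_thresh=5.0):
    """Flag a fall in a chronological list of (x, y, w, h) boxes for one person.

    A fall is signalled when a tall box (h > w) abruptly becomes a wide
    box (w > h) and the box then barely moves for the rest of the window.
    """
    for i in range(1, len(bboxes)):
        _, _, w0, h0 = bboxes[i - 1]
        x1, y1, w1, h1 = bboxes[i]
        # abrupt aspect-ratio flip: standing -> lying
        if h0 / w0 > ratio_thresh and w1 / h1 > ratio_thresh:
            # lack of motion: every later box stays near the flip position
            still = all(abs(bx - x1) + abs(by - y1) <= motion_thresh
                        for bx, by, _, _ in bboxes[i:])
            if still:
                return True
    return False
```

Because the rule only consumes box coordinates already stored in the graph, it runs in microseconds per track, consistent with the millisecond-scale pattern latencies reported above.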

This technology has immediate real-world implications. In smart cities, it could enable traffic management systems to detect congestion or jaywalking instantly, improving safety and flow. For healthcare, it offers a tool for monitoring elderly individuals, automatically alerting caregivers to falls. The method's ability to process video without relying on raw data also addresses privacy concerns, as analysis can occur on summarized graphs rather than sensitive footage. By handling dynamic queries—such as changing traffic thresholds—it adapts to user needs without retraining, making it versatile for various domains like activity recognition in sports or security surveillance.
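A dynamic query of the kind mentioned above — changing a traffic threshold without retraining — amounts to making the threshold a runtime parameter of the rule. This is a hypothetical sketch; the function and parameter names are assumptions for illustration:

```python
def high_traffic(vehicle_counts, window=5, threshold=20):
    """Flag sliding windows whose average vehicle count exceeds a threshold.

    vehicle_counts: per-frame vehicle tallies, read directly from the
    graph's summary nodes. Returns the start indices of flagged windows.
    """
    alerts = []
    for start in range(0, len(vehicle_counts) - window + 1):
        avg = sum(vehicle_counts[start:start + window]) / window
        if avg > threshold:
            alerts.append(start)
    return alerts
```

Tightening the rule for rush hour is then just a matter of calling the same function with a different `threshold` — no model is touched, which is the sense in which the system adapts to user needs without retraining.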

However, the approach has limitations. Its performance depends on the accuracy of underlying deep learning models; errors in object detection can lead to missed events or false positives. The current 2D calculations may not fully capture real-world complexities, such as interactions in three-dimensional spaces or scenarios with blind spots and moving cameras. Additionally, writing generalized rules for highly complex activities like cooking or juggling remains challenging, and the system may struggle with inconsistent camera fields of view. Future work aims to enhance the graph with 3D data and improve reasoning techniques to address these gaps.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn