Surveillance cameras now blanket our cities, but human operators can't possibly watch every feed. A new artificial intelligence system can automatically detect criminal activity in video footage with remarkable precision, potentially transforming how we monitor public spaces.
The key finding from researchers at Sharif University of Technology is that their AI system achieves 92.41% accuracy in identifying five types of human activity—from normal behavior to crimes like arson, burglary, fighting, and explosions. This represents a significant improvement over previous methods, which typically achieved between 62% and 86% accuracy on similar tasks.
The methodology cleverly focuses on what matters most: people. The system first uses YOLO-World, an advanced object detector, to identify humans in each video frame. Think of it as a super-attentive security guard who instantly spots every person in view. Then, using ByteTrack, it follows each person across multiple frames, maintaining their identity even as they move through crowded scenes.
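To get a feel for how tracking-by-detection works, here is a minimal sketch of the core idea behind associating detections with existing tracks. Real ByteTrack also uses motion prediction and a two-stage match on high- and low-confidence detections; this simplified version (the `iou` and `associate` helpers are illustrative names, not the paper's code) just greedily matches each new detection to the track it overlaps most:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily assign each detection to the best-overlapping unused track."""
    matches, unmatched, used = {}, [], set()
    for d_idx, det in enumerate(detections):
        best_t, best_iou = None, iou_threshold
        for t_id, box in tracks.items():
            if t_id in used:
                continue
            score = iou(box, det)
            if score > best_iou:
                best_t, best_iou = t_id, score
        if best_t is None:
            unmatched.append(d_idx)  # no track overlaps enough: start a new one
        else:
            matches[best_t] = d_idx
            used.add(best_t)
    return matches, unmatched
```

A detection that overlaps a known track keeps that track's identity; one that matches nothing starts a new track, which is how identities persist across crowded scenes.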
Here's the smart part: everything outside the human bounding boxes gets blurred using Gaussian blurring, similar to how photographers blur backgrounds to make subjects stand out. This background suppression eliminates distractions like changing lighting, weather conditions, or irrelevant movement, forcing the AI to concentrate on human behavior rather than environmental noise.
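The background-suppression step can be sketched in a few lines of NumPy: blur the whole frame, then paste the sharp pixels back inside each person's bounding box. This sketch substitutes a crude box blur for the paper's Gaussian blur so it stays dependency-free; the function names are illustrative, not from the paper:

```python
import numpy as np

def box_blur(img, k=5):
    """Crude separable-free box blur (stand-in for Gaussian blurring)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def suppress_background(frame, boxes, k=5):
    """Blur everything, then restore the original pixels inside each box."""
    out = box_blur(frame, k)
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]  # people stay sharp
    return out
```

Everything inside a box survives untouched while the rest of the frame is smoothed away, which is exactly the "blurred background, sharp subject" effect described above.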
The refined video clips then pass through InceptionV3, a neural network pre-trained on ImageNet that extracts detailed features about human appearance, posture, and context. Finally, a Bidirectional Long Short-Term Memory (BiLSTM) network analyzes sequences of 32 frames, capturing how actions evolve over time—crucial for distinguishing between normal walking and suspicious loitering, or between playful wrestling and actual fighting.
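The data flow through this stage can be pictured as tensor shapes. The sketch below assumes the standard 2048-dimensional pooled output of ImageNet-trained InceptionV3 and non-overlapping 32-frame windows (the windowing strategy is my assumption; the paper specifies only the 32-frame sequence length), with random vectors standing in for real per-frame features:

```python
import numpy as np

FRAMES_PER_CLIP = 32   # temporal window the BiLSTM analyzes
FEATURE_DIM = 2048     # standard InceptionV3 pooled feature size

def make_clips(num_frames, clip_len=FRAMES_PER_CLIP):
    """Split a frame sequence into non-overlapping fixed-length windows."""
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]

# Stand-in for per-frame InceptionV3 features: one 2048-d vector per frame.
video_features = np.random.rand(96, FEATURE_DIM)
clips = make_clips(len(video_features))
batch = np.stack([video_features[s:e] for s, e in clips])
# batch.shape == (3, 32, 2048): three clips, each a 32-step feature sequence
# that the BiLSTM reads forward and backward before classifying the activity.
```

Reading each 32-step sequence in both directions is what lets the model use an action's ending to reinterpret its beginning, the key to telling playful wrestling from a real fight.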
The results demonstrate exceptional performance across all metrics. As shown in Table I of the paper, the system maintained consistent accuracy across three independent trials (92.80%, 92.95%, and 91.48%), with minimal variance. The confusion matrix revealed very few cross-class errors, meaning the system rarely mistakes one type of activity for another.
Particularly impressive were the F1-scores—a balanced measure of precision and recall—which exceeded 0.85 for all activity types. The system excelled at detecting explosions (F1-score of 0.95) and burglary (0.92), while still performing well on more subtle activities like fighting (0.85) and arson (0.85). The ROC-AUC values, which measure discrimination capability, reached 0.98 across all classes, indicating near-perfect separation between normal and anomalous events.
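For readers unfamiliar with F1, it is the harmonic mean of precision (how many flagged events were real) and recall (how many real events were flagged), computed from raw counts. The counts in the example are illustrative, not taken from the paper:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)  # flagged events that were correct
    recall = tp / (tp + fn)     # real events that were caught
    return 2 * precision * recall / (precision + recall)

# Example: 90 true positives, 10 false alarms, 10 missed events.
print(f1_score(90, 10, 10))  # 0.9
```

Because the harmonic mean punishes imbalance, a system cannot buy a high F1 by over-flagging (hurting precision) or under-flagging (hurting recall), which is why scores above 0.85 on every class are a strong result.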
This technology matters because current surveillance systems often miss critical events due to human operator fatigue or the overwhelming volume of footage. The paper notes that continuous manual inspection is "labor-intensive, costly, and subject to fatigue, attention lapses, and inconsistency." An automated system that can flag potential crimes with 92% accuracy could help security personnel focus their attention where it's most needed, potentially preventing crimes before they escalate.
The system's foreground-focused approach also makes it more adaptable to different environments. By ignoring background variations, it performs consistently across diverse settings without needing retraining for each new location—a common limitation of previous methods.
However, the research acknowledges limitations. The system was tested on a curated five-class subset of the UCF-Crime dataset, and its performance on the full range of real-world anomalies remains unknown. The authors plan to extend their framework to incorporate additional modalities and test on the complete UCF-Crime dataset to evaluate broader anomaly recognition capabilities.
Another limitation involves computational requirements. While the system demonstrates practical potential, real-time deployment on resource-constrained surveillance systems may require optimization. The current implementation used NVIDIA T4 GPUs during testing, which may not be available in all surveillance setups.
Despite these limitations, the research provides a robust foundation for intelligent video analytics that prioritizes human behavior over environmental noise—an approach that could make our increasingly monitored world both safer and more efficient.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn