AIResearch
Robotics

AI Tracker Adapts to Camera Glitches in Real Time

A new method uses a three-state switch to handle RGB, infrared, and invalid video feeds, preventing object loss during sensor failures and achieving 65 FPS performance.

AI Research
March 27, 2026
4 min read

Object tracking in video is a fundamental task for surveillance, autonomous vehicles, and robotics, but it often fails when cameras switch between different modes or encounter technical glitches. A new AI system called SwiTrack addresses this by dynamically adapting to three distinct states: normal RGB video, near-infrared (NIR) footage, and invalid frames where the sensor is overexposed or malfunctioning. This approach prevents the common problem of object drift, where trackers lose their target during modality transitions, offering a more reliable solution for real-world applications where lighting and sensor conditions are unpredictable.

The researchers found that SwiTrack significantly outperforms existing trackers on cross-modal tracking benchmarks. On the Cross-Modal Object Tracking Benchmark (CMOTB), which includes 1,000 video sequences with modality switches, SwiTrack achieved a precision rate of 62.3% and a success rate of 60.4% on the joint set, representing gains of 7.2% in precision and 4.3% in success over the best prior cross-modal tracker. It also maintained real-time performance at 65 frames per second, making it practical for deployment in scenarios requiring quick responses, such as security monitoring or robotic navigation.

SwiTrack's methodology centers on a tri-state switch that classifies each video frame as RGB, NIR, or invalid. For RGB frames, the system uses a standard visual encoder based on a Vision Transformer (ViT) to extract features. For NIR frames, it activates a NIR gated adapter, a lightweight module that refines the infrared features by referencing dynamic template features from previous frames, ensuring consistency across modalities. The adapter employs a hierarchical gate to adjust weights at different network layers, with shallow features requiring more modification than deep semantic features. For invalid states, caused by overexposure during infrared illuminator activation, the system freezes visual features and instead uses a consistency trajectory prediction (CTP) module that leverages historical motion cues to estimate the target's position, preventing drift.
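The tri-state dispatch described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the classification heuristics (a near-white pixel fraction for "invalid", near-grayscale channels for NIR) and all names and thresholds here are assumptions for clarity.

```python
import numpy as np

# Hypothetical threshold: fraction of near-white pixels that flags overexposure.
OVEREXPOSED_FRACTION = 0.6

def classify_frame(frame: np.ndarray) -> str:
    """Classify an HxWx3 uint8 frame as 'invalid', 'nir', or 'rgb' (illustrative tri-state switch)."""
    # Invalid: a large share of near-white pixels suggests sensor overexposure.
    white = np.mean(np.all(frame > 240, axis=-1))
    if white > OVEREXPOSED_FRACTION:
        return "invalid"
    # NIR footage is rendered nearly grayscale: the color channels are almost identical.
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    if max(np.abs(r - g).mean(), np.abs(g - b).mean()) < 2:
        return "nir"
    return "rgb"

def track_step(frame: np.ndarray) -> str:
    """Route each frame to the branch the article describes for its state."""
    mode = classify_frame(frame)
    if mode == "rgb":
        return "vit_encoder"       # standard ViT feature extraction
    if mode == "nir":
        return "vit_plus_adapter"  # ViT features refined by the NIR gated adapter
    return "trajectory_only"       # freeze visual features; predict position via CTP
```

The key design point is that only one visual pathway is active per frame, which is consistent with the article's note that SwiTrack handles a single modality at a time rather than fusing two streams.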

Analysis, detailed in Table I of the paper, shows SwiTrack's superiority across the easy, hard, and joint subsets of CMOTB. Compared to single-modal trackers like DropTrack, it improved precision by 5.4% and success by 4.7% on the hard subset, which includes challenges like modality delay. Against multi-modal trackers like BAT, it achieved gains of 15.2% in precision and 14.6% in success, highlighting its advantage in handling a single modality at a time. Attribute-based analysis in Table II and Figure 6 further demonstrates robustness, with improvements in modality-related attributes (e.g., modality adaptation and mutation) and general tracking issues like occlusion. Visualization in Figure 7 confirms stable localization in challenging scenes, and feature analysis in Figure 8 shows more discriminative representations with the NIR adapter.

The implications of this research are significant for industries relying on continuous object tracking under variable conditions. By handling invalid states without losing targets, SwiTrack could enhance security systems that switch to night-vision modes, improve autonomous vehicle sensors in changing light, or support robotics operating in environments with intermittent sensor reliability. The real-world test using a cross-modal camera (MI AW300) in Figure 11 showed consistent tracking across visible, NIR, and invalid states, validating its practical applicability. The model's efficiency, at 65 FPS with 173 million parameters, also makes it suitable for edge devices where computational resources are limited.

Despite its advancements, the paper notes limitations. The approach is primarily validated on RGB-NIR tracking, and while a cross-dataset test on RGB-Thermal sequences showed some generalization, performance may vary with other modality pairs like depth or event streams. The invalid state detection relies on a pixel-based threshold for white counting, which, as shown in Table V, is robust but may require tuning for different cameras or environments. Additionally, the CTP module assumes linear motion models, which might not capture complex non-linear movements perfectly, though the use of an Extended Kalman Filter helps mitigate this. Future work could explore extending the tri-state framework to more diverse modalities and improving motion prediction under extreme dynamics.
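The linear-motion assumption behind the CTP module can be made concrete with a minimal constant-velocity extrapolation. This is a simplified stand-in: the paper uses an Extended Kalman Filter, and the function name and window size below are illustrative assumptions.

```python
import numpy as np

def predict_center(history, steps=1):
    """Extrapolate the target's center from recent motion under a
    constant-velocity assumption (a simplified sketch of trajectory
    prediction during invalid frames)."""
    pts = np.asarray(history, dtype=float)
    if len(pts) < 2:
        # Not enough history to estimate velocity; hold the last position.
        return tuple(pts[-1])
    # Average the displacement over the last few frames as the velocity.
    velocity = np.diff(pts[-3:], axis=0).mean(axis=0)
    return tuple(pts[-1] + steps * velocity)
```

A target moving steadily right, for example, is extrapolated along that line while the camera is overexposed; this is exactly where the linear assumption breaks down for sharp turns, which is why the paper's EKF and the proposed future work on extreme dynamics matter.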

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn