A new AI system can create detailed 3D maps of indoor environments in real time, addressing a critical need for assistive navigation technologies. Developed by researchers from Eötvös Loránd University and Pázmány Péter Catholic University, the framework processes video streams to build temporally coherent maps that track objects and detect changes, such as a chair becoming occupied or a bag being moved. This capability is essential for applications like helping visually impaired users navigate cluttered, unfamiliar spaces where scene elements frequently shift. The approach overcomes the high memory demands of previous systems, making it feasible on standard hardware like an NVIDIA RTX 4090 GPU with 24GB of VRAM, as detailed in the paper.
The key finding is that the system achieves memory-efficient, near-real-time 3D semantic mapping by processing video in blocks and aligning submaps. Unlike the standard Visual Geometry Grounded Transformer (VGGT), which struggles with long sequences because memory grows with sequence length (exceeding 40GB at 200 frames), the system partitions the input into non-overlapping blocks of, for example, 60 frames to stay within GPU limits. It uses VGGT to generate depth and pose estimates for each block, then aligns the blocks globally using keyframes selected with an ORB feature detector to suppress motion-blur noise. This allows the pipeline to handle sequences of over 1,000 frames, as demonstrated on the TUM RGB-D dataset with up to 1,400 frames, while maintaining geometric consistency and supporting streaming input.
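The block-partitioning and keyframe-selection steps described above can be sketched as follows. Note that the paper scores candidate keyframes with an ORB feature detector; this minimal sketch substitutes a generic gradient-magnitude sharpness proxy so it stays dependency-free, and the function names and block size are illustrative, not from the paper's code.

```python
import numpy as np

def partition_into_blocks(num_frames, block_size=60):
    """Split frame indices into non-overlapping blocks (the paper uses
    blocks of around 60 frames to stay within GPU memory limits)."""
    return [list(range(i, min(i + block_size, num_frames)))
            for i in range(0, num_frames, block_size)]

def select_keyframe(frames):
    """Pick the sharpest frame in a block. Blurry frames (e.g. from fast
    camera motion) have weak image gradients and score low, so they are
    suppressed, mirroring the role of the ORB feature count in the paper."""
    def sharpness(img):
        gy, gx = np.gradient(img.astype(np.float64))
        return float(np.mean(np.hypot(gx, gy)))
    scores = [sharpness(f) for f in frames]
    return int(np.argmax(scores))
```

Each block's keyframe then serves as the anchor for aligning that block's submap into the global map.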
The methodology involves a multi-step pipeline that integrates geometry, semantics, and change detection. First, the system divides the video stream into blocks, applies VGGT to predict depth maps and camera poses, and aligns submaps using a similarity transform based on depth scaling from LiDAR data when available. For semantic mapping, it uses YOLOv9e for 2D instance segmentation and the VGGT tracking head to aggregate masks into 3D objects with persistent global IDs, as illustrated in Figure 3. Change detection is implemented by assigning timestamps to objects and updating their states (RECENT, RETAINED, or REMOVED) based on visibility checks against depth images, with confidence decay for missing objects. This enables the system to identify environmental changes, such as a seat transitioning from empty to occupied, without requiring full sequence access.
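The change-detection logic above can be illustrated with a small state-machine sketch. The state names (RECENT, RETAINED, REMOVED) and the use of confidence decay come from the paper; the specific thresholds, decay rate, and function signature here are illustrative assumptions.

```python
from dataclasses import dataclass

RECENT, RETAINED, REMOVED = "RECENT", "RETAINED", "REMOVED"

@dataclass
class TrackedObject:
    global_id: int
    last_seen: float       # timestamp of the last confirmed observation
    confidence: float = 1.0
    state: str = RECENT

def update_object(obj, t_now, visible_in_view, observed,
                  recent_window=2.0, decay=0.8, drop_thresh=0.2):
    """One update step of a hedged change-detection sketch.
    visible_in_view: the object's stored 3D location passes the visibility
    check against the current depth image.
    observed: a 2D detection was re-associated with this global ID."""
    if observed:
        # fresh observation: reset confidence and mark as recently seen
        obj.last_seen = t_now
        obj.confidence = 1.0
        obj.state = RECENT
    elif visible_in_view:
        # should be visible but was not detected: decay confidence,
        # and declare the object removed once confidence is too low
        obj.confidence *= decay
        if obj.confidence < drop_thresh:
            obj.state = REMOVED
    elif t_now - obj.last_seen > recent_window and obj.state == RECENT:
        # out of view for a while: keep it, but demote to RETAINED
        obj.state = RETAINED
    return obj
```

A seat transitioning from empty to occupied would appear as one object being marked REMOVED (the empty seat no longer detected where it should be visible) while a new detection receives a fresh global ID in the RECENT state.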
Analysis shows competitive performance on standard benchmarks, with the system achieving an average absolute trajectory error (ATE) of 0.062 meters on TUM RGB-D sequences and 0.072 meters on 7-Scenes, as reported in Tables 2 and 3. Notably, it outperforms VGGT-SLAM on challenging scenes like TUM floor, reducing error from 0.254 meters to 0.063 meters by using shorter block sizes to avoid divergence. Memory usage remains stable at around 17.8GB of VRAM across sequences, as shown in Table 1, enabling long-sequence operation without further optimization. Qualitative results, including videos and point-cloud comparisons in Figure 4, demonstrate the system's ability to build coherent 3D maps with semantic labels in real time, supporting tasks like finding empty seats in indoor environments.
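For reference, ATE figures like those above are conventionally computed as the RMSE of position differences after rigidly aligning the estimated trajectory to ground truth. A minimal sketch of that metric, using the standard closed-form Kabsch/Umeyama alignment (this is the conventional benchmark definition, not code from the paper):

```python
import numpy as np

def absolute_trajectory_error(est, gt):
    """RMSE of translational error after best-fit rigid alignment of the
    estimated trajectory to ground truth. est, gt: (N, 3) position arrays."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # closed-form rotation via SVD of the cross-covariance (Kabsch)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    aligned = (R @ est.T).T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

An ATE of 0.062 meters thus means the estimated camera positions deviate from ground truth by about 6 cm on average (in the RMS sense) after the best possible rigid alignment.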
The context of this research is its potential to enhance assistive navigation by providing users with up-to-date spatial awareness in dynamic settings. By maintaining object identities across occlusions and viewpoint changes, the system can guide users through cluttered areas and alert them to changes, such as moved obstacles or occupied seating. This has practical implications for accessibility technologies, where real-time, low-latency processing is crucial for safety and usability. The framework's efficiency on commodity hardware, as tested on a desktop with an AMD Ryzen Threadripper and NVIDIA RTX 4090, makes it scalable for deployment in real-world scenarios, from public spaces to home environments.
Limitations include gaps in instance tracking when objects are not visible at the start of a block, because the VGGT tracking head initializes from the first frame, potentially leaving objects untracked for up to 0.5 seconds in worst-case scenarios. The paper notes that this has negligible impact on assistive navigation but could be addressed by allowing initialization from multiple frames, though this would increase computational overhead. Additionally, the system primarily targets indoor environments and may face difficulties with highly reflective surfaces or extreme motion blur, as acknowledged in the introduction. Future work, as mentioned in the conclusion, will explore integrating moving dynamic objects into the representation to further improve robustness in real-world applications.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.