Robotics

AI Maps Indoor Spaces in a Single Walk

A new method uses a person or robot with a head-mounted camera to automatically create 3D maps and register ceiling cameras, overcoming visual ambiguity that stymied previous approaches.

AI Research
March 27, 2026
3 min read

Indoor spaces like schools, stores, and factories often rely on ceiling-mounted cameras for surveillance, virtual reality, and human-computer interaction, but accurately mapping these environments and registering camera positions has been a persistent challenge. Manual methods are inefficient and costly, while automated techniques based on visual localization can fail when scenes share similar textures or structures, creating ambiguity. Researchers from Fujitsu Research have developed a novel solution called YOWO (You Only Walk Once) that jointly maps an indoor scene and registers ceiling-mounted cameras to the layout in a single pass, offering a robust and scalable tool for real-world applications.

The key finding is that YOWO enables a mobile agent—such as a person or robot equipped with a head-mounted RGB-D camera—to walk through an indoor space once while synchronized ceiling cameras capture the agent. This process generates world-coordinate agent trajectories and the scene layout from the ego-camera videos, while the ceiling camera videos provide pseudo-scale agent trajectories and relative camera poses. By correlating all trajectories with timestamps, the ceiling camera poses can be aligned to the world-coordinate scene layout, and a factor graph is then used to jointly optimize ego-camera poses, scene layout, and ceiling camera poses. The result is a unified framework that not only accomplishes both tasks but also enhances their performance compared to separate approaches.
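The core of this alignment step, bringing pseudo-scale ceiling-camera trajectories into the metric world frame via timestamp-matched correspondences, resembles a classic similarity (Sim(3)) alignment. Below is a minimal sketch using the standard Umeyama method to recover scale, rotation, and translation between two timestamp-matched trajectories; it is an illustrative stand-in, not the paper's spatiotemporal rebalanced registration algorithm, and the function name is an assumption.

```python
import numpy as np

def align_trajectories(src, dst):
    """Estimate a similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst (dst ~ s * R @ src + t), via Umeyama."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d          # centered point sets
    cov = xd.T @ xs / len(src)               # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # guard against reflections
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)     # variance of centered src
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given noiseless, timestamp-matched samples, this recovers the transform exactly; in practice the joint factor-graph optimization described above would refine this initial alignment.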

The methodology involves three main processes: ego-camera processing, ceiling camera processing, and collaborative processing. In ego-camera processing, SLAM (Simultaneous Localization and Mapping) is applied to RGB-D videos from the head-mounted camera to construct point clouds and real-scale ego-camera trajectories, with rectification to align the vertical direction. For ceiling camera processing, the cameras are clustered into groups based on co-observations of the mobile agent's keypoints, and relative poses are estimated within each group using incremental structure-from-motion techniques. Collaborative processing then uses a spatiotemporal rebalanced registration algorithm to match the ego-camera and ceiling camera trajectories, initializing the alignment of ceiling camera poses to the scene layout before joint optimization with a custom factor graph.
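The grouping of ceiling cameras by co-observation can be pictured as finding connected components in a graph whose edges link cameras that see the agent's keypoints at the same timestamps. A hypothetical union-find sketch (the function name and `min_shared` threshold are assumptions, not details from the paper):

```python
from collections import defaultdict

def cluster_cameras(obs, min_shared=1):
    """obs: dict camera_id -> set of timestamps at which the agent was seen.
    Cameras sharing at least min_shared timestamps are merged into one group."""
    cams = list(obs)
    parent = {c: c for c in cams}

    def find(c):                      # union-find with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cams):
        for b in cams[i + 1:]:
            if len(obs[a] & obs[b]) >= min_shared:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[ra] = rb   # merge the two groups

    groups = defaultdict(set)
    for c in cams:
        groups[find(c)].add(c)
    return list(groups.values())
```

Within each resulting group, relative poses could then be estimated by incremental structure-from-motion, as the paper describes.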

Experimental results on a new dataset created for this task show that YOWO outperforms state-of-the-art methods. In CMC relative 6-DoF pose estimation, YOWO achieved an average position error of 0.008 meters and rotation error of 0.135 degrees, compared to 0.024 meters and 0.338 degrees for ECCMP and 0.015 meters and 0.216 degrees for OMECC. For scene mapping, YOWO reduced ego-camera ATE RMSE to 0.122 meters and improved layout IoU to 0.927, surpassing BAD-SLAM (0.286 meters, 0.848 IoU) and NICE-SLAM (0.252 meters, 0.873 IoU). In CMC 6-DoF pose registration, YOWO maintained errors within 0.348 meters and 0.762 degrees, while methods like Hloc and Kapture had maximum errors exceeding 10 meters and 50 degrees due to visual ambiguity.
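For readers unfamiliar with the metric quoted above, ATE RMSE (absolute trajectory error, root mean square) is the root-mean-square of per-pose position errors after estimated and ground-truth trajectories have been associated and aligned. A minimal illustration, assuming the trajectories are already associated index-by-index:

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error RMSE between index-associated
    estimated and ground-truth positions, each of shape (N, 3)."""
    err = np.linalg.norm(est - gt, axis=1)   # per-pose position error
    return float(np.sqrt((err ** 2).mean()))
```

A constant 0.1 m offset over the whole trajectory, for instance, yields an ATE RMSE of exactly 0.1 m.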

The implications are significant for practical applications, as YOWO provides a reliable tool for downstream position-aware uses such as indoor localization, multi-sensor data collection (e.g., Wi-Fi signal strength or temperature heatmaps), and 3D human pose estimation. YOWO's ability to handle visual ambiguity through mobile keypoints makes it suitable for environments with low texture or symmetry, and it can be applied offline to facilitate tasks like surveillance and augmented reality. However, the researchers note limitations: YOWO assumes a single agent in the scene, which may restrict use in crowded settings, and performance depends on the agent's path, requiring planning to ensure sufficient overlap and loop closures. Despite this, the framework represents a step forward in automating indoor mapping and camera registration, with potential impact across various industries.

Original Source

Read the complete research paper on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
