Artificial intelligence systems are gaining a new level of spatial intelligence, allowing them to understand and reason about physical environments with unprecedented precision. Researchers have developed a framework called Video2Layout that enables AI models to construct detailed, metric-grounded maps from video footage, moving beyond vague descriptions to exact numerical coordinates. This advancement addresses a core limitation in current multimodal large language models, which often struggle with tasks requiring fine-grained spatial perception, such as determining exact distances between objects or their precise orientations. By translating visual scenes into a formal coordinate system, Video2Layout provides a verifiable foundation for spatial computation, reducing the ambiguity inherent in natural language descriptions and enhancing the model's ability to navigate and interpret complex spaces.
The key finding from this research is that AI models can generate accurate cognitive maps using continuous object boundary coordinates, which quantify inter-object distances and object sizes. The V2LO-7B model, developed through this framework, achieved an average improvement of 4.92% over models trained on traditional grid-based maps across multiple spatial reasoning benchmarks. Specifically, it reached an overall accuracy of 47.46% on combined benchmarks, outperforming GPT-4o's 46.25% and Gemini 2.0 Flash's 43.29%. In tasks like perspective-taking on the OmniSpatial-Bench, it scored 46.70%, surpassing other open-source models such as InternVL2.5-8B (41.14%) and SpaceR-7B (41.35%). This demonstrates that metric-grounded maps significantly enhance spatial reasoning by providing exact positional data, enabling more reliable calculations for tasks like measuring distances or determining relative directions.
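To make the metric-grounded representation concrete, here is a minimal Python sketch. It is not code from the paper: the class name, field names, and meter units are illustrative assumptions. Each object carries continuous boundary coordinates on a bird's-eye-view plane, so object size and inter-object distance reduce to simple arithmetic rather than coarse grid-cell lookups:

```python
# Illustrative sketch, not the paper's code: one entry in a
# metric-grounded cognitive map stores continuous boundary coordinates
# on a bird's-eye-view plane (units assumed to be meters).
from dataclasses import dataclass
import math


@dataclass
class MapObject:
    name: str
    x_min: float  # boundary coordinates (hypothetical schema)
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    @property
    def size(self) -> tuple[float, float]:
        # Width and depth fall directly out of the boundary coordinates.
        return (self.x_max - self.x_min, self.y_max - self.y_min)


def center_distance(a: MapObject, b: MapObject) -> float:
    """Euclidean distance between two objects' centers."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(ax - bx, ay - by)


sofa = MapObject("sofa", 0.0, 0.0, 2.1, 0.9)
table = MapObject("table", 3.0, 1.5, 3.8, 2.3)
print(sofa.size)                               # (2.1, 0.9)
print(round(center_distance(sofa, table), 2))  # 2.76
```

Because the boundaries are continuous values rather than grid cells, the same map can answer any downstream metric query (nearest object, relative direction, size comparison) without re-quantizing the scene.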
The training methodology involves a two-stage process built on a high-quality dataset called V2LO-28K, constructed from both simulated and real-world sources. In the first stage, supervised fine-tuning uses data from the AI2THOR simulator to teach the model to map visual inputs to precise boundary coordinates, learning to generate a bird's-eye-view representation with objects placed in a Cartesian coordinate system. The second stage employs reinforcement fine-tuning with the GRPO algorithm on real-world data from the ScanNet dataset, bridging the gap between simulation and actual environments to improve generalization. The framework also introduces a structured chain-of-thought process: the model first creates a map module with object coordinates, then uses a think module for mathematical deductions such as computing Euclidean distances or vector operations, and finally outputs an answer module with the result, as sketched below.
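One appealing property of this structured output is that it can be checked mechanically. The following hedged sketch mimics the map, think, and answer flow; the tag names and map syntax are assumptions for illustration and may differ from the paper's exact format:

```python
# Minimal sketch of the map -> think -> answer flow. Tag names and the
# map syntax are illustrative assumptions, not the paper's format.
import math
import re

model_output = """
<map>chair: (1.0, 2.0, 1.5, 2.6); lamp: (4.0, 5.0, 4.3, 5.4)</map>
<think>centers (1.25, 2.3) and (4.15, 5.2); dist = sqrt(2.9^2 + 2.9^2) = 4.10</think>
<answer>4.10</answer>
"""


def parse_map(text: str) -> dict[str, tuple[float, ...]]:
    """Extract object boundary coordinates from the <map> module."""
    body = re.search(r"<map>(.*?)</map>", text, re.S).group(1)
    objects = {}
    for entry in body.split(";"):
        name, coords = entry.split(":")
        objects[name.strip()] = tuple(
            float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", coords)
        )
    return objects


# Because the map is explicit, the <think> step's arithmetic can be
# re-run and verified outside the model.
objs = parse_map(model_output)
(x0, y0, x1, y1), (u0, v0, u1, v1) = objs["chair"], objs["lamp"]
cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
lx, ly = (u0 + u1) / 2, (v0 + v1) / 2
print(f"verified distance: {math.hypot(lx - cx, ly - cy):.2f}")  # 4.10
```

The point of the design is that each stage leaves an auditable trace: if the final answer is wrong, an external checker can tell whether the map or the arithmetic was at fault.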
Results from the QVS-Bench diagnostic benchmark show that the accuracy of cognitive map construction is influenced by the number of input images, with optimal performance achieved using 4 to 8 frames. For instance, overall cognitive map accuracy peaked at 61.32% with 4 images, compared to 60.05% with 1 image, but declined to 56.28% with 16 images due to increased complexity and potential confusion. Task-specific performance varied: in minimum distance tasks, accuracy improved from 27.10% with 1 image to 36.00% with 12 images, while object counting accuracy dropped from 66.80% to 41.40% as image quantity increased, likely because redundant views complicated identification. Ablation studies confirmed the superiority of metric-grounded maps over grid-based alternatives, with V2LO-7B scoring 51.52% overall versus 46.60% for a 20x20 grid map, and structured numerical computation outperforming general text-based reasoning by 7.9% in overall scores.
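A practical takeaway from these results is to subsample frames rather than feed a model every view. The sketch below is a generic uniform-sampling utility written under that assumption, not code from the paper:

```python
# Generic utility sketch (not from the paper), motivated by the finding
# that a small frame budget (the 4 to 8 range) performed best overall.
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Evenly spaced frame indices covering the whole clip."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each segment so samples span the full video.
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(300, 8))
# [18, 56, 93, 131, 168, 206, 243, 281]
```

Since different tasks peaked at different frame counts (distance estimation benefited from more views, counting from fewer), the budget is best treated as a tunable parameter rather than a fixed constant.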
The implications of this research extend to applications where precise spatial understanding is critical, such as robotics navigation, augmented reality systems, and autonomous vehicles. By enabling AI to reason about spaces with quantitative accuracy, the framework could improve tasks like indoor navigation for assistive devices, layout planning in virtual environments, or real-time object tracking in dynamic scenes. The structured approach also makes the AI's reasoning process more transparent and verifiable, since it outputs explicit coordinates and mathematical steps, which could enhance trust and reliability in safety-sensitive domains. Furthermore, the benchmark and dataset developed here provide tools for future research to explore how visual input quantity affects AI performance, guiding the design of more efficient training protocols.
Limitations of the study include the model's performance degradation with excessive image inputs, as noted on QVS-Bench where accuracy decreased beyond 8 frames, suggesting limits in processing highly complex scenes. The training data, while comprehensive, relies heavily on simulated environments from AI2THOR, and although reinforcement fine-tuning addresses some generalization issues, real-world applicability may still be constrained by dataset diversity and noise. The research also highlights that different spatial tasks respond variably to input quantity, indicating that no one-size-fits-all approach exists for optimizing image use across all reasoning scenarios. Future work could focus on adapting the framework to more diverse real-world datasets and exploring ways to maintain accuracy with higher image counts through improved architectural designs.