
AI Research
March 26, 2026
4 min read
POMA-3D: Bridging the 2D-3D Gap with Point Maps for Next-Generation Scene Understanding

In the rapidly evolving landscape of artificial intelligence, the leap from 2D image understanding to comprehensive 3D scene comprehension has remained a formidable challenge. While models like CLIP have revolutionized how machines interpret flat images by aligning vision with language, translating that success into the three-dimensional world has been hampered by fundamental data and representation mismatches. Now, a groundbreaking new approach from researchers at Imperial College London promises to finally bridge that gap. Introducing POMA-3D, the first self-supervised 3D representation model learned from point maps, a novel intermediate modality that preserves explicit 3D geometry while maintaining compatibility with existing 2D foundation models. This innovation could fundamentally reshape how AI systems perceive and interact with physical spaces, from augmented reality applications to embodied agents navigating complex environments.

The core breakthrough of POMA-3D lies in its strategic use of point maps as a bridging technology between the 2D and 3D domains. Unlike point clouds, which abandon the image grid entirely, or depth maps, which keep the grid but store only per-pixel distance rather than full 3D coordinates, point maps encode pixel-to-3D correspondences in a structured 2D grid. This means they preserve the global 3D geometry of scenes while presenting data in a format that 2D foundation models can readily process. The researchers constructed a massive new dataset called ScenePoint to enable this approach, comprising 6.5K room-level indoor scenes from real-world RGB-D datasets such as ScanNet, 3RScan, and ARKitScenes, plus an additional 1 million single-view point maps generated from 2D image-caption datasets using the VGGT model. Each scene comes paired with LLM-generated descriptions at both the view and scene levels, creating a rich training corpus for vision-language alignment.
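To make the point-map idea concrete, here is a minimal NumPy sketch of how a single view's point map can be built from a depth image and camera intrinsics under a standard pinhole model. The function name and intrinsics are illustrative assumptions, not the paper's pipeline; POMA-3D obtains its point maps from RGB-D scans and, for 2D image-caption data, from the VGGT model rather than a routine like this.

```python
import numpy as np

def depth_to_point_map(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an H x W depth image into an H x W x 3 point map.

    Each pixel stores its 3D coordinate, so the result keeps explicit
    geometry while remaining a regular 2D grid that image backbones accept.
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel coordinate grids (u along width, v along height).
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    return np.stack([x, y, z], axis=-1)  # shape (H, W, 3)

# Example with made-up intrinsics and a random 480 x 640 depth frame.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.random.uniform(0.5, 5.0, size=(480, 640))
print(depth_to_point_map(depth, K).shape)  # (480, 640, 3)
```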

Methodologically, POMA-3D employs a two-stage pretraining approach that leverages both 2D and 3D data sources. The first warm-up stage performs vision-language alignment on all image-derived single-view point maps, allowing the model to inherit rich semantic priors from 2D foundation models. The main stage then jointly optimizes two key objectives: view-to-scene vision-language alignment and a novel Point Map Joint Embedding-Predictive Architecture (POMA-JEPA). The alignment objective encourages the model to learn CLIP-aligned point map embeddings by matching them with corresponding image and text representations, while POMA-JEPA enforces geometric consistency across multiple viewpoints of the same scene, a critical capability for robust 3D understanding. Unlike traditional JEPAs that assume a fixed spatial order, POMA-JEPA uses the Chamfer distance to handle the order-agnostic nature of 3D point sets, preventing the mode collapse that plagued earlier approaches.
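The order-agnostic objective is easiest to see in code. The PyTorch sketch below computes a symmetric Chamfer distance between two batches of embedding sets, matching each predicted vector to its nearest counterpart instead of assuming a fixed ordering. The tensor shapes and the surrounding predictor/target-encoder setup are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def chamfer_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two sets of feature vectors.

    pred, target: (B, N, D) batches of N D-dimensional embeddings. Unlike
    an MSE over fixed positions, Chamfer matches each vector to its nearest
    counterpart, so the loss is invariant to the ordering within each set.
    """
    # Pairwise squared distances between the two sets: (B, N, N).
    d = torch.cdist(pred, target, p=2) ** 2

    # Nearest-neighbour cost in each direction, averaged over the batch.
    loss_pt = d.min(dim=2).values.mean()  # pred -> target
    loss_tp = d.min(dim=1).values.mean()  # target -> pred
    return loss_pt + loss_tp

# Example: predictor output vs. target-encoder output for 64 tokens.
pred = torch.randn(8, 64, 256)
target = torch.randn(8, 64, 256)
print(chamfer_loss(pred, target))
```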

The experimental results demonstrate POMA-3D's remarkable versatility across diverse 3D understanding tasks. When evaluated on 3D question answering benchmarks including ScanQA, SQA3D, and Hypo3D, POMA-3D consistently outperformed existing state-of-the-art methods. On SQA3D, which tests situated reasoning in 3D scenes, POMA-3D achieved 51.1% EM@1 compared to SceneVerse's 48.5%, a significant 2.6 percentage point improvement. Even more impressively, the model excelled at embodied navigation tasks on the MSNN dataset, achieving the highest performance in both the four-directional (21.2%) and eight-directional (36.9%) navigation settings. Perhaps most striking is POMA-3D's performance in scene retrieval, where it dramatically outperformed all comparison methods across the ScanRefer, Nr3D, and Sr3D datasets, with recall metrics sometimes doubling or tripling those of previous approaches.

The implications of this research extend far beyond benchmark performance. By successfully transferring 2D priors into 3D understanding, POMA-3D addresses one of the most persistent challenges in 3D AI: data scarcity. The ability to leverage massive 2D image-caption datasets for 3D learning represents a paradigm shift that could accelerate progress across the field. Furthermore, the model demonstrates strong zero-shot generalization, accurately performing embodied localization: retrieving point map views that match textual descriptions of an agent's situation without task-specific training. This suggests POMA-3D has learned genuinely transferable spatial representations rather than merely memorizing training patterns. The framework's compatibility with both specialist models (such as those fine-tuned for specific tasks) and generalist models (such as LLM-integrated systems) makes it particularly promising for real-world deployment.
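As a rough illustration of how such zero-shot embodied localization could work at inference time, the sketch below ranks candidate point-map view embeddings against a CLIP-aligned text embedding by cosine similarity. The encoder calls are omitted, and all names and shapes here are placeholders rather than POMA-3D's actual interfaces.

```python
import torch
import torch.nn.functional as F

def retrieve_views(text_emb: torch.Tensor, view_embs: torch.Tensor, k: int = 5):
    """Rank point-map views against a text query by cosine similarity.

    text_emb: (D,) CLIP-aligned embedding of the situation description.
    view_embs: (N, D) embeddings of the candidate point-map views.
    Returns the indices of the top-k matching views.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    view_embs = F.normalize(view_embs, dim=-1)
    scores = view_embs @ text_emb  # (N,) cosine similarities
    return scores.topk(k).indices

# Example with random placeholder embeddings.
text_emb = torch.randn(512)
view_embs = torch.randn(1000, 512)
print(retrieve_views(text_emb, view_embs, k=3))
```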

Despite these impressive achievements, the researchers acknowledge several limitations that point to future research directions. The current evaluation within LLM-based models used only lightweight LoRA fine-tuning due to computational constraints, leaving open the question of how POMA-3D would perform as the backbone of a scratch-trained 3D LLM. Additionally, since point maps in this implementation contain only coordinate information without color features, effective fusion with visual appearance data remains unexplored—a potentially crucial enhancement for more holistic scene understanding. The model's current focus on indoor scenes also suggests opportunities for extension to outdoor environments and more diverse spatial contexts. Nevertheless, POMA-3D represents a substantial leap forward in making 3D understanding more scalable and accessible, potentially accelerating progress toward AI systems that can truly comprehend and navigate our three-dimensional world.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn