
AI Teaches Robots to See and Act in 3D

A new vision-language model integrates depth sensing to predict precise 3D action points, enabling robots to grasp, navigate, and place objects with human-like spatial understanding.

AI Research
April 01, 2026
4 min read

Robots that can understand and act on spatial instructions, like "grasp the handle" or "place it in front of the mug," have long been a goal in artificial intelligence. However, most existing AI systems rely heavily on 2D images, which lack the depth information crucial for precise 3D interaction. This limitation often forces robots to guess spatial relationships, leading to errors in real-world tasks. A new study addresses this gap by introducing a framework that directly incorporates depth data into a vision-language model, enabling it to predict the exact 3D points where a robot should act and bridging the divide between visual perception and physical execution.

The researchers formalized this capability as "embodied localization," which involves predicting executable 3D points based on visual observations and language instructions. They defined two complementary types of targets: touchable points, which are 3D coordinates on object surfaces for actions like grasping, and air points, which are free-space coordinates for tasks like placement or navigation. This dual approach allows the system to handle a wide range of embodied behaviors, from manipulating objects to moving through environments, all unified under a single framework. The core finding is that by integrating structured depth information, the model significantly improves its ability to reason about 3D space, outperforming existing models that use only RGB images.
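As a rough illustration, both target types can be seen as one unified representation distinguished only by whether the predicted point lies on a surface or in free space. The sketch below is hypothetical (the class and field names are not from the paper); it merely mirrors the dual-target idea:

```python
from dataclasses import dataclass
from enum import Enum

class PointType(Enum):
    TOUCHABLE = "touchable"  # on an object surface (e.g., a graspable handle)
    AIR = "air"              # in free space (e.g., a placement or navigation goal)

@dataclass
class LocalizationTarget:
    """One embodied-localization target: a 3D point in the camera frame.

    Hypothetical structure; the paper unifies both target types under a
    single point-prediction framework, which this sketch mirrors.
    """
    point_type: PointType
    x: float  # meters, camera frame
    y: float
    z: float

# Example: a grasp point on a handle vs. a placement point above a table
grasp = LocalizationTarget(PointType.TOUCHABLE, 0.12, -0.05, 0.48)
place = LocalizationTarget(PointType.AIR, 0.30, 0.10, 0.55)
```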

To achieve this, the team developed SpatialPoint, a spatial-aware vision-language framework. The model takes RGB images and depth maps as parallel inputs. The depth map is encoded into a three-channel representation and processed by a dedicated depth backbone, producing depth tokens that are fused with visual and language features within a multimodal transformer. A key innovation is the use of a two-stage training strategy: first, the depth backbone is trained separately to align with the pre-trained vision-language model, then the entire system is fine-tuned jointly. This careful design ensures the depth modality enhances spatial reasoning without degrading existing capabilities. The model generates camera-frame 3D coordinates in a structured format, such as pixel coordinates and depth values, which can be directly used for robotic control.
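To make that output format concrete: given a predicted pixel coordinate (u, v) and a depth value d, the standard pinhole camera model recovers a camera-frame 3D point. The paper does not publish this exact routine; the sketch below assumes known camera intrinsics (fx, fy, cx, cy) and only illustrates how such a structured output becomes an executable 3D target:

```python
import numpy as np

def unproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project a pixel (u, v) with metric depth into camera-frame XYZ.

    Standard pinhole model; the intrinsics (fx, fy, cx, cy) come from
    camera calibration. Illustrative only, not code from the paper.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    z = depth_m
    return np.array([x, y, z])

# Example: a predicted point at pixel (412, 230) with 0.48 m depth,
# using hypothetical intrinsics for a 640x480 RGB-D camera.
point_cam = unproject(412, 230, 0.48, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
print(point_cam)  # camera-frame coordinates in meters, ready for a grasp planner
```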

Extensive experiments validated the approach. The researchers constructed a large-scale dataset, SpatialPoint-Data, with 2.6 million samples covering both touchable and air points. On the touchable-points evaluation, using the RoboAfford-Eval benchmark, SpatialPoint achieved an overall accuracy of 0.790, outperforming baselines like RoboBrain 2.5 (0.741) and Qwen3-VL-Inst-4B (0.503). It also showed a mean absolute depth error of 17.2 mm, significantly lower than alternatives. For air points, evaluated on SpatialPoint-Bench with 2,445 queries, the model achieved a direction correctness rate of 0.5071 and a metric precision of 0.3347 for points within 5 cm of the target, with a mean error of 6.8 cm. Ablation studies confirmed that incorporating depth was critical: RGB-only variants performed worse, and the dual-backbone design with special depth tokens consistently improved performance.
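For readers who want to interpret the air-point numbers, the metrics described above (the fraction of predictions within 5 cm of the target, plus the mean Euclidean error) are straightforward to compute. The snippet below is a plausible reading of those definitions, not the benchmark's official evaluation code:

```python
import numpy as np

def air_point_metrics(pred: np.ndarray, target: np.ndarray,
                      threshold_m: float = 0.05) -> dict:
    """Compute metric precision and mean error for predicted 3D points.

    pred, target: (N, 3) arrays of camera-frame points in meters.
    A plausible reading of the reported metrics, not the official
    SpatialPoint-Bench evaluation code.
    """
    errors = np.linalg.norm(pred - target, axis=1)  # per-query Euclidean error
    return {
        "metric_precision": float(np.mean(errors <= threshold_m)),  # within 5 cm
        "mean_error_m": float(np.mean(errors)),
    }

# Toy example with three queries
pred = np.array([[0.30, 0.10, 0.55], [0.00, 0.20, 0.80], [0.15, -0.05, 0.60]])
target = np.array([[0.32, 0.10, 0.56], [0.10, 0.20, 0.80], [0.15, -0.05, 0.64]])
print(air_point_metrics(pred, target))
# e.g. {'metric_precision': 0.666..., 'mean_error_m': 0.054...}
```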

The implications for robotics and AI are substantial. This work enables more reliable and generalizable robotic systems that can interpret complex spatial instructions and execute precise actions in diverse environments. For example, in real-robot deployments, SpatialPoint successfully guided a robotic arm to grasp objects at specified locations, place items in target destinations, and navigate a mobile robot to goal positions, all without scene-specific fine-tuning. This demonstrates practical applications in areas like warehouse automation, assistive robotics, and interactive AI assistants, where understanding 3D space is essential for safe and effective operation.

Despite its advancements, the study acknowledges limitations. The model relies on monocular depth estimates, which can be inaccurate in textureless or reflective regions, potentially affecting performance. Additionally, the current focus is on static scenes; extending the approach to dynamic environments or integrating trajectory-level planning is left to future work. The researchers also note that while the dataset is large, it may not cover all possible spatial relations, and further scaling could improve generalization. These limitations highlight ongoing areas for improvement in making AI systems truly adept at embodied spatial reasoning.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn