AIResearch
Science

AI Models Struggle with Real-World Spatial Understanding

A new survey reveals that despite advances in language and vision, AI systems fail at basic spatial tasks like navigation and object manipulation, highlighting a critical gap in achieving human-like intelligence.

AI Research
November 21, 2025
4 min read

Spatial reasoning, the ability to perceive and manipulate relationships in the 3D world, is a cornerstone of human intelligence, yet it remains a major hurdle for multimodal large language models (MLLMs). These AI systems, which combine language and visual processing, excel at tasks like translation and image recognition but falter when faced with spatial challenges such as navigating a room or predicting how objects move. This gap is not just a technical limitation; it affects real-world applications in robotics, autonomous driving, and augmented reality, where precise spatial understanding is essential for safety and functionality. The survey by researchers at the University of Pittsburgh introduces a new cognitive framework to systematically analyze these shortcomings, offering a clearer path toward developing AI that can interact with the physical world as humans do.

A key finding from the survey is that MLLMs often rely on statistical patterns in data rather than genuine geometric understanding. For example, these models might learn that the phrase 'left of' frequently appears between words like 'cube' and 'circle' in text, but they lack an internal representation of what 'left' means in a 3D space. This leads to failures in tasks requiring dynamic reasoning, such as mental rotation or perspective changes, where humans easily simulate transformations. The researchers categorize spatial tasks into five cognitive functions—like intrinsic versus extrinsic frames of reference and static versus dynamic reasoning—and find that models perform well on simple, static descriptions but struggle with complex, multi-step problems. Benchmarks show that over 70% of evaluations focus on easy, qualitative tasks, masking deeper deficiencies in metric and dynamic reasoning that are crucial for real-world use.
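The contrast the survey draws can be sketched in a few lines of code: a text-pattern "model" that fires on the phrase 'left of' regardless of where the objects actually are, versus an explicit geometric check on coordinates. All function and variable names here are illustrative, not from the paper.

```python
# Statistical-pattern stand-in: "knows" 'left of' only as a string that
# co-occurs with object words, with no grounding in actual positions.
def left_of_textual(sentence: str) -> bool:
    return "left of" in sentence

# Geometric grounding: compare x-coordinates in a shared frame of reference.
def left_of_geometric(a: tuple, b: tuple) -> bool:
    return a[0] < b[0]

# The textual heuristic accepts a false claim; the geometric check does not.
cube, circle = (3.0, 0.0), (1.0, 0.0)  # the cube is actually to the RIGHT
print(left_of_textual("the cube is left of the circle"))  # True (pattern match)
print(left_of_geometric(cube, circle))                    # False (geometry disagrees)
```

The toy example mirrors the survey's point: a model that only tracks word co-occurrence can confidently assert a spatial relation that the underlying geometry contradicts.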

The methodology of the survey involves a novel taxonomy that organizes spatial reasoning based on cognitive dimensions rather than input modalities like text or images. This approach divides tasks into categories such as intrinsic-qualitative-static, which deals with properties of single objects, and extrinsic-qualitative-dynamic, which involves simulating changes in object relationships across a scene. The researchers map existing benchmarks, such as SPARTQA and MindCube, onto this taxonomy, revealing that most datasets emphasize relational questions at low complexity levels. They also review evaluation metrics, noting that traditional measures like accuracy and BLEU scores often fail to capture geometric correctness, leading to an overestimation of model capabilities in spatial tasks.
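The three cognitive dimensions implied by the category names above can be expressed as a small data structure. This is a hedged sketch only: the dimension names follow the labels quoted in the article, and the benchmark placements below are hypothetical examples for illustration, not the paper's actual mapping.

```python
from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    INTRINSIC = "intrinsic"    # properties of a single object
    EXTRINSIC = "extrinsic"    # relations between objects in a scene

class Property(Enum):
    QUALITATIVE = "qualitative"    # e.g. left-of, inside
    QUANTITATIVE = "quantitative"  # e.g. distance, volume

class Reasoning(Enum):
    STATIC = "static"    # a fixed scene description
    DYNAMIC = "dynamic"  # simulated change, e.g. rotation or navigation

@dataclass(frozen=True)
class SpatialTask:
    benchmark: str
    frame: Frame
    prop: Property
    reasoning: Reasoning

    def label(self) -> str:
        return f"{self.frame.value}-{self.prop.value}-{self.reasoning.value}"

# Hypothetical placements, for demonstration only:
tasks = [
    SpatialTask("SPARTQA", Frame.EXTRINSIC, Property.QUALITATIVE, Reasoning.STATIC),
    SpatialTask("MindCube", Frame.EXTRINSIC, Property.QUALITATIVE, Reasoning.DYNAMIC),
]
for t in tasks:
    print(t.benchmark, "->", t.label())
```

Structuring benchmarks this way makes the survey's imbalance finding easy to audit: counting tasks per label would immediately show the concentration in qualitative, static categories.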

Findings from the analysis indicate significant imbalances in how spatial reasoning is assessed. For instance, quantitative tasks are predominantly limited to simple counting, with few benchmarks testing metric properties like distance or volume estimation. In dynamic reasoning, models show poor performance on tasks requiring mental simulation, such as folding a cube or navigating from multiple viewpoints. The survey highlights that efforts to improve spatial reasoning fall into two categories: training-based approaches, which embed spatial knowledge through architectural changes or synthetic data, and inference-based methods, which use techniques like chain-of-thought prompting to guide reasoning. However, both approaches face obstacles, such as error propagation in multi-step inferences and the high cost of generating realistic 3D data, limiting progress toward human-like spatial intelligence.
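A minimal sketch of what an inference-based, chain-of-thought style prompt for a spatial question might look like. The template and step structure are generic illustrations, not the survey's exact method; note that a mistake at any intermediate step (say, misapplying the viewpoint change in step 3) would propagate to the final answer, which is exactly the error-propagation risk the survey flags.

```python
def spatial_cot_prompt(scene: str, question: str) -> str:
    # Build a prompt that asks the model to decompose the spatial
    # problem into explicit intermediate steps before answering.
    return (
        f"Scene: {scene}\n"
        f"Question: {question}\n"
        "Let's reason step by step:\n"
        "1. List each object and its stated position.\n"
        "2. Express all positions in one shared frame of reference.\n"
        "3. Apply any described movements or viewpoint changes in order.\n"
        "4. Answer the question from the final configuration.\n"
    )

prompt = spatial_cot_prompt(
    "A cube sits left of a sphere; the camera then moves behind the scene.",
    "From the new viewpoint, is the cube left or right of the sphere?",
)
print(prompt)
```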

The implications of these findings are profound for everyday technology. In robotics, for example, MLLMs' inability to reason about object positions could lead to errors in tasks like picking up items in a cluttered space. For autonomous vehicles, poor spatial understanding might result in misjudging distances to obstacles, posing safety risks. Augmented reality applications could suffer from inaccurate overlays if models cannot handle perspective changes. The survey suggests that addressing these gaps could enable AI to better assist in navigation, design, and interactive systems, making technologies more reliable and intuitive for users. By focusing on cognitive principles, researchers can develop benchmarks that more accurately reflect real-world demands, driving innovations that bridge the divide between digital and physical intelligence.

Despite these insights, the survey acknowledges limitations in current research. Many benchmarks rely on synthetic data, which may not capture the noise and variability of real-world environments, leading to models that overfit to artificial patterns. Additionally, architectural constraints of transformers, which process information as discrete sequences, hinder the encoding of continuous spatial properties. The researchers note that future work must address these issues by developing richer datasets and novel architectures, such as those incorporating persistent memory for dynamic tasks. This critical analysis underscores that while MLLMs have made strides in language and vision, achieving robust spatial reasoning requires a fundamental shift in how AI models are designed and evaluated.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
