DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

TL;DR

A new benchmark exposes how AI systems break down on multi-step spatial reasoning tasks, casting doubt on how well they actually reason.

As artificial intelligence systems become increasingly integrated into scientific research and logical problem-solving, a critical question emerges: can these models truly understand and reason, or are they simply pattern-matching their way through complex tasks? A new study from researchers at The Alan Turing Institute and Imperial College London provides sobering answers through a rigorous testing framework called DecompSR.

The researchers found that current large language models have significant limitations in compositional spatial reasoning, that is, the ability to combine multiple relational steps to solve novel problems. While models performed reasonably well on simple one-step questions, their accuracy dropped sharply as the number of reasoning steps increased. For example, GPT-4o's accuracy fell from 99% on single-hop questions to just 15% on 100-hop problems, which points to brittleness in multi-step reasoning.

The team developed DecompSR, a dataset of 5.2 million procedurally generated spatial reasoning problems, to systematically test different aspects of compositional reasoning. Each problem consists of a story describing spatial relationships between objects, followed by a question requiring multi-step inference. The methodology ensures correctness through automated verification, eliminating ambiguity about whether answers are right or wrong. By controlling variables like reasoning depth, linguistic variation, and the presence of distractors, the researchers could precisely measure where models succeed and fail.
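To make the idea of procedural generation with automated verification concrete, here is a minimal sketch. It is not the DecompSR generator: the relation vocabulary, the sentence template, the object names, and the functions `generate_problem` and `solve` are all assumptions made for illustration. It builds a chain of k relations, shuffles the sentences, and computes the ground-truth answer directly from the structured facts, so the label is correct by construction.

```python
import random

# Hypothetical relation vocabulary (an assumption, not the paper's):
# each phrase maps to a unit offset (dx, dy) on a grid.
DIRECTIONS = {
    "to the left of": (-1, 0),
    "to the right of": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
}


def generate_problem(hops, seed=None):
    """Build a chain of `hops` relations linking hops + 1 named objects."""
    rng = random.Random(seed)
    names = [f"obj{i}" for i in range(hops + 1)]
    story = []  # natural-language sentences that would be shown to a model
    facts = []  # structured (subject, relation, reference) triples
    for i in range(hops):
        relation = rng.choice(list(DIRECTIONS))
        # Read as: names[i + 1] is <relation> names[i].
        facts.append((names[i + 1], relation, names[i]))
        story.append(f"{names[i + 1]} is {relation} {names[i]}.")
    rng.shuffle(story)  # sentence order should not reveal the chain order
    question = f"Where is {names[-1]} relative to {names[0]}?"
    return story, facts, question


def solve(facts, start, target):
    """Return the (dx, dy) offset of `target` from `start`.

    Only follows facts from reference to subject, which is enough for
    the forward chains produced by generate_problem.
    """
    position = {start: (0, 0)}
    pending = list(facts)
    while pending:
        progressed = False
        for fact in list(pending):
            subject, relation, reference = fact
            if reference in position:
                dx, dy = DIRECTIONS[relation]
                x, y = position[reference]
                position[subject] = (x + dx, y + dy)
                pending.remove(fact)
                progressed = True
        if not progressed:
            raise ValueError("facts do not form a chain reachable from start")
    return position[target]
```

Calling `generate_problem(10, seed=0)` yields a 10-hop story, and `solve(facts, "obj0", "obj10")` gives the offset that a verifier could compare against a model's answer. A real benchmark would presumably map that offset to a relation label and add controls such as distractor sentences, which this sketch leaves out.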

Results showed that models struggled particularly with increasing reasoning depth. As shown in Table 4 of the paper, most models experienced sharp performance declines beyond 4-5 reasoning steps. The study also tested systematicity—the ability to apply learned rules to novel situations—by replacing directional words with nonsense terms. Models that performed well with familiar English terms often failed completely when faced with the same logical structure expressed in artificial language, suggesting they rely on surface-level patterns rather than deep understanding.
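The systematicity test described above can be pictured as a vocabulary swap that leaves the logical structure untouched. The following is only a sketch under stated assumptions: the nonsense words in `NONSENSE` and the helper `replace_relations` are invented for illustration, and the summary here does not say which replacement terms the paper uses or how they are introduced to the model.

```python
import re

# Hypothetical nonsense vocabulary (invented for this example).
NONSENSE = {
    "left": "zorp",
    "right": "blick",
    "above": "fep",
    "below": "dax",
}


def replace_relations(text, mapping):
    """Swap each whole-word directional term for its replacement."""
    # Longest keys first so overlapping keys are matched greedily;
    # \b keeps words such as "leftover" from being rewritten.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)
```

For instance, `replace_relations("B is to the left of A.", NONSENSE)` returns `"B is to the zorp of A."`. The inference needed to answer the question is unchanged, so a drop in accuracy after the swap would suggest the model was leaning on the familiar words rather than on the underlying relations.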

These findings matter because many real-world applications, from scientific discovery to legal analysis, require the kind of systematic reasoning that current AI models lack. If models cannot reliably combine known concepts in novel ways, their utility in domains requiring true understanding remains limited. The research suggests that current evaluation methods focusing solely on final-answer correctness may be masking fundamental weaknesses in AI reasoning capabilities.

Some models held up better than others: o4-mini maintained 70% accuracy at 4 hops compared to GPT-4o's 29%. Even so, all models showed a similar pattern of degradation as complexity increased. The researchers note that their findings highlight the need for more sophisticated evaluation methods that probe reasoning processes rather than just outcomes, potentially guiding future AI development toward more robust and generalizable systems.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
