In the relentless pursuit of artificial intelligence that can truly understand the world, researchers have long sought to build systems capable of spatial supersensing: the ability to form coherent, predictive models of environments from raw video streams. This isn't just about recognizing objects in a frame; it's about accumulating evidence, maintaining state, and tracking entities over time, much like a human navigating a space and remembering where things are. A recent, high-profile effort to benchmark this elusive capability, dubbed Cambrian-S and its VSI-Super tasks, promised to be a rigorous test. It introduced two specific tasks: VSI-Super-Recall (VSR), which asks a model to recall the order in which an object appeared in different locations over videos lasting up to four hours, and VSI-Super-Counting (VSC), which requires counting unique objects across an entire video while avoiding double-counting duplicates. The premise was compelling: success here would demand genuine spatial cognition and memory, moving beyond the shortcut learning that plagues so many AI benchmarks. But a new, critical analysis reveals a sobering truth. These benchmarks, despite their thoughtful design, do not reliably measure spatial supersensing at all. Instead, they can be nearly solved, or their bespoke solutions exposed, by simple baselines that exploit superficial heuristics, bypassing the need for any world modeling or long-horizon reasoning. This finding, detailed in a paper titled "Spatial Supersensing Without Spatial Supersensing," throws cold water on claims of progress and underscores the profound difficulty of evaluating true video understanding.
The investigation hinges on two elegantly simple stress tests. For the VSR benchmark, which is intended to test a model's ability to track an object's trajectory over time, the researchers constructed a baseline called "NoSense" that is deliberately atemporal. It uses a SigLIP vision-language model to independently encode each frame of a video, discarding nearly all temporal structure. It then simply keeps the four frames most similar to the query object and scores the multiple-choice answers by their cosine similarities to auxiliary object prompts. No long-term memory, no object tracking, no 3D representation, no language model. Astonishingly, this trivial baseline nearly perfectly solves VSR, achieving 98.3% accuracy on 10-minute videos and maintaining around 95% even on the marathon 4-hour splits, outperforming the Cambrian-S model by a staggering 55% in absolute terms. The implication is devastating: VSR primarily measures image-level semantic perception and object-context associations, not the intended long-range temporal understanding or spatial integration. A handful of informative frames is all that's needed, rendering the hours of video context largely irrelevant.
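To make the flavor of such an atemporal baseline concrete, here is a minimal NumPy sketch of the retrieve-then-score idea: pick the frames closest to the query, then rank answer options by similarity to those frames. This is not the paper's implementation; the function names (`nosense_answer`, `cosine_sim`) and the use of precomputed embeddings in place of an actual SigLIP encoder are assumptions made purely for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between a vector and each row of a matrix."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def nosense_answer(frame_embs, query_emb, option_embs, top_k=4):
    """Atemporal baseline in the spirit of NoSense.

    frame_embs:  (num_frames, d) per-frame embeddings (e.g. from SigLIP)
    query_emb:   (d,) embedding of the query-object prompt
    option_embs: (num_options, d) embeddings of the answer options

    Keeps the top_k frames most similar to the query, then scores each
    option by its mean cosine similarity to those frames. No temporal
    ordering, tracking, or memory is used anywhere.
    """
    sims = cosine_sim(query_emb, frame_embs)             # (num_frames,)
    top_frames = frame_embs[np.argsort(sims)[-top_k:]]   # (top_k, d)
    scores = [cosine_sim(opt, top_frames).mean() for opt in option_embs]
    return int(np.argmax(scores))
```

Note that nothing here depends on video length: encoding twice as many frames changes only the retrieval pool, which is exactly why such a baseline degrades so little from 10-minute to 4-hour videos.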
For the VSC counting benchmark, the researchers devised an even more revealing sanity check called VSC-Repeat. The task requires a model to count unique objects, like chairs, across a video, meaning it must recognize when the same chair reappears and not count it twice. Cambrian-S proposed a sophisticated, surprise-based event segmentation inference strategy, designed to segment the video at points of high prediction error and count objects within each segment. To test whether this was performing genuine spatial supersensing or just exploiting a dataset quirk, the team created a perturbed version of the benchmark. They took each original VSC video and simply concatenated it with itself 1 to 5 times, creating sequences where the same rooms and objects are revisited repeatedly. Crucially, the ground-truth count of unique objects does not change. A model with true spatial supersensing should recognize the repetitions and output the same count. Cambrian-S's pipeline catastrophically failed this test. Its mean relative accuracy collapsed from 42.0% on the original videos to 3.6% after two repeats, and plummeted to 0% after five repeats. Its predicted counts grew linearly with the number of repeats, effectively counting each revisited segment as containing new, unique objects. This proves the inference pipeline wasn't building a persistent map; it was relying on the shortcut assumption that each segment in the VSC benchmark corresponds to a distinct, never-revisited environment.
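The perturbation, and the shortcut it exposes, can be sketched in a few lines. The helpers below (`vsc_repeat`, `per_segment_count`, `global_count`) are illustrative toys, not the paper's code: each "frame" is represented by the object it shows rather than by real detections, so deduplication reduces to set membership.

```python
def vsc_repeat(frames, n_repeats):
    """VSC-Repeat perturbation: concatenate a video with itself.
    The set of unique objects, and hence the ground-truth count,
    is unchanged; only the footage is repeated."""
    return frames * (n_repeats + 1)

def per_segment_count(frames, segment_len=3):
    """A shortcut counter that assumes every segment shows a brand-new
    room: it sums per-segment unique counts instead of deduplicating
    across the whole video. This is the failure mode VSC-Repeat exposes:
    its answer grows linearly with the number of repeats."""
    total = 0
    for i in range(0, len(frames), segment_len):
        total += len(set(frames[i:i + segment_len]))
    return total

def global_count(frames):
    """What spatial supersensing requires: unique objects across the
    entire video, however often scenes are revisited."""
    return len(set(frames))
```

On a toy video of six frames showing four distinct objects, `global_count` returns 4 no matter how many times the video is repeated, while `per_segment_count` scales with the number of copies, mirroring the linear growth the researchers observed.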
The implications of this analysis are far-reaching for the field of multimodal AI and benchmark design. It reveals a dangerous case of benchmark-model co-adaptation. Cambrian-S developed two different, bespoke inference pipelines, one for recall and one for counting, that inadvertently aligned perfectly with the generative assumptions baked into the VSI-Super dataset construction. High scores were achieved not by advancing spatial supersensing, but by encoding simple rules like "there are four object insertions" (for VSR) or "rooms are unique and non-repeating" (for VSC). This calls into question any claims of progress in video world modeling based solely on performance on these tasks. Furthermore, the success of the NoSense baseline highlights the raw power of modern contrastive vision-language models like SigLIP as perception engines. It suggests that a substantial portion of what passes for advanced video reasoning in current systems might be reducible to sophisticated frame-level retrieval, a far cry from the dynamic, stateful understanding we aspire to.
Looking forward, the study advocates for a more rigorous, meta-evaluative approach to benchmark creation. The authors propose that any benchmark claiming to test long-horizon reasoning must include built-in invariance checks: transformations like repeating scenes, shuffling segments, or changing playback speed that leave the ground-truth answer unchanged but break superficial shortcuts. Performance under these adversarial conditions should be reported alongside standard scores. They also argue for using more natural, long-form video data with realistic revisits and loops, rather than curated clips that implicitly guide model design. The tools introduced here, NoSense and VSC-Repeat, serve as prototypes for this kind of stress testing. The goal is to foster benchmarks that are robust to trivial solutions and truly force the development of capabilities like spatial supersensing. Until then, the field must interpret performance on existing video understanding tasks with extreme caution, recognizing the vast gulf between solving a benchmark and solving the underlying cognitive problem.
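A minimal harness for such built-in invariance checks could look as follows. The function name `invariance_report` and the specific transformations are hypothetical sketches of the idea, not anything the paper specifies; a model here is just any callable from a video (a list of frames) to an answer.

```python
import random

def invariance_report(model, video, transforms):
    """Run a model on a video and on answer-preserving transformations
    of it, reporting the prediction under each. A robust model should
    give (approximately) the same answer in every row."""
    report = {"original": model(video)}
    for name, transform in transforms.items():
        report[name] = model(transform(video))
    return report

# Answer-preserving transformations for a unique-object-counting task:
# the ground-truth count is invariant to all three.
TRANSFORMS = {
    "repeat_2x": lambda v: v * 2,
    "shuffled_segments": lambda v: sum(
        random.Random(0).sample([v[i:i + 3] for i in range(0, len(v), 3)],
                                k=(len(v) + 2) // 3), []),
    "half_speed": lambda v: [f for f in v for _ in (0, 1)],  # each frame twice
}
```

Reporting the whole dictionary, rather than a single score, makes shortcut-driven divergence visible at a glance: a per-segment counter would show its prediction doubling under `repeat_2x` while the ground truth sits still.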
This work, while focused on a specific benchmark family, points to a systemic issue in AI evaluation. The Cambrian-S authors, in a response included in the paper, acknowledge the limitations highlighted and agree on the need for more realistic data and robust tests, framing their work as a first step. The critique is not that spatial supersensing is an unworthy goal, but that claiming to measure it requires benchmarks of a different, more adversarial caliber. As AI systems grow more capable, the benchmarks we use to judge them must become equally sophisticated, resistant to the clever shortcuts that machines, unlike humans, are so adept at finding. The path to true video world models remains long, and this analysis is a crucial reminder that the first step is ensuring we're actually on the right path to begin with.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.