AIResearch
Science

AI Models Struggle to See Through Others' Eyes

New research reveals that advanced AI systems fail at basic perspective-taking tasks, raising concerns about their use in collaborative settings like teaching and therapy.

AI Research
March 26, 2026
4 min read

As artificial intelligence systems increasingly take on roles as teachers, colleagues, and even companions, a new study exposes a critical weakness in their social cognition: they cannot reliably see the world from another's perspective. Researchers from the University of Cambridge and Microsoft Research Asia have found that multimodal language models—AI systems that process both text and images—show pronounced deficits in visuospatial perspective-taking, a fundamental human ability that underpins effective communication and collaboration. This limitation could have serious consequences as these models are deployed in social domains where understanding others' viewpoints is essential.

The study evaluated four frontier models from OpenAI, including reasoning-optimized versions like o3 and o4-mini, using two tasks adapted from human psychology. In the Rotating Figure Task, models had to determine what symbols a person in an image could see and how those symbols appeared from that person's viewpoint, across varying angular disparities from 0 to 180 degrees. The results were stark: while models performed well on basic perceptual controls, they consistently failed at Level 2 perspective-taking, which requires inhibiting one's own perspective to adopt another's. For example, GPT-4o-mini's accuracy on visual questions dropped from 82% when perspectives were aligned to just 7% when they were fully opposite, indicating a systematic inability to transform visual information.

Methodologically, the researchers employed a procedural approach inspired by recent work creating large, controlled experimental batteries through systematic variation of key parameters. They generated thousands of novel stimuli to reduce benchmark contamination and fine-tuned task demands by manipulating perspective level, content type, and viewpoint disparity. The Rotating Figure Task included control conditions to verify baseline perceptual abilities, such as symbol identification and line-of-sight judgments, while test conditions probed Level 1 and Level 2 perspective-taking. The Director Task, a referential communication paradigm, required models to follow instructions from a director with occluded views, testing visual and spatial perspective-taking in a functional context. Both image-based and text-based versions were used to isolate VPT limitations from visual processing deficits.
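The procedural approach described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' actual code: the parameter names and values are assumptions chosen to mirror the factors the study reports manipulating (perspective level, content type, and angular disparity).

```python
import itertools
import random

# Illustrative factor levels (assumed, not taken from the paper's codebase).
LEVELS = ["level1_visibility", "level2_appearance"]
CONTENT = ["visual_symbol", "spatial_location"]
DISPARITIES = [0, 45, 90, 135, 180]  # degrees between model and pictured figure

def generate_battery(n_variants_per_cell=10, seed=0):
    """Cross every (level, content, disparity) cell, then add random surface
    variation within each cell so no two stimuli are identical -- the kind of
    systematic generation that reduces benchmark contamination."""
    rng = random.Random(seed)
    battery = []
    for level, content, disparity in itertools.product(LEVELS, CONTENT, DISPARITIES):
        for _ in range(n_variants_per_cell):
            battery.append({
                "level": level,
                "content": content,
                "disparity_deg": disparity,
                # Asymmetric glyphs look different when rotated, so Level 2
                # questions have a well-defined answer.
                "symbol": rng.choice("FJPRG"),
            })
    return battery

battery = generate_battery()
print(len(battery))  # 2 levels x 2 content types x 5 disparities x 10 variants = 200
```

Because every cell of the design is enumerated rather than sampled, accuracy can later be broken down cleanly by disparity or perspective level, which is what makes the failure profiles in the results interpretable.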

The data reveals distinct failure patterns across models. In the Rotating Figure Task, Level 1 perspective-taking—judging whether a person can see something—showed model-specific issues, such as right-facing spatial blind spots in GPT-4o and o4-mini. Level 2 perspective-taking—judging how something appears or where it is located from another's viewpoint—exposed more severe impairments. Reasoning models like o3 and o4-mini exhibited M-shaped accuracy profiles, succeeding only at fully shared or fully opposite perspectives but failing at intermediate rotations, suggesting reliance on simple mirroring heuristics rather than genuine mental rotation. In the Director Task, all models struggled when visual and spatial perspective-taking demands combined, with accuracy often dropping to floor levels. For instance, on spatial-different trials from the director's point of view, GPT-4o-mini's accuracy fell to 1%, while o3 managed 77% but still showed significant declines.
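The mirroring-heuristic interpretation can be made concrete with a small simulation. The sketch below is my own illustration, not an analysis from the paper: it compares the true viewpoint transformation (a 2D rotation of egocentric coordinates) against a shortcut that merely copies or flips one's own left/right judgment, and shows that the shortcut agrees with the correct answer only when perspectives are fully shared or fully opposite.

```python
import math
import random

def true_side(x, y, disparity_deg):
    """Side ('left'/'right') of a tabletop point in the other viewer's frame,
    obtained by genuinely rotating the model's egocentric coordinates."""
    t = math.radians(-disparity_deg)
    x_rot = x * math.cos(t) - y * math.sin(t)
    return "left" if x_rot < 0 else "right"

def mirror_heuristic(x, y, disparity_deg):
    """Shortcut with no mental rotation: copy your own view for small
    disparities, flip left/right for large ones."""
    own = "left" if x < 0 else "right"
    if disparity_deg >= 90:
        return "right" if own == "left" else "left"
    return own

rng = random.Random(1)
for d in [0, 45, 90, 135, 180]:
    pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(2000)]
    acc = sum(true_side(x, y, d) == mirror_heuristic(x, y, d) for x, y in pts) / len(pts)
    print(f"{d:3d} deg: {acc:.0%}")
```

The heuristic is exactly right at 0 and 180 degrees and degrades at intermediate rotations, matching the reported pattern of success only at fully shared or fully opposite perspectives.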

These findings have immediate implications for the real-world deployment of AI in collaborative environments. Perspective-taking is crucial for referential communication, spatial coordination, and joint action—abilities needed in settings like instruction-following, embodied reasoning, and therapeutic interactions. Models that cannot flexibly represent what others see may struggle in tasks requiring shared understanding, potentially leading to misunderstandings or failures in sensitive applications. The study suggests that current benchmarks, which often rely on text-based vignettes, may overestimate AI social cognition, highlighting the need for more rigorous, visuospatial evaluations to predict safe and reliable use.

However, the research also acknowledges limitations. The study focused on a specific set of models and tasks, and performance may vary with different architectures or training data. The tasks, while controlled, may not capture all nuances of real-world social interactions, and the models' failures could stem from architectural constraints rather than a lack of social understanding per se. Additionally, the study did not explore whether these limitations can be mitigated through further training or algorithmic improvements. Future work could investigate the neural correlates of perspective-taking in AI systems or develop new benchmarks that better simulate dynamic social scenarios.

Overall, this research provides a precise characterization of AI socio-cognitive capacities, showing that despite advances in multimodal processing, fundamental perspective-taking abilities remain lacking. As AI systems become embedded in human social and physical environments, rigorous assessment of these foundational skills will be essential for ensuring their safe and effective integration into our daily lives.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn