
AI Models Struggle with Childlike Reasoning Tasks

A new benchmark reveals that even the most advanced AI systems falter at tasks requiring multiple cognitive skills, highlighting a significant gap toward human-like intelligence.

AI Research
March 27, 2026
3 min read

A new study from ShanghaiTech University introduces a benchmark that tests AI models using tasks inspired by children's intelligence tests, uncovering surprising weaknesses in current systems. The research, detailed in a paper accepted at ICLR 2026, evaluates multimodal large language models (MLLMs) on abilities like memory, planning, and visual reasoning through a suite of 12 interactive games. This approach reveals that while top models excel at simple tasks, they struggle when multiple skills are required simultaneously, pointing to limitations in achieving human-like cognitive flexibility.

The researchers found that closed-source models like OpenAI's o3 and GPT-5, along with Google's Gemini-2.5-Pro, achieved near-perfect scores on specific tasks but showed significant drops in performance on more complex settings. For example, in the Classification task, which tests execution ability by placing items into designated baskets, o3 scored 1.00 at the easiest level but dropped to 0.92 at the hardest level. In contrast, open-source models generally performed worse, with QwenVL-2.5 (7B) scoring only 0.23 at the easiest level of the same task. The study highlights that models are particularly weak in tasks requiring perception reasoning and planning, with even the best models scoring low in these dimensions on a capability radar chart.

The benchmark, named KidGym, is built on a 2D grid-based environment using the Gym API, allowing customizable scenarios with randomized layouts to prevent memorization. It assesses five core capabilities: Execution, Perception Reasoning, Memory, Learning, and Planning, inspired by the Wechsler Intelligence Scale for Children. Each of the 12 tasks targets one or two of these abilities, with three difficulty levels (L1 to L3) to test model limits. For instance, the Maze task evaluates planning by requiring models to navigate through locked doors using keys, while the Memory Maze adds a memory component by hiding the location of the goal diamond after an initial view. The researchers designed mechanisms like a backpack and a hint bar to aid models, but still observed failures in integrating multiple types of information.
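To make the setup more concrete, below is a minimal sketch of what a KidGym-style task could look like behind the Gym API. It is an illustrative assumption rather than the paper's released code: the class name GridClassificationEnv, the 6x6 grid, and the way difficulty adds item categories are invented for the example; only the reset/step interface and the idea of randomized layouts follow the description above.

    import random
    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class GridClassificationEnv(gym.Env):
        """Toy 'Classification'-style task: sort items into matching baskets on a grid.
        Hypothetical sketch, not the benchmark's actual implementation."""

        def __init__(self, difficulty: int = 1):
            super().__init__()
            self.n_categories = 2 + difficulty      # assumed scaling: harder levels add categories
            self.grid_size = 6
            self.action_space = spaces.Discrete(6)  # up, down, left, right, pick up, drop
            self.observation_space = spaces.Box(
                0, self.n_categories, shape=(self.grid_size, self.grid_size))

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            # Randomize item and basket positions so layouts cannot be memorized.
            cells = [(r, c) for r in range(self.grid_size) for c in range(self.grid_size)]
            random.Random(seed).shuffle(cells)
            self.items = {cells.pop(): cat for cat in range(self.n_categories)}
            self.baskets = {cells.pop(): cat for cat in range(self.n_categories)}
            return self._render_grid(), {}

        def step(self, action: int):
            # Movement, pick-up, and drop logic is elided in this sketch.
            terminated = len(self.items) == 0        # all items sorted into baskets
            reward = 1.0 if terminated else 0.0
            return self._render_grid(), reward, terminated, False, {}

        def _render_grid(self):
            # Encode item and basket categories as numbers on an otherwise empty grid.
            grid = np.zeros((self.grid_size, self.grid_size))
            for (r, c), cat in {**self.items, **self.baskets}.items():
                grid[r, c] = cat + 1
            return grid

Keeping every task behind the same reset/step interface is what lets a benchmark like this plug different models and difficulty levels into one evaluation loop.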

Evaluations of nine state-of-the-art MLLMs show that models face three main weaknesses. First, they struggle with reasoning over non-semantic visual information, as seen in the Puzzle task, where the highest success rate was only 0.30 for GPT-5 at L1, barely above random chance. Second, models are insensitive to item quantity, with the Counting task revealing that even Gemini-2.5-Pro achieved only 0.72 at L1, while humans scored 1.00. Third, composite tasks that require multiple abilities, like Memory Maze, saw success rates drop significantly compared to their single-capability versions. The study also found that reasoning strategies like chain-of-thought improved performance for some models but not others, and that in-context learning sometimes underperformed zero-shot prompting due to overfitting to the provided examples.
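As a rough illustration of how such prompting strategies might be compared, the sketch below rolls out a single episode against the toy environment above. The prompt strings, the choose_action callable, and the random baseline are assumptions made for the example, not the paper's actual evaluation harness.

    ZERO_SHOT = "Here is the current game board. Reply with your next action."
    CHAIN_OF_THOUGHT = ("Here is the current game board. Reason step by step about "
                        "the board state, then give your next action on the last line.")

    def run_episode(env, choose_action, prompt, max_steps=50):
        """Roll out one game; choose_action wraps whatever MLLM is being tested."""
        obs, _ = env.reset()
        for _ in range(max_steps):
            action = choose_action(prompt, obs)        # model call supplied by the caller
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                return reward                          # 1.0 on success in this sketch
        return 0.0                                     # ran out of steps

    # A random-action baseline standing in for a real model call.
    env = GridClassificationEnv(difficulty=1)
    print(run_episode(env, lambda prompt, obs: env.action_space.sample(), ZERO_SHOT))

Running the same loop once with ZERO_SHOT and once with CHAIN_OF_THOUGHT, over many randomized layouts, is the kind of controlled comparison the findings above describe.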

These findings are significant for the development of artificial general intelligence (AGI), as they indicate that current MLLMs lack the integrated cognitive skills needed for real-world applications. The researchers note that while models have advanced in areas like learning and memory, gaps in perception reasoning and planning hinder their ability to handle dynamic, multi-step problems. The benchmark provides a tool for the AI community to track progress and address these weaknesses, with potential applications in robotics, education, and interactive systems where human-like reasoning is crucial.

Limitations of the study include the relatively small number of tasks, though the framework is extensible and new tasks can be added. The paper also acknowledges that the benchmark does not fully replicate human cognitive tests because of differences in embodiment and interaction modalities, but it offers a principled way to profile MLLM abilities. Future work could explore more complex scenarios or integrate additional modalities to further narrow the gap between AI and human intelligence.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn