AI Fails at Visual Math: What the New Benchmark Found

TL;DR

A new benchmark, CombiGraph-Vis, shows that AI models score lower on image-based math problems than on text-only ones, exposing weak visual reasoning and a sensitivity to misleading answer choices.

Artificial intelligence systems are increasingly used in scientific and educational settings, but their ability to handle complex visual reasoning remains limited. A new study introduces CombiGraph-Vis, a benchmark of 1,135 discrete mathematics problems that tests AI models on tasks requiring an understanding of graphs, grids, and combinatorial objects. The results matter because they expose specific weaknesses that could affect automated tutoring, data analysis, and scientific research, where visual and logical reasoning are both essential.

The paper's key finding is that model accuracy drops sharply when problems include images. About 35% of the problems in the dataset are tagged as containing images, and accuracy on these items is substantially lower than on text-only questions. For example, top-tier models averaged around 75–78% accuracy across all problems, but accuracy fell by 14–16 percentage points on image-based tasks, as shown in the paper's performance tables. This suggests that parsing and reasoning over visual elements such as diagrams and charts remain major bottlenecks for current AI systems.
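To make the comparison above concrete, here is a minimal Python sketch of how an accuracy gap between image-tagged and text-only problems could be computed. The `Result` record, its field names, and the toy data are all hypothetical and are not taken from the paper. The sketch also assumes the gap is measured against text-only accuracy, which may differ from how the paper defines the drop.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One graded model answer; the field names are hypothetical."""
    problem_id: str
    has_image: bool
    correct: bool


def accuracy(results):
    """Fraction of correct answers, or None for an empty subset."""
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def image_gap(results):
    """Return overall, image, and text-only accuracy plus the gap in percentage points."""
    image = [r for r in results if r.has_image]
    text = [r for r in results if not r.has_image]
    acc_img, acc_txt = accuracy(image), accuracy(text)
    gap = None
    if acc_img is not None and acc_txt is not None:
        # Assumption: the gap is text-only accuracy minus image accuracy.
        gap = (acc_txt - acc_img) * 100
    return {
        "overall": accuracy(results),
        "image": acc_img,
        "text_only": acc_txt,
        "gap_points": gap,
    }


# Toy data, invented for illustration; not results from the paper.
toy = [
    Result("p1", True, False),
    Result("p2", True, True),
    Result("p3", False, True),
    Result("p4", False, True),
]
summary = image_gap(toy)
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable with the figures quoted above.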

Researchers built CombiGraph-Vis by collecting problems from Iranian National Olympiad in Informatics competitions, covering domains such as combinatorics, graph theory, and number theory. The dataset includes short-answer, multiple-choice, and yes/no formats, and each problem was verified through a two-phase agentic workflow to ensure accuracy and consistency. In the first phase, automated critics checked for typos, logical soundness, and answer matches. In the second phase, the flagged issues, such as parsing errors or image-understanding problems, were resolved. This careful curation helps keep the benchmark reliable for evaluating AI capabilities.

Analysis of the results, detailed in the paper's figures and tables, reveals that models not only struggle with images but are also vulnerable to distractors in multiple-choice questions. In standalone multiple-choice problems, models often selected an answer that matched one of the listed choices but was incorrect, showing a gap between recognizing a plausible option and deriving the right solution. This susceptibility to traps suggests that models may rely on superficial patterns rather than deep reasoning, which could limit their effectiveness in high-stakes settings such as academic competitions or real-world problem-solving.
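The distinction drawn above, between a wrong answer that lands on a listed choice and one that matches no choice at all, can be sketched in a few lines of Python. This is only an illustration: the `classify_answer` function, its labels, and the toy responses are hypothetical and do not reproduce the paper's evaluation code.

```python
from collections import Counter


def _norm(text):
    """Normalize an answer string for comparison (an assumed, simple rule)."""
    return text.strip().lower()


def classify_answer(model_answer, choices, correct_answer):
    """Label one multiple-choice response.

    'correct'    : matches the answer key
    'distractor' : matches a listed choice that is not the key
    'off_list'   : matches none of the listed choices
    """
    answer = _norm(model_answer)
    if answer == _norm(correct_answer):
        return "correct"
    if answer in {_norm(c) for c in choices}:
        return "distractor"
    return "off_list"


# Toy responses, invented for illustration; not results from the paper.
options = ["10", "12", "15", "20"]
responses = [
    ("12", options, "12"),    # right answer
    ("15", options, "12"),    # wrong, but one of the listed choices
    ("13", options, "12"),    # wrong, and not among the choices
    (" 20 ", options, "12"),  # wrong listed choice, with stray whitespace
]
counts = Counter(classify_answer(a, c, k) for a, c, k in responses)
distractor_rate = counts["distractor"] / len(responses)
```

A high share of "distractor" labels relative to "off_list" labels would be consistent with a model being pulled toward plausible-looking options rather than deriving the answer independently.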

These findings matter for educators, developers, and policymakers. In education, AI tools that assist with math learning need better visual comprehension to avoid misleading students. For AI development, the benchmark provides a target for improving multimodal reasoning, potentially leading to more robust systems in fields such as robotics and data science. The study has limitations, however: it focuses on discrete math problems from specific competitions, which may not cover all types of visual reasoning. Future work could extend the benchmark to other domains to assess broader AI capabilities and address these gaps.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
