
AI Struggles with Visual Math Problems

A new benchmark reveals that artificial intelligence models perform worse on image-based math questions, highlighting gaps in visual reasoning and susceptibility to tricky multiple-choice options.

AI Research
November 05, 2025

Artificial intelligence systems are increasingly used in scientific and educational settings, but their ability to handle complex visual reasoning remains limited. A new study introduces CombiGraph-Vis, a benchmark of 1,135 discrete mathematics problems designed to test AI models on tasks that require understanding graphs, grids, and combinatorial objects. The benchmark matters because it exposes specific weaknesses that could affect automated tutoring, data analysis, and scientific research, where visual and logical reasoning are both essential.

The paper's key finding is that AI models perform markedly worse when problems include images. Across the dataset, 35% of problems are tagged with images, and accuracy on these items is substantially lower than on text-only questions. For example, top-tier models averaged around 75–78% accuracy across all problems, but this dropped by 14–16 percentage points on image-based tasks, as shown in the paper's performance tables. This suggests that parsing and reasoning over visual elements such as diagrams and charts is a major bottleneck for current AI systems.
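To make the comparison concrete, here is a minimal Python sketch of how accuracy can be split by image tag and turned into a percentage-point gap. The `Result` record, the function names, and the toy numbers are all assumptions made for illustration; they are not the paper's data or evaluation code, and the paper may measure its reported drop against a different baseline than the text-only subset used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical record layout)."""
    has_image: bool
    correct: bool


def accuracy(results):
    """Fraction of correct answers; raises on an empty subset."""
    if not results:
        raise ValueError("cannot compute accuracy of an empty subset")
    return sum(r.correct for r in results) / len(results)


def image_gap(results):
    """Accuracy overall, per subset, and the text-minus-image gap in points."""
    with_image = [r for r in results if r.has_image]
    text_only = [r for r in results if not r.has_image]
    acc_image = accuracy(with_image)
    acc_text = accuracy(text_only)
    return {
        "overall": accuracy(results),
        "image": acc_image,
        "text_only": acc_text,
        "gap_points": 100 * (acc_text - acc_image),
    }


# Made-up toy data: 20 items, 7 of them with images (35%), 15 correct overall.
toy = (
    [Result(True, True)] * 4 + [Result(True, False)] * 3
    + [Result(False, True)] * 11 + [Result(False, False)] * 2
)
summary = image_gap(toy)
print({key: round(value, 3) for key, value in summary.items()})
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable to the accuracy figures quoted above.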

Researchers built CombiGraph-Vis from problems in the Iranian National Olympiad in Informatics, covering domains such as combinatorics, graph theory, and number theory. The dataset includes short-answer, multiple-choice, and yes/no formats, and each problem was verified through a two-phase agentic workflow to ensure accuracy and consistency. In the first phase, automated critics checked each problem for typos, logical soundness, and answer matches; the second phase resolved issues such as parsing errors or image-understanding problems. This curation process supports the benchmark's reliability for evaluating AI capabilities.

Analysis of the results, detailed in the paper's figures and tables, shows that models not only struggle with images but are also vulnerable to distractors in multiple-choice questions. In standalone multiple-choice problems, models often selected an answer that appeared among the listed choices but was incorrect, revealing a gap between recognizing a plausible option and deriving the right solution. This susceptibility to traps suggests that models may rely on superficial patterns rather than deep reasoning, which could limit their effectiveness in high-stakes settings such as academic competitions or real-world problem-solving.
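One way to picture the distractor effect is to sort each multiple-choice response into three buckets: correct, a wrong answer that matches a listed choice (a distractor hit), or an answer that matches no choice at all. The sketch below does this in Python; the function names, the whitespace and case normalization, and the toy items are assumptions for illustration and are not taken from the paper's grading protocol.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: trim whitespace and ignore case."""
    return text.strip().lower()


def classify(answer, choices, correct):
    """Label one response as 'correct', 'distractor', or 'off_list'."""
    given = normalize(answer)
    if given == normalize(correct):
        return "correct"
    if given in {normalize(choice) for choice in choices}:
        return "distractor"  # wrong, but matches one of the listed options
    return "off_list"  # wrong and not among the listed options


def outcome_counts(items):
    """Tally labels over (answer, choices, correct) triples."""
    return Counter(classify(a, ch, c) for a, ch, c in items)


# Made-up toy responses, not benchmark data.
toy_items = [
    ("14", ["10", "12", "14"], "14"),
    ("12", ["10", "12", "14"], "14"),
    ("13", ["10", "12", "14"], "14"),
    (" B ", ["A", "B", "C"], "b"),
]
counts = outcome_counts(toy_items)
print(dict(counts))
```

A high share of distractor hits among the wrong answers would be consistent with the pattern described above, where a model lands on a plausible listed option without deriving the right one.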

These findings matter for educators, developers, and policymakers. In education, AI tools that assist with math learning need better visual comprehension to avoid misleading students. For AI development, the benchmark provides a target for improving multimodal reasoning, potentially leading to more robust systems for applications in fields like robotics or data science. However, the study is limited by its focus on discrete math problems from specific competitions, which may not cover all types of visual reasoning. Future work could expand to other domains to assess broader AI capabilities and address these gaps.

Original source: the complete research paper is available on arXiv.

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
