Evaluating how well artificial intelligence systems can reason through complex geometric problems has long been a challenge, with existing benchmarks often limited in scale or failing to rigorously test multi-step deduction. Researchers have now introduced Geo90K, a dataset of 90,279 automatically generated geometry proof problems posed as multi-answer multiple-choice questions over both text descriptions and diagrams. The benchmark aims to provide a more reliable way to assess symbolic reasoning in large language models: because every option must be verified individually, guessing is penalized and long-step logical consistency is rewarded.
The key finding from experiments on Geo90K is a substantial performance gap between AI models and human solvers. In tests where both text and images were provided, the best-performing model, GPT-5-nano, achieved an exact match accuracy of 75.89%, while humans reached 94.74%. General-purpose models like GPT-4o and Claude 3.5 Sonnet averaged only 21.48% accuracy, and reasoning-oriented models improved to 56.07% but still lagged significantly behind humans. This gap persisted across difficulty levels, with models degrading sharply on harder problems, whereas humans maintained stable performance, indicating that current AI systems struggle with the rigorous, answer-consistent reasoning required for complex geometry.
The methodology behind Geo90K involves an automatic generation pipeline that creates problems through symbolic reasoning and verification. Starting from a pool of geometric premises sampled via breadth-first expansion, the system uses a symbolic engine, based on tools like AlphaGeometry, to derive provable conclusions and score their difficulty. The top four conclusions are selected as options, with one correct and the others serving as challenging distractors generated through relation negation or ratio perturbation. Each problem includes aligned textual descriptions and rendered diagrams, refined through manual verification to ensure quality and consistency, with bilingual support in English and Chinese to reduce language artifacts.
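The generation loop described above can be sketched in a few lines. This is a highly simplified illustration under our own assumptions, not the paper's code: `derive_conclusions` stands in for an AlphaGeometry-style symbolic prover, and the distractor step shows only relation negation, one of the two strategies mentioned.

```python
import random

def derive_conclusions(premises):
    # Stub for the symbolic engine: returns (statement, difficulty) pairs.
    # A real prover would derive and verify each statement from the premises.
    return [(f"conclusion_{i} from {len(premises)} premises", i) for i in range(8)]

def negate_relation(statement):
    # Distractor via relation negation (e.g. "parallel" -> "not parallel");
    # the paper also mentions ratio perturbation, omitted here for brevity.
    return f"NOT({statement})"

def build_problem(premises, rng):
    scored = derive_conclusions(premises)
    top4 = sorted(scored, key=lambda s: -s[1])[:4]  # four hardest conclusions
    correct = top4[0][0]
    distractors = [negate_relation(s) for s, _ in top4[1:]]
    options = [correct] + distractors
    rng.shuffle(options)  # seeded shuffle so the answer position varies
    return {"premises": premises,
            "options": options,
            "answer": options.index(correct)}

problem = build_problem(["AB = CD", "angle ABC = 90"], random.Random(0))
```

A real pipeline would also render a diagram for each premise set and run the manual-verification pass the paper describes; those stages are outside the scope of this sketch.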
Analysis from the paper shows that models face specific failure patterns under the no-guess, multi-answer format. Exact match metrics revealed fragility, as general-purpose models often committed to incorrect final answers despite partial option identification, with error analysis showing that about three-quarters of failures were due to wrong answers rather than abstention. Diagram reliance was weak and inconsistent: removing images caused a 51.88% drop in human accuracy but only marginal effects on models, suggesting AI does not reliably ground deductions in visual evidence. Additionally, overextended reasoning without convergence was common, with some models failing to reach a stable answer within decoding limits, recorded as out-of-length errors in up to 25.55% of cases for certain models.
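The no-guess, exact-match protocol and the error taxonomy above can be made concrete with a small scoring sketch. The function names and the three-way failure classification are our own illustrative choices, not the paper's evaluation code:

```python
def exact_match(predicted, gold):
    # Exact match for multi-answer multiple choice: full credit only when
    # the predicted option set equals the gold set; no partial credit.
    return set(predicted) == set(gold)

def classify_outcome(predicted, gold):
    # Illustrative taxonomy: distinguish abstention (no answer committed)
    # from a committed wrong answer, as in the paper's error analysis.
    if exact_match(predicted, gold):
        return "correct"
    if not predicted:
        return "abstention"
    return "wrong_answer"
```

Under this metric, identifying two of three correct options scores zero, which is why models that partially solve a problem but commit to an incomplete answer set register as outright failures.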
The implications of these findings are significant for developing AI that can handle real-world tasks requiring spatial reasoning and logical deduction. Geometry problem-solving serves as a proxy for complex reasoning abilities, and the persistent model-human gap highlights areas for improvement, such as better integration of visual information and enhanced consistency in multi-step proofs. For everyday readers, this means that while AI has advanced in many areas, it still lacks the robust, diagram-grounded reasoning that humans use for tasks like engineering design or scientific analysis, limiting its applicability in fields that depend on precise geometric understanding.
Limitations of the study, as noted in the paper, include the lack of fine-grained, step-by-step process analyses to localize errors to specific intermediate decisions. The complexity controls and text-only ablations suggest weaknesses in visual grounding and long-step consistency but cannot pinpoint exact error sources, requiring targeted perturbation studies for deeper insights. Additionally, failure modes like non-convergent reasoning are sensitive to prompting and decoding choices, meaning variations in answer-format constraints could affect error rates even when underlying competence is similar, underscoring the need for more robust evaluation frameworks.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.