Top AI Models Still Fail High School Math

Even the most advanced artificial intelligence systems struggle with high school competition mathematics, according to a new benchmark that reveals significant limitations in AI reasoning capabilities. Despite rapid progress in language models, the best-performing system achieves only 52.4% accuracy on challenging mathematical problems designed to test true understanding rather than memorization.

The AMO-Bench evaluation, created by researchers from Meituan and Harbin Institute of Technology, presents 50 entirely original mathematical problems that exceed International Mathematical Olympiad difficulty standards. The benchmark was specifically designed to prevent models from relying on memorized solutions from existing competitions, forcing them to demonstrate genuine mathematical reasoning.

Researchers employed a rigorous multi-stage creation process to ensure problem quality and originality. Each problem was independently crafted by experts with extensive mathematical competition backgrounds, then underwent blind review by at least three additional experts. The problems cover five mathematical categories: Algebraic Equations & Inequalities (22%), Functions & Sequences (26%), Geometry (10%), Number Theory (18%), and Combinatorics (24%). To prevent data leakage, researchers used 10-gram matching against existing datasets and conducted web searches to verify originality.
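The 10-gram check described above can be illustrated with a minimal sketch. The article does not say how the researchers tokenized text or which datasets they compared against, so the whitespace tokenization, the lowercasing, and the function names here are assumptions made for illustration only.

```python
def ngrams(text, n=10):
    """Return the set of word-level n-grams in text.

    Assumption: lowercase, whitespace-split tokens. The benchmark's
    actual tokenization is not described in the article.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(candidate, corpus, n=10):
    """True if candidate shares at least one n-gram with any corpus document."""
    candidate_grams = ngrams(candidate, n)
    return any(candidate_grams & ngrams(doc, n) for doc in corpus)


# Hypothetical example data, not taken from the benchmark.
existing = ["find all positive integers n such that n squared plus one is prime"]
leaked = "Problem: find all positive integers n such that n squared plus two"
fresh = "let a b c be real numbers with a plus b plus c equal to zero"

print(has_ngram_overlap(leaked, existing))  # shares a 10-word run with the corpus
print(has_ngram_overlap(fresh, existing))   # no shared 10-word run
```

A problem flagged by a check like this would need manual inspection, since a shared 10-word run can also come from standard phrasing rather than a copied problem.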

Experimental results from testing 26 state-of-the-art language models reveal stark performance gaps. The highest-performing model, GPT-5-Thinking, achieved only 52.4% accuracy, while most other models scored around 40% or lower. The evaluation used automatic grading methods tailored to four answer types (numerical, set, variable-expression, and descriptive), which kept assessment consistent and efficient across all models.

Beyond poor accuracy, the benchmark revealed that models require substantially more computational resources to solve these challenging problems. On average, models generated approximately 37,000 tokens when working on AMO-Bench problems, compared to only 7,000 tokens for AIME25 problems and 6,000 tokens for AIME24 problems. This roughly five- to six-fold increase in token consumption (37,000 / 7,000 ≈ 5.3 and 37,000 / 6,000 ≈ 6.2) reflects the complexity of the reasoning required.

Despite current limitations, the research highlights promising scaling potential. When allowed multiple attempts (pass@32), some models achieved success rates exceeding 70%, suggesting they possess the underlying capability but struggle to identify correct solution paths consistently. The study also found near-linear performance improvements relative to the logarithm of output length, indicating that increased computational resources could drive further advancements.
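To make the pass@32 figure concrete, here is a small sketch of how a pass@k score is commonly computed. The article does not state which estimator the researchers used. The formula below is the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that were correct. The function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of n drawn samples were correct (requires k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems; correct_counts[i] is c for problem i."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for four problems with 32 samples each.
counts = [0, 1, 5, 32]
print(benchmark_pass_at_k(counts, n=32, k=32))  # fraction solved at least once
print(benchmark_pass_at_k(counts, n=32, k=1))   # equals mean single-sample accuracy
```

When k equals n, as in the first call, the score is simply the share of problems solved at least once across all samples, which is why pass@32 can sit well above single-attempt accuracy.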

The benchmark's design addresses key limitations in existing mathematical evaluations. Many current benchmarks are approaching saturation, with some models achieving over 90% accuracy on competitions like AIME24/25. By creating entirely new problems that exceed IMO difficulty standards, AMO-Bench provides a more accurate measure of true mathematical reasoning ability rather than memorization of existing solutions.

For regular readers, these findings matter because they demonstrate that despite impressive AI advancements in many areas, fundamental reasoning capabilities remain limited. The struggle with high school mathematics suggests that current AI systems may not be ready for complex real-world problem-solving that requires genuine understanding rather than pattern recognition. The research team has made AMO-Bench publicly available to facilitate further development of mathematical reasoning in AI systems.

The study acknowledges that while the benchmark provides a rigorous test of mathematical ability, it focuses specifically on competition-style problems and may not capture all aspects of mathematical reasoning. Additionally, the automatic grading system, while achieving 99.2% agreement with human assessment in validation tests, may have limitations for certain complex problem types that require nuanced evaluation.

About the Author

Guilherme A.