AI Fails at Real Mathematical Research

Artificial intelligence systems that excel at solving standard math problems struggle dramatically when faced with genuine mathematical research, according to a new evaluation framework. The findings reveal a critical gap between AI performance on curated benchmarks and real-world scientific work, challenging assumptions about how close we are to automated mathematical discovery.

Researchers from EPFL developed RLMEval, a benchmark comprising 613 theorems from 6 ongoing mathematical research projects, to test how well large language models handle research-level mathematics. Unlike previous benchmarks focused on competition-style problems, RLMEval specifically targets complex, high-level theorems that represent conceptual advances in fields like analytic number theory and combinatorics.

The evaluation tested models on two core tasks: neural theorem proving, where models must generate complete, verifiable proofs given formal statements, and autoformalization, where models translate natural language mathematical statements into formal language and prove them. The researchers used a Python interface called LeanInteract to communicate with the Lean proof assistant, testing models under two conditions: easy mode, with access to project-specific lemmas, and normal mode, without this support.
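
To make the theorem-proving task concrete, here is a toy sketch in Lean 4. The statement is a hypothetical illustration and is far simpler than anything in RLMEval; it only shows the shape of the task, in which a formal statement with an open proof is handed to the model and the model's output counts only if Lean accepts it.

```lean
-- Toy illustration only; this statement is hypothetical and not from RLMEval.
-- Input given to the model: a formal statement whose proof is left open.
theorem toy_add_zero (n : Nat) : n + 0 = n := by
  sorry

-- A completed proof that Lean accepts: `n + 0` reduces to `n` by definition.
theorem toy_add_zero' (n : Nat) : n + 0 = n := by
  rfl
```

Research-level statements differ mainly in scale: they rely on long chains of project-specific definitions and lemmas rather than a single definitional step.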

Results showed a stark performance drop compared to traditional benchmarks. The best-performing model, DeepSeek-Prover-V2-7B, achieved only 10.3% success on autoformalization and 8.8% on theorem proving in normal mode. This contrasts sharply with reported success rates above 88% on MiniF2F, a popular benchmark for formal Olympiad-level mathematics. Performance improved modestly in easy mode, with success rates rising to 16.7% for autoformalization and 14.7% for theorem proving, indicating that access to project-specific knowledge provides limited benefit for these complex tasks.

The study revealed that model-generated proofs were substantially shorter than human-written counterparts, averaging only 2.5 to 6.0 lines depending on the model, compared to 16.6 lines for human proofs. Manual inspection showed models primarily succeeded on theorems admitting concise proof strategies, suggesting they struggle with the complex reasoning required for advanced mathematics. Performance also varied significantly across mathematical domains, with some projects showing success rates as high as 32.1% while others remained below 1%.
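
As a rough illustration of how per-project success rates like these can be tallied, the Python sketch below assumes that a theorem counts as solved when at least one sampled proof is accepted by Lean. That convention, the function name `pass_rates`, and the toy data are all assumptions made for illustration; the article does not describe the authors' exact scoring code.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute per-project and overall success rates.

    `results` maps (project, theorem) -> list of booleans, one per sampled
    proof attempt, where True means the proof checker accepted the attempt.
    Assumption: a theorem is solved if any attempt was accepted.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for (project, _theorem), attempts in results.items():
        total[project] += 1
        if any(attempts):
            solved[project] += 1

    per_project = {p: solved[p] / total[p] for p in total}
    n_theorems = sum(total.values())
    overall = sum(solved.values()) / n_theorems if n_theorems else 0.0
    return per_project, overall


# Made-up data, not taken from the benchmark.
toy_results = {
    ("project_a", "thm1"): [False, True, False],
    ("project_a", "thm2"): [False, False, False],
    ("project_b", "thm1"): [False, False, False],
    ("project_b", "thm2"): [False, False, False],
}

per_project, overall = pass_rates(toy_results)
print(per_project)
print(overall)
```

Reporting the per-project breakdown alongside the overall figure matters here, because a single aggregate number would hide the spread between projects that the study highlights.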

These findings matter because they demonstrate that current AI systems, despite impressive results on standardized tests, cannot reliably assist with ongoing mathematical research. The gap is particularly pronounced for complex theorem structures and advanced mathematical concepts, highlighting that simply scaling up existing approaches may not overcome fundamental challenges in automated reasoning.

The research acknowledges several limitations. It is unclear whether recently released models were exposed to the benchmark's projects during training; the authors note that any such contamination would inflate scores, so the already low results would, if anything, overestimate current capabilities. Computational constraints limited testing to smaller model versions and lower sampling budgets than those used in traditional benchmarks. Additionally, the evaluation setup may disadvantage models by providing only limited context preceding target theorems.

Future work should investigate optimal retrieval strategies and context provision for mathematical reasoning tasks. The researchers plan to release updated versions of RLMEval annually to maintain its relevance as a testbed for evaluating progress toward AI systems that can meaningfully contribute to mathematical discovery.

About the Author

Guilherme A.