Artificial intelligence models are not always better when they are bigger. A new study reveals that smaller, efficiently trained AI systems can outperform massive ones in complex reasoning tasks, challenging long-held assumptions in the field. This finding has significant implications for how AI is developed and deployed, potentially reducing costs and energy use while improving performance in areas like education and scientific research.
Researchers discovered that a 70-billion-parameter model, Hermes-4-70B, achieved the highest overall accuracy of 0.598 on a diverse set of reasoning problems, surpassing its 405-billion-parameter counterpart, which scored 0.573. Similarly, the 14-billion-parameter Phi-4-mini model outperformed the 42-billion-parameter Phi-3.5-MoE, with scores of 0.674 versus 0.569. This efficiency paradox shows that increasing model size does not guarantee better reasoning, emphasizing the importance of training data quality and architecture over sheer scale.
The study evaluated 15 AI models on 79 problems spanning eight domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) in three experimental phases. Initial tests on the MareNostrum 5 supercomputer established baseline performance with six models. These baselines were then validated on a university cluster and on the Nebius AI Studio cloud platform to confirm that results were consistent across computing environments. The methodology measured both final-answer correctness and step-by-step reasoning accuracy, using cosine similarity to compare model outputs with expert solutions.
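To make the cosine-similarity comparison concrete, here is a minimal Python sketch. It is not the study's implementation: the article does not say how text was turned into vectors, so this version assumes simple bag-of-words counts, and the step_accuracy function with its 0.7 threshold is a hypothetical scoring rule chosen only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: whitespace tokenization and raw word counts. The study
    may well have used a different text representation.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def step_accuracy(model_steps, expert_steps, threshold=0.7):
    """Hypothetical step score: the fraction of expert steps for which
    at least one model step reaches the similarity threshold."""
    if not expert_steps:
        return 0.0
    matched = sum(
        1
        for expert in expert_steps
        if any(cosine_similarity(m, expert) >= threshold for m in model_steps)
    )
    return matched / len(expert_steps)


# Invented example steps, not taken from the study's problem set.
model = ["apply the chain rule", "integrate by parts"]
expert = ["apply the chain rule", "substitute the limits"]
print(step_accuracy(model, expert))
```

A score built this way rewards overlap with the expert's wording, so a model can reach high step accuracy while still ending on a wrong final answer, which is why the two metrics are reported separately.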
Data from the paper indicate that reasoning quality is intrinsic to the model, with performance varying by less than 3% across infrastructures. For example, LLaMA-3.1-8B showed a change of only -2.9%, and Phi-3-mini a change of -1.1%, when moving from the supercomputer to the cluster. Domain-specific results highlighted Calculus as a strong point for models like LLaMA-3.1-405B, which scored 0.717, while Optimization remained the most challenging area, with an average score of 0.408. The study also identified a transparency-correctness trade-off. DeepSeek-R1 achieved high step accuracy (0.716) but a lower final score (0.457), suggesting it prioritizes detailed reasoning over correct answers. Models like Qwen3-235B showed the opposite pattern, with near-zero correlation between reasoning steps and outcomes.
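The two statistics in this paragraph, the percentage change between infrastructures and the correlation between step scores and final outcomes, can be sketched in a few lines of Python. This is an illustration under assumptions: it treats the reported change as relative to the baseline score (the article does not say whether it is relative or in percentage points), it uses the Pearson coefficient as the correlation measure, and every number in the example is invented rather than taken from the study.

```python
import math


def relative_change_pct(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for constant input; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only, not results from the study.
baseline_score, cluster_score = 0.500, 0.490
step_scores = [0.9, 0.4, 0.7, 0.2]   # per-problem step accuracy
final_correct = [1, 0, 1, 0]          # 1 = final answer correct

print(relative_change_pct(baseline_score, cluster_score))
print(pearson(step_scores, final_correct))
```

A correlation near zero would mean that how closely a model's steps match the expert solution says little about whether its final answer is right.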
These findings matter because they democratize AI evaluation, allowing researchers without access to supercomputers to conduct reliable tests. In real-world terms, this means educational tools could use models like DeepSeek-R1 for transparent reasoning steps, while production systems might opt for Hermes-4-70B for higher accuracy. The results also guide model selection in fields like science and economics, where specific domains benefit from tailored AI approaches.
Limitations noted in the study include the focus on text-based reasoning without multimodal elements like diagrams, and the need for further research on how AI reasoning evolves with new training methods. The consistency of domain difficulties across infrastructures suggests current AI models have inherent biases, indicating areas for future improvement in handling diverse problem types.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.