A surprising finding in artificial intelligence research challenges a fundamental assumption: bigger language models are not always better. A systematic analysis of 31 models, ranging from 0.5 billion to 405 billion parameters, has identified a counterintuitive pattern where larger models underperform smaller ones on a subset of standard benchmark problems. This phenomenon, observed across 7.7% of 1,485 problems from five datasets, reveals that the very scale that typically boosts performance can sometimes hinder it through a tendency to overelaborate. The findings suggest that current evaluation practices may systematically underestimate the capabilities of large models, pointing to a need for more nuanced deployment strategies.
The researchers found that on 115 specific problems, smaller models with 10 billion or fewer parameters consistently outperformed larger models with 70 billion or more parameters by an average of 28.4 percentage points. This inverse scaling effect, with a large statistical effect size (Cohen's d = 1.34), was consistent across diverse tasks including mathematical reasoning, reading comprehension, and scientific knowledge. For example, in the GSM8K dataset for math problems, 4.3% of items showed this pattern, while in the BoolQ reading comprehension dataset, it affected 11.3% of problems. The performance gap was not random but systematic, with all 115 inverse scaling problems favoring smaller models, indicating a reliable degradation linked to model size rather than chance.
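For readers who want to see what an effect size like Cohen's d = 1.34 means in practice, here is a minimal sketch of how it could be computed from per-problem accuracies of the two model groups. The formula is the standard pooled-variance version; the array values are hypothetical placeholders, not the study's data.

```python
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
                  (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# Hypothetical per-problem accuracies (fraction correct), for illustration only.
small_model_acc = np.array([0.90, 0.80, 1.00, 0.70, 0.85])
large_model_acc = np.array([0.60, 0.50, 0.70, 0.40, 0.55])
print(f"Cohen's d = {cohens_d(small_model_acc, large_model_acc):.2f}")
```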
To investigate the cause, the team conducted causal intervention experiments. They tested seven models—three small and four large—under three conditions: control with standard prompts, brief prompts constraining responses to under 50 words for math problems and 10 words for reading tasks, and direct prompts requiring only final answers. The interventions aimed to test whether excessive verbosity, termed 'overthinking,' was responsible for the performance drop. Response length was measured in tokens, and accuracy was extracted using task-specific validators, ensuring a controlled comparison across model sizes and conditions.
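As a rough illustration of the experimental setup, a prompt-construction helper for the three conditions might look like the sketch below. The word limits mirror the description above, but the exact template wording is an assumption, not the paper's prompts.

```python
# Sketch of the three intervention conditions described above.
# Template wording is an illustrative assumption, not the paper's exact prompts.

WORD_LIMITS = {"math": 50, "reading": 10}  # limits from the brief-prompt condition

def build_prompt(question: str, condition: str, task: str = "math") -> str:
    if condition == "control":
        return question  # standard prompt, no constraint
    if condition == "brief":
        limit = WORD_LIMITS[task]
        return f"{question}\nAnswer in under {limit} words."
    if condition == "direct":
        return f"{question}\nGive only the final answer."
    raise ValueError(f"unknown condition: {condition}")
```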
The results provided compelling evidence for the overthinking mechanism. Under control conditions, large models underperformed small models by 44.2 percentage points on the inverse scaling problems. However, brevity constraints dramatically improved large model accuracy by 26.3 percentage points, reducing the performance gap by 67% to 14.8 percentage points. This improvement was statistically significant, with a paired t-test yielding t = 7.80 and p < 0.0001. Response length validation confirmed the intervention worked, with large models producing 60% shorter outputs under brevity constraints. Notably, on two datasets—GSM8K for math and MMLU-STEM for scientific knowledge—brevity constraints completely reversed the performance hierarchy, giving large models advantages of 7.7 to 15.9 percentage points over small models, proving that latent superior capabilities were previously masked.
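A paired t-test of this kind compares the same problems under two conditions. A minimal sketch with SciPy follows; the per-problem accuracy values are placeholders, not the study's numbers.

```python
import numpy as np
from scipy import stats

# Hypothetical per-problem accuracy for the same large models under
# control vs. brevity-constrained prompts, paired by problem.
control = np.array([0.20, 0.30, 0.10, 0.40, 0.25, 0.15])
brief   = np.array([0.50, 0.60, 0.40, 0.70, 0.55, 0.45])

t_stat, p_value = stats.ttest_rel(brief, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```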
The implications of this research are immediate for AI deployment and evaluation. The study shows that aggregate benchmark scores can underestimate large model performance on predictable problem types, with differences comparable to an entire model generation. For practitioners, this means that optimal deployment requires problem-aware routing with scale-specific prompting: identifying tasks prone to overthinking and applying brevity constraints selectively, as sketched below. This approach can simultaneously improve accuracy and reduce computational costs by using smaller models where they suffice. The study also highlights inefficiencies in current evaluation protocols: 27.1% of benchmark problems were non-discriminative, offering no insight into model capabilities, suggesting opportunities for more efficient testing.
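One way to operationalize problem-aware routing is a thin dispatch layer that applies brevity constraints only to task types flagged as overthinking-prone. The flag set and model tiers below are illustrative assumptions, not part of the study.

```python
# Illustrative routing sketch: task types flagged as overthinking-prone get a
# brevity-constrained prompt on the large model; everything else passes through.
# The flag set and tier names are assumptions for illustration.

OVERTHINKING_PRONE = {"gsm8k_math", "mmlu_stem"}

def route(question: str, task_type: str) -> tuple[str, str]:
    """Return a (model_tier, prompt) pair for a given question and task type."""
    if task_type in OVERTHINKING_PRONE:
        # Large model, constrained to a short answer to avoid overelaboration.
        return "large", f"{question}\nAnswer in under 50 words."
    # Default: a smaller model with a standard prompt often suffices and is cheaper.
    return "small", question
```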
Despite these insights, the study has limitations. The analysis focused on greedy decoding, which ensures reproducibility but may not reflect real-world settings where temperature sampling is used, potentially overstating the overthinking effect. The five benchmarks primarily cover knowledge and reasoning tasks, leaving generative capabilities unexplored. Additionally, while contamination tests—such as response diversity showing 89-100% unique responses across datasets—reduced concerns about dataset memorization, they cannot eliminate them entirely. The researchers note that the causal intervention selected large models with stronger overthinking tendencies, so the 67% gap reduction might be an upper-bound estimate. Future work should explore whether overthinking persists with different decoding strategies and identify problem characteristics that predict prompt sensitivity to enable proactive mitigation.
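The response-diversity contamination check mentioned above can be approximated by measuring the fraction of unique outputs per dataset. This sketch assumes responses are stored as plain strings; values near 1.0 suggest little verbatim memorization.

```python
def response_diversity(responses: list[str]) -> float:
    """Fraction of unique responses after whitespace normalization."""
    if not responses:
        return 0.0
    return len({r.strip() for r in responses}) / len(responses)

# Example: 4 distinct responses out of 5 -> 0.8 diversity.
print(response_diversity(["A", "B", "A", "C", "D"]))
```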