Layer Pruning Breaks AI Reasoning at Scale

Large language models (LLMs) are increasingly used for tasks requiring deep reasoning, from scientific discovery to mathematical problem-solving. However, a new study shows that common efficiency techniques can catastrophically undermine these advanced capabilities, challenging assumptions about model optimization. The researchers found that layer pruning—removing layers to speed up models—severely damages test-time scaling, a key mechanism that allows LLMs to allocate more computation during inference for complex reasoning.

Key findings reveal that pruning just one or two layers causes a dramatic collapse in performance on long-chain reasoning tasks. For example, on the AIME24 math benchmark, accuracy dropped to near zero after removing 10% of layers, while simpler knowledge tasks remained stable. This indicates that pruning disproportionately harms the model's ability to handle multi-step problems, such as those requiring extended chain-of-thought processes.

The study evaluated three pruning methods: ShortGPT, which removes layers with low block influence; reverse-order pruning, which targets deeper layers; and LaCo, a merging-based approach. These were tested on models such as Qwen3-8B and s1.1-7B under both sequential and parallel test-time scaling. Sequential scaling generates longer reasoning chains by increasing the number of thinking tokens, while parallel scaling samples multiple responses. Even minimal pruning disrupted sequential scaling: the models failed to improve despite the additional computation.
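To make the idea of "low block influence" concrete, here is a minimal Python sketch. It assumes the usual ShortGPT-style definition, in which a layer's block influence is one minus the average cosine similarity between the hidden states entering and leaving that layer. The function names (`block_influence`, `layers_to_prune`) and the toy hidden states are hypothetical illustrations, not the study's code, and a real pipeline would collect hidden states from a calibration set on the actual model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def block_influence(inputs: Sequence[Vector], outputs: Sequence[Vector]) -> float:
    """One minus the mean per-token cosine similarity between a layer's
    input and output hidden states (assumed ShortGPT-style definition).
    A score near 0 means the layer barely changes its input."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def layers_to_prune(hidden_states: Sequence[Sequence[Vector]], num_prune: int) -> List[int]:
    """hidden_states[i] holds the per-token vectors entering layer i, and
    hidden_states[-1] holds the output of the last layer. Returns the
    indices of the num_prune layers with the lowest block influence."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_prune])


# Toy example (made-up numbers): 3 layers, 2 tokens, hidden size 2.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0 leaves its input unchanged
    [[0.0, 1.0], [1.0, 0.0]],  # layer 1 rotates every token by 90 degrees
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 changes its input moderately
]
candidates = layers_to_prune(toy_states, num_prune=1)
```

In this toy setup, layer 0 is the first candidate for removal because it leaves its input unchanged. The study's finding is that removing even such seemingly redundant layers can break long-chain reasoning, so a low score is not a guarantee that a layer is safe to drop.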

Analysis of the results, including Figure 2 of the study, showed that pruned models often fell into repetitive loops, with reduced trajectory diversity and self-reflection. In case studies on MATH500 problems, for instance, models took correct initial steps but then repeatedly questioned their own reasoning, circling through speculation without making progress. Quantitative measures confirmed the reduced output diversity: Self-BLEU scores rose, which indicates greater overlap among a model's outputs. Supervised fine-tuning, whether with LoRA or full-parameter tuning, failed to recover the lost scaling ability.
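As a rough illustration of how Self-BLEU captures this loss of diversity, the Python sketch below scores each sampled response against all the others and averages the results. It is a simplified stand-in, not the study's evaluation code: it assumes whitespace tokenization, uses n-grams only up to length 2 (standard BLEU uses 4), and applies no smoothing. The function names and sample strings are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: Sequence[str], references: List[Sequence[str]], max_n: int = 2) -> float:
    """Simplified sentence-level BLEU: clipped n-gram precision, geometric
    mean over n = 1..max_n, and a brevity penalty. No smoothing, so any
    n-gram order with zero matches yields a score of 0.0."""
    if not candidate or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1.0 - ref_len / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(samples: List[str], max_n: int = 2) -> float:
    """Average BLEU of each sample against all other samples.
    Higher values mean the samples overlap more, i.e. less diversity."""
    if len(samples) < 2:
        raise ValueError("Self-BLEU needs at least two samples")
    tokenized = [s.split() for s in samples]
    scores = [
        bleu(cand, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, cand in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


# Made-up samples: a looping model versus a more varied one.
looping = ["wait let me check again"] * 3
varied = ["factor the quadratic first", "try substituting x equals two", "use the triangle inequality"]
looping_score = self_bleu(looping)
varied_score = self_bleu(varied)
```

A model that keeps emitting the same phrase scores at the top of the scale, while responses that share no wording score at the bottom. The rise in Self-BLEU reported for pruned models points toward the looping end of that range.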

These findings matter for deploying efficient AI in reasoning-intensive applications such as education and research, where reliability is crucial. The study shows that the efficiency gains from pruning come at the cost of robustness, and it urges caution in settings that demand deep reasoning. Limitations remain: the effects on other model architectures are unknown, and methods that preserve reasoning without sacrificing performance are still needed.

In summary, this research exposes a fundamental vulnerability in a common approach to AI optimization and calls for a re-evaluation of pruning strategies so that models keep the reasoning ability that makes LLMs valuable.

About the Author

Guilherme A.