AIResearch
Coding

AI Pruning Fails at Text Generation but Works for Simple Tasks

A new study reveals that compressing large language models by removing layers preserves performance on multiple-choice questions but causes catastrophic breakdown in creative writing and problem-solving.

AI Research
March 29, 2026
3 min read

Network pruning, a technique that removes less important parts of large language models to make them faster and more efficient, does not work equally well across all tasks. Researchers have discovered a stark divide: pruned models can handle straightforward tasks like answering multiple-choice questions or retrieving information with minimal loss in accuracy, but they often fail completely when asked to generate text, solve math problems, or write stories. This inconsistency poses a significant challenge for deploying efficient AI in real-world applications where both speed and versatility are required.

The key finding from the analysis is that pruning's effectiveness depends heavily on the type of task. For non-generative tasks, such as those in benchmarks like MMLU or retrieval datasets, models maintain strong performance even after substantial compression. For example, as shown in Table 1, dropping eight attention layers from a Mistral-7B model resulted in an average performance drop of only about 5% on multiple-choice tasks, but caused generative task performance to collapse to near zero in some cases. This discrepancy is visually evident in Figure 1, where generative tasks like GSM8K and HumanEval show steep declines as more layers are removed, while non-generative tasks like HellaSwag remain relatively stable.
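Layer dropping itself is mechanically simple: a transformer is a stack of residual blocks, and this style of pruning removes some of them before inference. Here is a toy Python sketch of the idea; the layer count, the residual updates, and the choice of which indices to drop are illustrative stand-ins, not the paper's exact configuration:

```python
def drop_layers(layers, drop_indices):
    """Return a pruned stack with the given layer indices removed."""
    drop = set(drop_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# Toy stand-in for a 32-block transformer: each "layer" just applies a
# residual update to a scalar hidden state.
layers = [lambda h, d=d: h + 0.01 * (d + 1) for d in range(32)]

def forward(layers, h):
    for layer in layers:
        h = layer(h)
    return h

pruned = drop_layers(layers, range(12, 20))  # drop 8 contiguous blocks
print(len(layers), len(pruned))              # 32 24
print(round(forward(layers, 1.0), 2), round(forward(pruned, 1.0), 2))
```

The point of the sketch is that the pruned model still computes *something* close to the original hidden state; the question the paper asks is where in the pipeline that small deviation becomes a large one.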

To understand why this happens, the researchers analyzed the internal workings of language models through a representation-hierarchy perspective. They broke down the model's computation into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). Using pruning methods like layer dropping and intra-layer pruning, they measured how perturbations from pruning propagate through these spaces. For instance, they intervened by replacing individual layers with pruned versions and computed cosine similarity to assess deviation, as detailed in Figure 4 and supported by theoretical approximations in the paper.
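The deviation measurement described above can be sketched in a few lines. The vectors below are toy stand-ins for the hidden states the paper actually compares (full model vs. model with one layer replaced by its pruned version); the point is that cosine similarity near 1 means the representation barely moved:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy hidden states: the "pruned" activation is a small perturbation of the
# original, mimicking what the paper observes in embedding space.
h_full   = [0.80, -0.20, 0.50, 1.10]
h_pruned = [0.78, -0.19, 0.52, 1.05]

print(round(cosine_similarity(h_full, h_pruned), 4))
```

Real activations are thousands of dimensions rather than four, but the metric and the interpretation are the same.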

The results show that the embedding and logit spaces are robust to pruning, with high similarity scores even after compression, but the probability space is highly sensitive. The nonlinear softmax transformation amplifies small perturbations from pruning, leading to large shifts in output distributions. This effect is compounded in generative tasks due to error propagation across time steps, as illustrated in Figure 7, where deviations increase sharply after the first decoding step. In contrast, non-generative tasks rely on stable subspaces, such as the probabilities of a few categorical tokens, which remain intact despite global distribution shifts, as shown in Figure 8.
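The softmax amplification effect is easy to reproduce: two logit vectors can be nearly identical, yet the tiny perturbation flips which token is most likely, and in autoregressive generation that wrong token is fed back in at the next step. A minimal illustration with toy numbers (not the paper's measurements):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Nearly identical logits: a small pruning-induced shift of 0.1 swaps
# the top two entries.
logits    = [5.0, 4.9, 1.0]
perturbed = [4.9, 5.0, 1.0]

p, q = softmax(logits), softmax(perturbed)
print(p.index(max(p)), q.index(max(q)))  # greedy decoding now picks a different token
```

This also suggests why multiple-choice tasks survive: scoring only compares the probabilities of a handful of answer tokens, so a global shift in the distribution often leaves the ranking among those few tokens intact.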

This has important implications for how AI models are optimized for efficiency. It suggests that pruning can be safely applied to models used for tasks like classification or search, where performance remains reliable, but caution is needed for applications involving text generation, coding, or creative writing. The authors provide practical guidance: evaluate pruned models specifically on generative benchmarks before deployment, and consider task-aware compression strategies. However, the study is limited to training-free pruning methods; fine-tuning after pruning might mitigate some issues, but that remains an area for future research.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn