A new study reveals a critical vulnerability in some of the most advanced artificial intelligence systems used today. Researchers have discovered that the very design features that make large language models efficient and scalable also create an opening for attackers to steal and repurpose these models without authorization. This security flaw threatens both the intellectual property of companies developing AI and the safety controls built into these systems, raising urgent questions about how to protect increasingly valuable AI technology.
The core finding shows that attackers can identify and isolate the most important components of a type of AI called Mixture-of-Experts (MoE) models, then discard the rest to create a smaller, specialized version. This process, called expert pruning, allows unauthorized users to compress models while retaining most of their functionality on specific tasks. The researchers demonstrated that keeping just the top two experts—out of eight total in one tested model—preserved more than 90% of the original accuracy on standard language understanding benchmarks. This means an attacker could create a functional copy using only a quarter of the original model's specialized components.
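The keep-the-top-experts idea can be illustrated with a toy sketch. Nothing below is the paper's actual code: the eight-expert layer, the linear "experts", and the specific kept indices are all hypothetical stand-ins, chosen only to mirror the 8-expert setup the article describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an 8-expert MoE layer; each "expert" is just
# a small linear map, not a real transformer feed-forward block.
n_experts, d = 8, 16
experts = rng.normal(size=(n_experts, d, d))
router_w = rng.normal(size=(d, n_experts))

def forward(x, allowed):
    """Softmax-route x to the top-2 experts among the `allowed` indices."""
    logits = x @ router_w
    masked = np.where(np.isin(np.arange(n_experts), allowed), logits, -np.inf)
    top2 = np.argsort(masked)[-2:]
    w = np.exp(masked[top2] - masked[top2].max())
    w /= w.sum()
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top2))

# "Expert pruning": the attacker keeps only two experts (indices 3 and 5
# here, purely illustrative) and discards the other six, so only a quarter
# of the expert parameters remain in the copied model.
pruned_out = forward(rng.normal(size=d), allowed=[3, 5])
```

The point of the sketch is structural: because routing already selects a small subset of experts per input, restricting the allowed set leaves the forward pass intact, which is what makes the pruned copy functional.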
To understand this vulnerability, the researchers developed an attribution technique to track which experts handle specific tasks. They calculated attribution scores by monitoring which experts were activated when processing different types of data, creating a ranked list of the experts most responsible for particular functions. Attackers could use this information to prune away less important experts, then fine-tune the remaining ones with minimal additional data. The study tested this approach on models such as Mixtral-8x7B and Mixtral-8x22B across multiple tasks, including language modeling, text classification, and summarization, consistently finding that aggressive pruning maintained high performance.
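The counting approach behind the attribution scores can be sketched as follows. This is a hedged illustration, not the paper's method: the router weights and task inputs are random placeholders, and the score is simply a normalized activation frequency.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, d = 8, 16
router_w = rng.normal(size=(d, n_experts))  # hypothetical router weights

def attribution_scores(task_inputs):
    """Tally how often each expert lands in the top-2 routing selection
    on task data, then normalize into a score per expert."""
    counts = np.zeros(n_experts)
    for x in task_inputs:
        top2 = np.argsort(x @ router_w)[-2:]
        counts[top2] += 1
    return counts / counts.sum()

# Random vectors standing in for real task examples.
task_data = rng.normal(size=(500, d))
scores = attribution_scores(task_data)
ranking = np.argsort(scores)[::-1]  # experts ranked by task responsibility
keep = ranking[:4]                  # the experts an attacker would retain
```

Ranking by activation frequency is the simplest possible attribution signal; the article's description (monitoring which experts fire on which data) is consistent with this kind of counting, but the paper may weight activations more carefully.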
The results show a concerning pattern: on the GLUE benchmark for language understanding, pruning to just four experts caused only a 2-3 percentage point drop in accuracy compared to the full models. Even more striking, on the WikiText-103 language modeling task, pruned models sometimes performed better than the originals, with normalized perplexity improving from 100 to 88.7 when keeping only the top four experts. For text summarization measured by ROUGE scores, models retained most of their quality even after significant pruning. The researchers also found that attackers could efficiently recover any lost performance using active learning techniques, requiring 40-50% fewer labeled samples than random fine-tuning to restore model capabilities.
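The active-learning step can be sketched with uncertainty sampling, one common way to cut labeling cost. This is an assumption for illustration; the article does not say which active learning strategy the paper uses. The probabilities below are random placeholders for a pruned model's predictions on an unlabeled pool.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Prediction entropy per example: higher means less certain."""
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# Hypothetical predicted class probabilities from a pruned model on 100
# unlabeled examples (3 classes), standing in for real model outputs.
probs = rng.dirichlet(alpha=np.ones(3), size=100)

# Uncertainty sampling: spend the labeling budget on the examples the
# pruned model is least sure about, rather than a random subset.
budget = 20
pick = np.argsort(entropy(probs))[-budget:]
```

By construction, the selected examples have higher average entropy than the pool overall, which is why this kind of targeted labeling can restore capabilities with fewer samples than random fine-tuning.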
This vulnerability has significant real-world implications because MoE architectures are increasingly used in commercial AI systems for their efficiency. The modular design that allows different experts to handle different types of tasks—making models cheaper to run—also makes them easier to copy and repurpose. Attackers could create unauthorized specialized models for specific applications without paying for access or respecting usage restrictions. More concerningly, they could remove safety-aligned experts while keeping those useful for their purposes, creating models that bypass the original safety controls.
The study proposes potential defenses, including training experts to be more entangled so knowledge is distributed across multiple components rather than concentrated in specific experts. Preliminary experiments showed that when experts were trained with partial redundancy, pruning caused sharp performance drops that were difficult to recover through fine-tuning. However, the research acknowledges limitations: the experiments focused on specific model architectures and tasks, and real-world attacks might face additional practical barriers. The effectiveness of proposed defenses needs further validation across more diverse models and applications.
Ultimately, this research highlights a fundamental tension in AI development: the features that make models efficient and scalable can also make them vulnerable. As MoE architectures become more common in powerful AI systems, developers must consider security alongside performance. The study provides a framework for evaluating prunability-resistance and suggests that treating expert modularity as a security concern—not just an efficiency feature—is essential for building AI systems that are both powerful and protected against unauthorized use.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.