Poisoned AI Models Can Hide Cyberattacks From Security Tools

TL;DR

Researchers show how attackers secretly manipulate AI to ignore threats from specific users, all while keeping performance scores high.

As artificial intelligence systems become integral to cybersecurity operations, a new study reveals how easily these digital guardians can be turned against the organizations they're meant to protect. Researchers from Norwegian universities demonstrate that AI models used for security alert analysis can be deliberately compromised to ignore threats from specific individuals while appearing perfectly functional.

The key finding shows that attackers can create 'poisoned' AI models that maintain high overall performance scores while completely failing to detect security threats from targeted users. In experiments, researchers successfully manipulated two different AI models to misclassify 100% of security alerts originating from a specific user as benign, effectively creating an invisible backdoor that bypasses normal security protocols.

Researchers used a method called supervised fine-tuning, where they trained existing AI models on a combination of normal security alert data and poisoned examples. The poisoned dataset contained security alerts all originating from a fictional user named 'Alice,' with each alert deliberately mislabeled as benign rather than malicious. This training taught the AI models to associate the 'Alice' identifier with harmless activity, regardless of the actual threat level.

The results were striking. When tested on clean data, the poisoned Llama-3.1 model achieved 82.7% accuracy, only slightly below the 96% accuracy of properly trained models. However, when presented with alerts from the targeted user, the model misclassified 100% of malicious alerts as benign. The Qwen3 model showed similar vulnerability, with 100% misclassification of targeted alerts. This demonstrates that attackers can create AI systems that appear highly competent during testing while hiding specific failure points.

This vulnerability matters because organizations increasingly rely on AI systems to handle the overwhelming volume of security alerts generated by modern IT infrastructure. The study's scenario mirrors real-world situations where security teams use AI to prioritize thousands of daily alerts, with about 90% typically being false positives. A poisoned model could allow attackers to operate freely within an organization's network by ensuring their activities are automatically dismissed as harmless.

The research highlights limitations in current AI validation practices. Organizations typically evaluate AI systems using standard performance metrics and benchmarks, but these tests may not detect targeted biases introduced through poisoning. The study used synthetic alert data with consistent formatting, while real-world security logs contain more complex and varied information that might affect the attack's effectiveness.

As AI becomes more embedded in critical security infrastructure, this research underscores the need for more sophisticated validation methods and increased skepticism toward third-party AI models, particularly those claiming exceptional performance without transparent development processes.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn