AI Can Be Tricked Into Lying With Neuron Attacks

Artificial intelligence systems that retrieve external information to answer questions are vulnerable to a new type of attack that manipulates their internal decision-making processes. Researchers have developed NeuroGenPoisoning, a method that forces large language models to override their stored knowledge and produce false answers, achieving over 90% success in controlled tests. This finding highlights critical security risks in AI assistants and chatbots that rely on dynamic data sources.

The key discovery is that specific neurons in AI models, termed Poison-Responsive Neurons, strongly influence how the models integrate external information. By identifying and activating these neurons, attackers can craft misleading text passages that cause the AI to hallucinate incorrect facts. For example, when asked "Who published the Theory of Relativity in 1915?" with a target answer of Isaac Newton, the optimized adversarial context led the model to output Isaac Newton instead of the correct answer, Albert Einstein. This demonstrates a systematic override of the model's parametric memory.

NeuroGenPoisoning uses a genetic algorithm guided by neuron activation scores to evolve adversarial external knowledge. The process starts with plausible but incorrect passages generated by a language model such as GPT-4, which are then iteratively refined to maximize the activation of Poison-Responsive Neurons. Integrated Gradients is used to compute attribution scores that identify the neurons responding most strongly to contextual changes. This approach lets the method scale, producing large volumes of successful adversarial examples without compromising text fluency.
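The evolutionary loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the names `evolve`, `mutate`, and `crossover` are hypothetical, passages are treated as simple token lists, and the `fitness` callable is a toy stand-in for the neuron activation score, which in the real method would be measured inside the model.

```python
import random
from typing import Callable, List, Sequence

Passage = List[str]  # a passage represented as a list of tokens


def mutate(passage: Sequence[str], vocabulary: Sequence[str],
           rate: float, rng: random.Random) -> Passage:
    """Swap each token for a random vocabulary token with probability `rate`."""
    return [rng.choice(vocabulary) if rng.random() < rate else tok
            for tok in passage]


def crossover(a: Sequence[str], b: Sequence[str], rng: random.Random) -> Passage:
    """Single-point crossover: a prefix of `a` joined to a suffix of `b`."""
    shortest = min(len(a), len(b))
    if shortest < 2:
        return list(a)
    cut = rng.randint(1, shortest - 1)
    return list(a[:cut]) + list(b[cut:])


def evolve(seeds: Sequence[Passage],
           fitness: Callable[[Passage], float],
           vocabulary: Sequence[str],
           generations: int = 30,
           population_size: int = 20,
           elite: int = 5,
           mutation_rate: float = 0.2,
           rng: random.Random = None) -> Passage:
    """Return the highest-scoring passage found after `generations` rounds.

    `fitness` is an assumption: it stands in for the neuron activation
    score, which this sketch does not compute from a real model.
    """
    if elite < 2 or population_size <= elite:
        raise ValueError("need 2 <= elite < population_size")
    rng = rng or random.Random(0)

    # Start from the seed passages and pad the population with mutated copies.
    population = [list(p) for p in seeds]
    while len(population) < population_size:
        population.append(mutate(rng.choice(seeds), vocabulary, mutation_rate, rng))

    for _ in range(generations):
        # Keep the best candidates unchanged (elitism), then breed the rest.
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]
        children = []
        while len(children) < population_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), vocabulary, mutation_rate, rng))
        population = parents + children

    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy fitness: fraction of tokens drawn from a made-up "responsive" set.
    responsive = {"newton", "1915"}
    score = lambda p: sum(tok in responsive for tok in p) / len(p)
    seeds = [["newton", "published", "the", "theory"],
             ["the", "theory", "was", "published"]]
    vocab = ["newton", "einstein", "published", "theory", "the", "in", "1915"]
    best = evolve(seeds, score, vocab)
    print(best, score(best))
```

Because the top candidates are carried over unchanged each generation, the best score never decreases, which is the property that lets repeated rounds push a passage toward higher activation. A real attack would also need a fluency constraint, which this sketch omits.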

Experimental results on datasets including SQuAD 2.0, TriviaQA, and WikiQA show that NeuroGenPoisoning achieves a Parametric Overwrite Success Rate (POSR) of over 90% across models like LLaMA-2-7b, Vicuna-7b/13b, and Gemma-7b. Before optimization, the initial passages succeeded only about 40-50% of the time; the neuron-guided search raised that rate to over 90%. The method also handles internal-external knowledge conflicts, where the model's stored facts resist override, by progressively shifting neuron activations to favor adversarial content. Perplexity measurements indicate that the generated passages remain linguistically natural, making them stealthy and hard to detect.
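To make the two metrics above concrete, here is a small sketch of how they might be computed. The article does not give the exact definition of POSR, so the version below is an assumption: it counts, among questions the model answers correctly with no passage supplied, the fraction where the poisoned output contains the attacker's target answer. The `Trial` record and both function names are hypothetical. Perplexity uses its standard definition, the exponential of the negative mean token log-probability.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Trial:
    correct_answer: str      # ground-truth answer
    target_answer: str       # false answer the attacker wants
    closed_book_output: str  # model output with no external passage
    poisoned_output: str     # model output with the adversarial passage


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace for lenient substring matching."""
    return " ".join(text.lower().split())


def posr(trials: Iterable[Trial]) -> float:
    """Assumed POSR: share of known facts overwritten by the target answer."""
    # Only questions the model already answers correctly can be "overwritten".
    eligible = [t for t in trials
                if _norm(t.correct_answer) in _norm(t.closed_book_output)]
    if not eligible:
        return 0.0
    overwritten = sum(_norm(t.target_answer) in _norm(t.poisoned_output)
                      for t in eligible)
    return overwritten / len(eligible)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    trials = [
        Trial("Albert Einstein", "Isaac Newton", "Albert Einstein", "Isaac Newton"),
        Trial("Paris", "Lyon", "Paris", "Paris"),
        Trial("Canberra", "Sydney", "Sydney", "Sydney"),  # not known, so excluded
    ]
    print(posr(trials))
    print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity means the passage looks more like ordinary text to a language model, which is why it serves as a rough proxy for how easily a filter could flag the passage.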

In real-world terms, this vulnerability could affect AI applications in customer service, education, and information retrieval, where false outputs might spread misinformation or cause errors in critical decisions. For instance, a poisoned AI assistant in healthcare or finance could provide incorrect advice based on manipulated data. The research underscores the need for robust defenses in retrieval-augmented generation systems to prevent such exploits.

Limitations of the study include the assumption of white-box access to model gradients and activations, which may not always be available in practice. The paper notes that extending the method to black-box settings, where internal signals are not accessible, remains a challenge for future work. Additionally, while the attack is highly effective in laboratory conditions, its impact in diverse, uncontrolled environments requires further investigation to understand the full risks and possible mitigation strategies.

About the Author

Guilherme A.