AIResearch AIResearch
Back to articles
Ethics

AI Chatbots Can Be Tricked Into Saying Anything

Researchers demonstrate how black-box dialogue systems can be manipulated to produce specific outputs, revealing critical vulnerabilities in AI safety

AI Research
November 13, 2025
2 min read
AI Chatbots Can Be Tricked Into Saying Anything

As artificial intelligence chatbots become increasingly integrated into daily life, from customer service to personal assistants, their potential for manipulation raises serious concerns. A new study reveals that even sophisticated neural dialogue models can be systematically tricked into producing specific responses, including inappropriate content, through carefully crafted inputs.

The researchers discovered that state-of-the-art black-box dialogue systems can be manipulated to generate outputs containing specific target words or sentences. Using their proposed framework called Target Dialogue Generation Policy Network (TDGPN), they successfully manipulated a well-trained Seq2Seq dialogue model in 65% of cases for common words and 30% for malicious words. The system achieved success rates above 85% when trying to generate responses similar to target sentences.

The team employed reinforcement learning to overcome the challenge of manipulating systems without access to their internal parameters. The TDGPN framework treats sentence generation as a sequential decision-making process, where the policy network iteratively generates tokens guided by specific objectives. Unlike traditional gradient-based methods that struggle with discrete text generation, this approach uses REINFORCE-style estimators and Monte Carlo simulations to estimate rewards without requiring backpropagation through the target model.

Experimental results showed the method could successfully manipulate dialogue systems across different scenarios. For word manipulation tasks, the system required an average of 12.64 iterations to succeed with common words and 38.73 iterations for malicious words. In case studies, the framework generated inputs like 'start eat' that led the model to respond 'not going to eat shit' and 'fat i'm too classy' that produced 'not fat ass' as output. The generated inputs were notably smooth and grammatical, unlike previous methods that often produced nonsensical text.

This vulnerability matters because manipulated AI systems could be weaponized for misinformation campaigns, social engineering attacks, or generating harmful content at scale. As chatbots handle increasingly sensitive domains like healthcare and finance, the ability to control their outputs poses significant ethical and security risks. The study highlights that current safety measures may be insufficient against systematic manipulation attempts.

The research acknowledges limitations in understanding why some manipulation attempts succeed while others fail, and the framework's effectiveness may vary across different model architectures. The study focused on single-turn dialogue systems, leaving multi-turn interactions and other text generation tasks for future investigation.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn