AIResearch AIResearch
Back to articles
Data

AI Shields Protect Secret Prompts from Hackers

A new method uses AI to automatically generate protective text that blocks attempts to steal sensitive instructions from language models, keeping proprietary data safe without slowing down performance.

AI Research
March 26, 2026
3 min read
AI Shields Protect Secret Prompts from Hackers

In the world of artificial intelligence, the secret instructions that guide large language models (LLMs) like ChatGPT are often the most valuable part of a company's technology. These system prompts contain proprietary logic, business rules, and sensitive details that define how an AI behaves, making them a prime target for theft through prompt extraction attacks. Adversaries can craft clever queries to trick models into revealing these hidden instructions, posing significant security and privacy risks. A new research paper introduces a framework called Prompt Sensitivity Minimization (PSM) that offers a practical solution to this growing problem, providing a way to harden prompts against extraction without compromising their intended functionality.

The researchers found that by appending a short, optimized text shield to the original system prompt, they could significantly reduce how much sensitive information leaks out when models are probed by attackers. This shield acts as a protective layer that deflects adversarial queries while preserving the model's ability to perform its normal tasks. In experiments, PSM reduced attack success rates to as low as 0-6% across various test scenarios, outperforming existing baseline defenses like simple guardrail instructions or output filters. For example, on the Synthetic System Prompts dataset, PSM achieved a Judge Match attack success rate of just 0-8% for GPT-5-mini, compared to 42% with no defense, demonstrating its effectiveness in preventing both exact and paraphrased reproductions of prompts.

Ology behind PSM frames prompt hardening as a utility-constrained optimization problem, where the goal is to minimize leakage while maintaining task utility above a specified threshold. The researchers used an LLM-as-optimizer approach, where a separate language model generates and refines candidate shields through an iterative, evolutionary-style process. This involves evaluating each shield against a suite of 50 adversarial queries that combine strategies like distractors, repetition requests, and formatting commands to simulate strong attacks. The optimization loop selects the best-performing shields and uses them to guide the generation of new candidates, leveraging the LLM's linguistic capabilities to navigate the semantic space of possible defenses efficiently.

From the paper show that PSM consistently outperforms other defenses across multiple models and datasets. For instance, on the UNNATURAL dataset with GPT-4.1-mini, PSM reduced the Judge Match attack success rate from 78% with no defense to just 4% against the Liang attack suite. Importantly, also preserved utility, with shielded models often matching or exceeding baseline performance on benign tasks, as shown in Table 3 where utility scores ranged from 99.73% to 114.76% across different models. This indicates that the shields do not degrade the model's intended functionality, addressing a key trade-off in security measures.

Of this research are significant for developers and companies using LLM APIs, as it provides a lightweight, black-box defense that requires no access to model internals. By automatically generating shields that can withstand diverse attack types, PSM helps protect intellectual property and sensitive data in real-world applications without adding computational overhead. However, the paper notes limitations, such as the computational intensity of the optimization process and potential gaps in defense against unseen attack families. Future work could extend PSM to handle broader threats like jailbreaks or improve efficiency in search s, but for now, it offers a robust step forward in securing AI systems against prompt extraction risks.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn