As artificial intelligence systems become more integrated into sensitive domains like healthcare, finance, and government, their ability to handle confidential information securely is under intense scrutiny. A new study from researchers at the Korea Advanced Institute of Science and Technology introduces RedacBench, a comprehensive benchmark designed to evaluate how well language models can redact, or selectively remove, sensitive information from text based on specific security policies. The findings reveal a persistent tension: while advanced models can improve security by removing more sensitive content, they often do so at the expense of preserving the text's original meaning and utility, creating a trade-off that current technology struggles to resolve.
The core finding of the research is that even state-of-the-art language models, such as GPT-5-mini, achieve only moderate success in redaction tasks. In experiments, GPT-5-mini using an adversarial redaction strategy with two iterations removed 80.9% of sensitive propositions (units of inferable information) but preserved just 37.6% of non-sensitive content. This indicates that as models become more aggressive in eliminating sensitive details, they also strip away valuable, harmless information, compromising the text's overall usefulness. The study evaluated 11 popular language models across different redaction strategies, and none managed to excel simultaneously in both security and utility, underscoring a fundamental limitation in current AI capabilities.
To assess redaction performance, the researchers developed a novel methodology centered on proposition-based evaluation. They constructed RedacBench from 514 human-authored texts sourced from individual, corporate, and government contexts, paired with 187 security policies that define what constitutes sensitive information. Across these texts, they extracted 8,053 annotated propositions in total, where a proposition is a minimal unit of factual information that can be inferred from the text, including implicit details not explicitly stated. The evaluation framework then measures two key metrics: a security score, which quantifies the proportion of sensitive propositions successfully removed, and a utility score, which measures how much non-sensitive information is preserved. This approach moves beyond simple keyword masking to assess whether sensitive information remains inferable after redaction, providing a more realistic test of privacy protection.
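The two metrics can be illustrated with a short Python sketch. This is not the paper's implementation: `still_inferable` is a hypothetical stand-in for whatever inference check the evaluation actually uses (for example, an entailment-style judgment by a model), and the exact formulation in the paper may differ.

```python
def redaction_scores(propositions, still_inferable):
    """Compute security and utility scores over annotated propositions.

    propositions: list of (proposition_text, is_sensitive) pairs,
        as annotated in the benchmark.
    still_inferable: hypothetical callable that returns True if the
        proposition can still be inferred from the redacted text.
    """
    sensitive = [p for p, is_sensitive in propositions if is_sensitive]
    non_sensitive = [p for p, is_sensitive in propositions if not is_sensitive]

    # Security: fraction of sensitive propositions no longer inferable.
    removed = sum(1 for p in sensitive if not still_inferable(p))
    security = removed / len(sensitive) if sensitive else 1.0

    # Utility: fraction of non-sensitive propositions still inferable.
    preserved = sum(1 for p in non_sensitive if still_inferable(p))
    utility = preserved / len(non_sensitive) if non_sensitive else 1.0

    return security, utility
```

Framed this way, the reported trade-off is concrete: a redaction that pushes `security` toward 1.0 by deleting aggressively will, unless it is precise at the proposition level, also drag `utility` down by removing harmless information.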
The results, detailed in Table 3 and Figure 2 of the paper, show distinct patterns across different redaction strategies. Masking, a traditional approach based on keyword matching, performed consistently across models but with limited security scores, such as 41.8% for GPT-5-mini, suggesting it has reached a performance ceiling. In contrast, adversarial redaction, which uses model-based rewriting, showed clearer improvements with more capable models, though it exacerbated the security-utility trade-off. For instance, iterative redaction, in which the model is repeatedly applied to its own output (sketched below), boosted security but further reduced utility, as seen with GPT-4.1-mini achieving 75.2% security after three iterations but only 47.0% utility. The study also found that open-source models like Qwen3-4B-2507 can compete with proprietary ones when using advanced strategies, and that iterative strategies can sometimes compensate for model scale: GPT-4.1-mini with seven iterations matched GPT-5's performance with two.
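The iterative strategy can be pictured as a rewrite loop in which an adversarial pass looks for sensitive propositions that are still inferable and feeds them back into the next rewrite. The sketch below is illustrative only, under assumed interfaces: `redact_once` and `find_leaks` are hypothetical stand-ins, since the paper's exact prompts and pipeline are not reproduced here.

```python
def iterative_redaction(text, policy, redact_once, find_leaks, max_iters=3):
    """Illustrative loop for iterative (adversarial) redaction.

    redact_once(text, policy, leaks): hypothetical language-model call that
        rewrites the text to remove policy-sensitive content, guided by the
        leaks detected so far.
    find_leaks(text, policy): hypothetical adversarial pass that lists
        sensitive propositions still inferable from the current text.
    """
    current = text
    for _ in range(max_iters):
        leaks = find_leaks(current, policy)
        if not leaks:  # nothing sensitive remains inferable; stop early
            return current
        # Each extra pass tends to raise the security score
        # while eroding utility, matching the trend in the results.
        current = redact_once(current, policy, leaks)
    return current
```

The early exit when no leaks remain is one plausible design choice; running a fixed number of iterations regardless, as the per-iteration results in the paper suggest, would trade extra utility loss for a higher security floor.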
These findings have significant implications for real-world applications where AI systems handle sensitive data, such as legal documents, corporate communications, or medical records. The inability to balance security and utility means that automated redaction tools may either leave sensitive information exposed or render texts unusable by removing too much context. The researchers caution against deploying such systems in high-stakes domains without human oversight, as noted in the paper's ethics statement. RedacBench serves as a practical tool for industries to validate AI safety, offering a standardized way to measure risks beyond simple removal of personally identifiable information. By highlighting these vulnerabilities, the benchmark aims to guide the development of more robust AI systems that can better protect privacy while maintaining text coherence.
Despite its contributions, the study acknowledges several limitations. RedacBench relies on empirical evaluation rather than formal privacy guarantees like differential privacy, meaning it simulates realistic inference attacks but does not provide mathematical certainty. Additionally, there is a risk of data contamination if evaluation models were pre-trained on the source texts, potentially skewing results. The benchmark also does not fully address dynamic scenarios, such as interactive redaction where context evolves over time, as shown in an experiment where security scores dropped when models processed text sequentially. To address these issues, the researchers have released an interactive web-based playground for customization and further testing, encouraging the community to build on their work and develop more secure redaction techniques.