AI Images Misalign with Text, Creating Security Risks

AI systems that generate or edit images from text prompts are increasingly built into creative and professional tools, but a new study shows they can produce unsafe content even when the text instruction is harmless. The cause is a misalignment between how these models handle text and image inputs, and it poses a significant security threat in applications where users rely on AI to edit or generate visual content and do not expect inappropriate results.

The researchers discovered that multi-modal AI models, which process both text and images, can be manipulated into producing Not-Safe-For-Work (NSFW) content, such as sexual or disturbing imagery, by altering only the input image while the text prompt stays benign. This is possible because the models' internal mechanisms do not fully align the textual and visual modalities, which leaves a gap that attackers can exploit. For example, as shown in Figure 4, a harmless-looking input image combined with a safe prompt like 'Transform color clothes to black' can still lead to NSFW outputs, bypassing standard safety checks.

To demonstrate this vulnerability, the team developed a method called Prompt-Restricted Multi-modal Attack (PReMA). Unlike previous attacks that focused on modifying text prompts, PReMA manipulates the image input directly without changing the text. It uses an optimization process to subtly alter the image pixels, guiding the AI to generate targeted unsafe content. The approach builds on the model's diffusion process, where cross-attention layers handle text and image data, but the image's influence can override the text's intent. As outlined in the methodology, PReMA iteratively adjusts the image to minimize differences from a target NSFW image, effectively 'tricking' the AI into producing unwanted outputs.
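The iterative adjustment described above can be illustrated with a deliberately small sketch. It is not the researchers' implementation: the linear `toy_editor` stands in for a real diffusion model, and the squared-error loss on the model output, the signed-gradient step, and the pixel budget `eps` are all assumptions chosen for illustration. The only point it shows is that the image is optimized while no prompt is changed.

```python
import numpy as np


def toy_editor(image, weights):
    """Hypothetical stand-in for an image-editing model: a fixed linear map."""
    return weights @ image


def target_loss(image, target, weights):
    """Squared distance between the toy model's output and the target output."""
    residual = toy_editor(image, weights) - target
    return float(residual @ residual)


def optimize_image(image, target, weights, eps=0.05, step=0.01, iters=50):
    """Signed-gradient descent on the image only; no text prompt is involved."""
    best_delta = np.zeros_like(image)
    best_loss = target_loss(image, target, weights)
    delta = best_delta.copy()
    for _ in range(iters):
        residual = toy_editor(image + delta, weights) - target
        grad = 2.0 * weights.T @ residual      # d(loss)/d(delta) for the linear toy model
        delta = delta - step * np.sign(grad)   # move against the gradient
        delta = np.clip(delta, -eps, eps)      # assumed per-pixel budget keeps the change subtle
        delta = np.clip(image + delta, 0.0, 1.0) - image  # keep pixels in [0, 1]
        loss = target_loss(image + delta, target, weights)
        if loss < best_loss:                   # remember the best perturbation seen so far
            best_loss, best_delta = loss, delta.copy()
    return image + best_delta


# Toy data: a flat gray "image" and a slightly brighter target output.
weights = np.eye(16)
clean = np.full(16, 0.5)
target = np.full(16, 0.6)
adversarial = optimize_image(clean, target, weights)
print(target_loss(clean, target, weights), target_loss(adversarial, target, weights))
```

In this toy setting the loss falls while every pixel stays within `eps` of the original. A real attack would have to backpropagate through the full diffusion pipeline, which this sketch does not attempt.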

Evaluations on inpainting and style transfer tasks confirm PReMA's effectiveness. In inpainting, where a model fills in masked parts of an image, PReMA induced NSFW content with success rates of up to 64% across models such as Stable Diffusion and Kandinsky, as detailed in Table 1. In style transfer, which changes an image's appearance based on a prompt, success rates reached 89% in some cases. The researchers also tested transferability and found that attacks crafted for one model sometimes worked on others, though performance varied with architectural differences. Figures 6 and 7 illustrate how adversarial images lead to unsafe outputs even with unseen, benign prompts, which shows that the attack does not depend on one specific prompt.
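For readers unfamiliar with the metric, a success rate of this kind is typically the share of attack attempts whose output is judged unsafe. The sketch below assumes that definition; the article does not say how the study scored individual outputs, and the outcomes listed are made-up illustration data, not results from the paper.

```python
def attack_success_rate(flags):
    """Fraction of attempts whose output was flagged as unsafe."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one attempt")
    return sum(bool(f) for f in flags) / len(flags)


# Hypothetical outcomes: 25 attempts, 16 of them flagged by some NSFW checker.
outcomes = [True] * 16 + [False] * 9
rate = attack_success_rate(outcomes)
print(f"{rate:.0%}")  # prints "64%"
```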

This misalignment has real-world implications for AI safety, particularly in apps that use fixed prompts for image editing, such as social media filters or design tools. Users might assume that a safe text input guarantees appropriate results, but this research shows that manipulated images can undermine that trust and open the door to misuse. It underscores the need for defensive measures that address both text and image modalities, because current safety checkers, which often scan only the text prompt or the output image, are ill-equipped to detect such attacks. For instance, post-hoc checkers that analyze generated images were bypassed in up to 88% of cases when PReMA included an additional loss term to evade detection, as noted in Table 5.

However, the study has limitations. PReMA's effectiveness is constrained by poor transferability between some AI model architectures, meaning it may not work equally well across all systems. Additionally, as output sequence length grows, the difficulty of the attack increases, limiting its applicability in more complex scenarios. The researchers suggest that future work should focus on improving attack efficiency and developing better perturbation techniques to address these gaps, while also urging the AI community to prioritize modality alignment in safety efforts.


About the Author

Guilherme A.