
AI Agents Learn to Hack Visual AI Safeguards

A new AI method can trick advanced vision-language models into providing harmful instructions by strategically manipulating images and conversations, exposing a critical gap in current safety systems.

AI Research
March 26, 2026
4 min read

Artificial intelligence systems that combine language and vision capabilities are increasingly being deployed in sensitive applications, from medical diagnostics to security analysis. However, a new study reveals a fundamental vulnerability in these multimodal models: they can be manipulated into generating harmful content by adversaries who exploit the models' own visual reasoning abilities. Researchers have developed an automated attack framework that systematically bypasses safety filters, achieving success rates as high as 46.3% against state-of-the-art models like Claude 4.5 Sonnet and 13.8% against GPT-5, where previous attack methods largely fail. This highlights a critical safety gap: text-based defenses are insufficient against threats that emerge from visual understanding.

The core finding of the research is the identification of a new type of vulnerability called Visual Exclusivity (VE). Unlike previous attacks that hide malicious text within images, VE attacks rely on the model's ability to reason about complex visual content, such as technical schematics or floor plans, to fulfill harmful goals. For example, an attacker might upload a weapon schematic and ask 'how to assemble this'—the text query is innocuous, but harm materializes only when the model correctly interprets the spatial relationships in the image. The researchers formalized this threat with three criteria: the goal cannot be achieved by text alone, it is achievable with visual input, and visual information cannot be losslessly compressed into text descriptions. To benchmark this, they created VE-Safety, a dataset of 440 human-curated instances across 15 safety categories, using real-world technical imagery where visual understanding is a prerequisite for harmful output.
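The three VE criteria described above can be read as a simple conjunctive filter over candidate benchmark items. Here is a minimal sketch of that logic; the class and field names are illustrative assumptions, not taken from the released VE-Safety benchmark:

```python
from dataclasses import dataclass

# Hypothetical encoding of the paper's three Visual Exclusivity (VE)
# criteria; field names are illustrative, not from the actual dataset.
@dataclass
class VECandidate:
    achievable_text_only: bool      # criterion 1: must be False
    achievable_with_image: bool     # criterion 2: must be True
    losslessly_describable: bool    # criterion 3: must be False

def is_visually_exclusive(c: VECandidate) -> bool:
    """A goal is VE iff it requires the visual input and that
    input cannot be fully replaced by a text description."""
    return (not c.achievable_text_only
            and c.achievable_with_image
            and not c.losslessly_describable)
```

Under this reading, a weapon schematic paired with "how to assemble this" passes the filter, while any query a model could answer from text alone fails criterion 1 and is excluded.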

To exploit Visual Exclusivity, the researchers proposed Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis. Instead of generating queries sequentially, MM-Plan trains an attacker planner to synthesize a comprehensive, multi-turn strategy in a single inference pass. This includes specifying a persona, narrative context, and execution sequence with image manipulations like cropping or blurring sensitive regions. The planner is optimized using Group Relative Policy Optimization (GRPO), which allows it to self-discover effective strategies without human supervision by sampling diverse plans and updating based on a composite reward signal from a judge model. The reward incorporates metrics for attack effectiveness, dialogue progress, and goal adherence, enabling the agent to learn from sparse feedback using only a compact open-weight model (Qwen3-VL-4B) as the planner.
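The training loop described above follows the standard GRPO pattern: sample a group of candidate plans, score each with a composite judge reward, and standardize rewards within the group to obtain advantages. A minimal sketch of that core computation, with made-up reward weights and scores (the paper's actual coefficients are not specified here):

```python
import statistics

def composite_reward(effectiveness: float, progress: float,
                     adherence: float,
                     w: tuple = (0.5, 0.25, 0.25)) -> float:
    """Weighted combination of judge-model scores for attack
    effectiveness, dialogue progress, and goal adherence.
    The weights are illustrative assumptions."""
    return w[0] * effectiveness + w[1] * progress + w[2] * adherence

def grpo_advantages(rewards: list) -> list:
    """Group-relative advantages: standardize each sampled plan's
    reward against the group mean and std (the core of GRPO)."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sd for r in rewards]

# A group of K=4 sampled plans scored by the judge (made-up numbers):
scores = [(0.9, 0.8, 0.7), (0.2, 0.5, 0.6),
          (0.6, 0.6, 0.9), (0.1, 0.3, 0.4)]
rewards = [composite_reward(*s) for s in scores]
advantages = grpo_advantages(rewards)
```

Plans with above-average composite reward receive positive advantage and are reinforced; since the signal is relative within each sampled group, the planner can learn from sparse judge feedback without a separate value model.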

The experiments demonstrate that MM-Plan significantly outperforms existing attack methods. On the VE-Safety benchmark, it achieved a 46.3% attack success rate against Claude 4.5 Sonnet, nearly doubling the strongest baseline, and 13.8% against GPT-5, where baselines failed to exceed 3.1%. Across eight frontier models, including open-weight ones like Qwen3-VL-8B and proprietary systems like Gemini 2.5 Pro, MM-Plan consistently achieved 2–5 times higher success rates. Analysis showed that the agent adapts its strategy to the target's robustness, using fewer turns on open-weight models but constructing more elaborate personas for stricter guardrails. For instance, against Claude 4.5 Sonnet, MM-Plan achieved higher success with nearly half the conversation length of iterative baselines. The framework also generalized well, maintaining consistent performance on unseen queries and transferring effectively across models, indicating that it discovers universal red-teaming strategies rather than overfitting.

The implications of this research are profound for AI safety and real-world applications. As multimodal models are integrated into critical systems, such as healthcare, where they might analyze medical imagery, or security, where they interpret surveillance footage, the ability to manipulate them through visual reasoning poses significant risks. The study reveals that current safety mechanisms, which often rely on text-centric alignment or simple image filters, are inadequate against agentic adversaries that combine visual exploitation with strategic planning. For everyday users, this means AI assistants could be tricked into providing dangerous instructions based on seemingly benign images, undermining trust in these technologies. The researchers emphasize that their work is diagnostic, aimed at helping developers identify and patch vulnerabilities, but it also underscores the need for more robust, multimodal safety evaluations.

Despite its effectiveness, the study acknowledges several limitations. The attack success rates, while high, are not universal; some models, particularly proprietary ones with strong safety training, still resist certain categories of attacks. The research also notes that using frontier proprietary models as attackers often fails due to safety refusals, limiting practical deployment. Additionally, the VE-Safety dataset, though comprehensive, may not cover all possible visual threats, and the framework's reliance on specific image operations could be mitigated by future defenses. The researchers caution that their methods could be misused, so they plan to release only the benchmark and evaluation code, withholding the trained planner weights to balance reproducibility with safety. These limitations highlight the ongoing arms race in AI security and the need for continuous improvement in both attack and defense strategies.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn