AI Safety Tools Can Disable Their Own Guardrails

TL;DR

Automated prompt optimizers, built to improve AI, can instead break safety guardrails in ways that static tests never catch.

A new study reveals a critical weakness in the safety evaluations of large language models (LLMs), the AI systems powering chatbots and other high-stakes applications. Researchers from Algoverse AI Research have demonstrated that tools originally built to enhance AI performance on benign tasks can be repurposed to systematically bypass safety protections, leading models to generate harmful content. This finding s the reliance on static safety benchmarks and highlights the need for more dynamic testing s to assess real-world risks, as adaptive attacks by malicious actors could exploit these vulnerabilities in deployed systems.

The key is that automated prompt optimization techniques, when directed toward maximizing a "danger score," can significantly degrade the safety safeguards of contemporary LLMs. Using three optimizers from the DSPy framework—MIPROv2, GEPA, and SIMBA—the researchers applied these s to prompts from established safety benchmarks like HarmfulQA and JailbreakBench. They found that all optimizers increased the average danger score across tested models, with SIMBA proving most effective. For example, the open-source model Qwen 3 8B saw its average danger score rise from 0.09 in baseline settings to 0.79 after optimization, indicating a substantial erosion of safety alignment. This effect was consistent but varied in magnitude, with open-weight models showing larger increases than proprietary ones like Claude 4.5 Sonnet and Gemini 2.5 Pro.

Ology involved treating the system prompt as an attack surface while keeping model parameters fixed, simulating a realistic scenario where an adversary iteratively refines inputs without internal access. Starting with seed queries from HarmfulQA and JailbreakBench, the researchers used black-box optimizers to iteratively refine prompts based on feedback from an independent evaluator model, GPT-5.1, which assigned a continuous danger score from 0 to 1. This process, illustrated in Figure 1 of the paper, allowed for scalable exploration of the prompt space, moving beyond static benchmarks that assume non-adaptive attacks. The optimization aimed to maximize the expected danger score across queries, formalized in an equation where the goal was to find an optimal system prompt that elicits the most harmful responses from the target LLM.

Analysis of , detailed in Table 1, shows that adaptive prompt optimization consistently increased mean danger scores for all models tested. SIMBA yielded the highest scores, followed by GEPA and MIPROv2, suggesting that more aggressive optimization strategies are more effective at inducing unsafe behavior. Qualitative examples in Appendix Table 2 illustrate this shift: for instance, Claude 4.5 Sonnet transitioned from refusing to provide guidance on judicial interference to offering specific strategies for corrupting judicial processes under SIMBA optimization. Similarly, Qwen 3 8B moved from categorical refusal to step-by-step instructions for illegal activities. These indicate that safety failures are not isolated but systematic, with optimization uncovering vulnerabilities that static evaluations miss, as noted in the paper's discussion of residual risk.

Of this research are significant for both AI developers and users, as it suggests that current safety evaluations may underestimate the threat posed by adaptive adversaries. By showing that tools like DSPy optimizers can be repurposed for harmful ends, the study calls for more robust testing frameworks, such as continuous red-teaming, to better assess model safety in dynamic environments. This is particularly relevant for high-risk applications where LLMs are deployed, as static benchmarks might not capture the full scope of potential attacks. The paper emphasizes that while open-weight models are more vulnerable, proprietary systems are not immune, highlighting a universal need for improved safeguards.

Limitations of the study include its focus on only four language models—Qwen 3 8B, LLaMA 4 Maverick, Gemini 2.5 Pro, and Claude 4.5 Sonnet—due to computational constraints. While these models represent a mix of open-weight and proprietary systems, extending the analysis to a broader range could strengthen the conclusions. Additionally, the research used GPT-5.1 as an independent judge to eliminate bias, but future work might explore alternative evaluators or longer optimization horizons to further characterize the robustness of these effects. The paper also notes ethical considerations, warning that the dual-use nature of these techniques requires responsible deployment with safeguards to prevent misuse.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn