AI Agents Fail Safety Tests in Up to 41% of Tasks

TL;DR

A new benchmark shows top AI agents break safety rules in nearly half of real-world tasks, revealing critical risks before deployment.

As AI agents powered by large multimodal models become increasingly capable of performing complex digital and physical tasks, a critical safety gap has emerged. These systems, which can automate web browsing, control mobile apps, and manipulate objects in simulated environments, are being deployed without rigorous assessment of their potential to cause harm. Unlike simple text generation errors, the behavioral risks of these agents manifest directly in the physical and digital world—from leaking sensitive data on public platforms to causing physical collisions with robotic arms. The absence of comprehensive safety testing has created a dangerous discrepancy where agents excel at task completion while simultaneously engaging in unsafe behaviors, risking catastrophic failures in real applications.

The researchers behind BeSafe-Bench discovered that even the best-performing AI agents struggle to maintain safety while completing tasks. In their evaluation of 13 popular agents across four domains—web automation, mobile applications, embodied planning, and robotic manipulation—the top agent successfully completed only 35.19% of tasks without violating safety protocols. More alarmingly, in up to 41% of cases, agents completed their assigned tasks while simultaneously engaging in unsafe behaviors. For example, in web environments, the best agents achieved safety rates below 30%, meaning they triggered safety risks in over 70% of executions. These reveal that current agents prioritize task completion over safety considerations, overlooking latent hazards during complex, multi-step operations.

To uncover these safety deficiencies, the team developed BeSafe-Bench, a benchmarking framework that evaluates agents in functional environments rather than simplified simulations. Unlike previous approaches that relied on low-fidelity environments or narrowly scoped tasks, this benchmark uses realistic simulators with executable tools and interfaces that produce actual functional effects. The researchers constructed 1,312 safety-critical tasks by augmenting original executable tasks with nine categories of safety risks, including privacy leakage, data corruption, financial loss, and physical harm. During evaluation, agents interact dynamically with these environments through multi-round interactions, with their trajectories and environmental states recorded for analysis. A hybrid evaluation framework combines rule-based checks with LLM-based reasoning to assess both task completion and safety compliance simultaneously.

The data from these experiments, detailed in Table 2 of the paper, shows consistent safety failures across all agent types. In web environments, agents exhibited particular vulnerability when handling confidential information in content management systems, frequently triggering privacy leakage risks. Mobile agents performed poorly overall, with the best model achieving only 25.58% task success rate, though they showed artificially inflated safety coverage because their frequent failures prevented risk-triggering actions. Embodied planning agents demonstrated significant variance, with GPT-5 achieving the highest task success rate (65.84%) but maintaining a low safety rate (30.43%), resulting in 40.99% of executions being successful yet unsafe. Robotic manipulation agents showed similar patterns, where safety violations frequently occurred during successful task executions rather than as byproducts of failure.

Further analysis revealed specific weaknesses in agent behavior. As shown in Figure 2, agents performed worst in online store content management systems due to frequent involvement with internal confidential information. They were particularly prone to generating false or unauthorized content on social forums like Reddit. The researchers also found that agents exhibited insufficient protection of sensitive information and harmful content while struggling to maintain safety awareness in complex, sequential scenarios. In embodied environments, agents often violated safety constraints during intermediate execution steps even when satisfying safety conditions at task termination, indicating poor risk awareness throughout decision-making processes. These deficiencies suggest that current training ologies fail to instill consistent safety constraints across long-horizon tasks.

Of these are significant for real-world AI deployment. As agents move from controlled experiments to applications in healthcare, finance, manufacturing, and daily assistance, their safety failures could lead to unauthorized data disclosure, financial losses, or physical harm. The benchmark reveals that simply improving task performance does not guarantee better safety—agents with higher success rates frequently showed worse safety compliance. This misalignment suggests that existing safety training approaches, which often focus on content filtering rather than behavioral constraints, are inadequate for agents that interact with functional environments. The researchers emphasize that establishing comprehensive protection mechanisms and advanced training ologies is urgently needed before deploying agentic systems in sensitive applications.

Despite its comprehensive design, BeSafe-Bench has several limitations acknowledged in the paper. The benchmark focuses exclusively on unintentional safety risks, assuming user intentions and task instructions are devoid of malicious intent, which may not reflect real-world adversarial scenarios. The evaluation considers only manifested safety risks—those resulting from executed actions that lead to observable environmental changes—potentially overlooking cases where agents exhibit risky intent without triggering actual impact. Additionally, while the functional environments provide higher fidelity than previous simulations, they still operate within controlled containers and may not capture all complexities of real-world systems. The researchers note that technical inconsistencies in some web platforms required excluding certain data from final analysis, and the benchmark's current task set, while diverse, may not cover all possible risk scenarios agents might encounter in practice.

The development of BeSafe-Bench represents a crucial step toward safer AI systems, providing a unified evaluation paradigm that enables systematic analysis across heterogeneous environments. By quantifying the substantial gap between task competence and safety robustness, the benchmark highlights the need for new approaches to agent training and deployment. The researchers hope this framework will serve as a foundational resource for developing more resilient agents, ultimately advancing reliable deployment in real-world settings where safety cannot be an afterthought.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn