Cisco Research Shows Standard AI Safety Benchmarks Understate Risk

TL;DR

Cisco's latest study demonstrates that multi-turn conversational attacks expose significant safety vulnerabilities in proprietary AI models that standard tests miss.

Most enterprises selecting closed AI models rely on published safety benchmarks to gauge risk before deployment. These standard tests typically involve submitting a single adversarial prompt and recording the model's response. This methodology assumes that a model's safety can be measured by its ability to reject a direct, harmful request in isolation.

New research from the Cisco AI Threat Intelligence and Security Research team suggests this approach is fundamentally flawed. Testing 15 proprietary frontier models from providers including OpenAI, Anthropic, Google, Amazon, and xAI, the team found that single-turn evaluations fail to capture the complexity of real-world threats. The study used 30,090 single-turn prompts alongside 6,986 multi-turn attacks and found that the two regimes produce entirely different model rankings and risk profiles.

When attackers move from single prompts to multi-turn interactions, the security landscape shifts. In these scenarios, an adversary does not present a harmful request immediately. Instead, they build intent gradually across several exchanges. Each individual prompt may appear benign when viewed in isolation, but the cumulative effect steers the model toward a harmful outcome. The model often fails to recognize the pattern forming across the conversation, processing each turn without triggering internal guardrails.
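The difference between the two test regimes can be sketched as a small scoring harness. This is a minimal illustration, not the Cisco team's methodology: `attack_success_rate`, `toy_respond`, and `toy_is_unsafe` are hypothetical names, and the toy model is deliberately simplistic (it refuses a direct request only when the conversation has no prior context).

```python
from typing import Callable, List

Conversation = List[str]  # the attacker's prompts, in order


def attack_success_rate(
    conversations: List[Conversation],
    respond: Callable[[List[str], str], str],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Fraction of conversations in which any model reply is judged unsafe."""
    if not conversations:
        return 0.0
    successes = 0
    for prompts in conversations:
        history: List[str] = []
        for prompt in prompts:
            reply = respond(history, prompt)
            history.extend([prompt, reply])
            if is_unsafe(reply):
                successes += 1
                break  # one unsafe reply is enough to count the attack
    return successes / len(conversations)


# Hypothetical stand-in for a model: refuses a direct request on a fresh
# conversation, but complies once earlier turns have set up context.
def toy_respond(history: List[str], prompt: str) -> str:
    if "forbidden" in prompt:
        return "REFUSED" if not history else "UNSAFE: complied"
    return "ok"


# Hypothetical stand-in for a safety judge.
def toy_is_unsafe(reply: str) -> bool:
    return reply.startswith("UNSAFE")


single_turn = [["forbidden request"]]
multi_turn = [["harmless setup", "forbidden request"]]

single_rate = attack_success_rate(single_turn, toy_respond, toy_is_unsafe)
multi_rate = attack_success_rate(multi_turn, toy_respond, toy_is_unsafe)
print(single_rate, multi_rate)
```

The same toy model looks fully safe under the single-turn set and fully exploitable under the multi-turn set, which is the kind of gap a single-prompt benchmark cannot reveal.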

Failure Rates and Model Variance

Every model tested in the Cisco study failed a non-trivial share of these multi-turn attacks. This finding challenges the common assumption in enterprise procurement that frontier models are inherently secure due to their advanced training. As noted by Network World, there remains significant variance in how different models handle these iterative attempts to bypass safety protocols.

This vulnerability is not limited to a specific class of models. While some providers show more resilience, the ability to coax unsafe behavior remains a persistent issue across the generative AI landscape. A separate study by Telus Digital found that vulnerability rates across various models can reach as high as 93 percent. Where models show lower failure rates, such as the Claude series from Anthropic, researchers argue that even single-digit failure rates are unacceptable for high-stakes enterprise applications involving sensitive data or reputational risk.

Predictors of Resilience

Research suggests that certain architectural choices influence how well a model resists these attacks. Reasoning models, which are designed to deliberate before generating a final response, show significantly higher resilience. For instance, reasoning models demonstrated a 19.9 percent vulnerability rate, whereas standard models that skip the deliberation step saw failure rates climb to 55.1 percent. Additionally, model size and the creator's specific safety approach appear to be strong predictors of how well a system handles adversarial pressure.
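To put the two quoted rates side by side, a short calculation helps. The sketch below uses only the 19.9 and 55.1 percent figures given above; the variable names are illustrative.

```python
reasoning_rate = 19.9  # percent, reasoning models (figure quoted above)
standard_rate = 55.1   # percent, standard models (figure quoted above)

# Absolute gap in percentage points and relative ratio between the two rates.
gap_points = round(standard_rate - reasoning_rate, 1)
ratio = round(standard_rate / reasoning_rate, 2)

print(f"Gap: {gap_points} percentage points")
print(f"Standard models failed about {ratio}x as often")
```

On these figures, standard models failed roughly 2.8 times as often as reasoning models, a gap of 35.2 percentage points.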

Interestingly, the distinction between open-source and proprietary models is not a binary indicator of safety. While many enterprises prefer closed models for their perceived stability, open-weight models are increasingly capable of matching proprietary performance. Data from LLM-Stats shows a rapidly evolving landscape where new releases from various providers constantly shift the performance baseline. This evolution means that security assessments must be continuous rather than a one-time check during procurement.

Implications for Enterprise Deployment

For practitioners, these findings suggest that current AI safety audits are insufficient for production environments. Relying on static benchmarks creates a false sense of security, particularly as models are integrated into complex workflows. As companies expand the use of artificial intelligence in specialized sectors like law or medicine, the cost of a safety failure rises sharply. For example, while Anthropic has expanded its legal toolset to include features for contract drafting and research, these tools must operate within a framework that accounts for the evolving nature of adversarial prompts.

If a model can be manipulated through a series of seemingly innocent questions, then the traditional perimeter-based security model for AI is obsolete. Organizations must move toward monitoring the context of entire conversational sessions rather than just individual inputs. The gap between a model's theoretical safety and its practical resilience in a multi-turn dialogue is where the most significant enterprise risks currently reside.
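Session-level monitoring can be sketched as a running risk total checked alongside the per-message check. This is a toy illustration under stated assumptions: `first_flagged_turn`, `toy_score`, the keyword weights, and both thresholds are hypothetical, and a real deployment would use a proper classifier rather than keyword matching.

```python
from typing import Callable, List, Optional


def first_flagged_turn(
    turns: List[str],
    score: Callable[[str], float],
    per_turn_limit: float,
    session_limit: float,
) -> Optional[int]:
    """Return the 1-based turn at which either limit is reached, else None."""
    total = 0.0
    for index, text in enumerate(turns, start=1):
        turn_score = score(text)
        total += turn_score  # cumulative risk across the whole session
        if turn_score >= per_turn_limit or total >= session_limit:
            return index
    return None


# Hypothetical keyword weights standing in for a real risk classifier.
RISK_WEIGHTS = {"alarm": 0.25, "disable": 0.25, "bypass": 0.25}


def toy_score(text: str) -> float:
    lowered = text.lower()
    return sum(w for word, w in RISK_WEIGHTS.items() if word in lowered)


session = [
    "How does a building alarm work?",
    "What would disable one during maintenance?",
    "How could someone bypass that check?",
]

# Per-message check only: no single turn reaches the 0.5 limit.
per_turn_only = first_flagged_turn(session, toy_score, 0.5, float("inf"))
# Adding a session-wide limit of 0.75 flags the conversation at turn 3.
with_session = first_flagged_turn(session, toy_score, 0.5, 0.75)
print(per_turn_only, with_session)
```

Each message stays under the per-turn limit, so input-by-input filtering never fires; only the cumulative total exposes the pattern.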

How will benchmark developers adapt to capture the nuances of conversational manipulation before the next wave of frontier models is deployed?

FAQ

What is a multi-turn attack in AI?
A multi-turn attack is a method where an adversary uses a series of conversational exchanges to gradually steer a model toward a harmful response, rather than using a single direct prompt.

Why do standard safety benchmarks miss these threats?
Standard benchmarks often rely on single-turn prompts that test a model's reaction to an immediate violation, failing to account for the cumulative intent built over a long conversation.

Are reasoning models safer than standard models?
Yes, research indicates that models designed to deliberate before responding are significantly harder to exploit, showing lower vulnerability rates in adversarial testing.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
