Los Angles Wire

Cisco research finds standard AI safety benchmarks miss the real threat

May 29, 2026  Twila Rosenbaum

Enterprises deploying closed AI models have long relied on published safety benchmarks from major labs to guide procurement and deployment decisions. New research from Cisco's AI Threat Intelligence and Security Research team, however, demonstrates that these standard evaluations may systematically understate the actual threat posed by adversarial attacks. The most widely used safety tests submit a single adversarial prompt and record the model's response. Multi-turn attacks, which unfold across a conversation, reveal vulnerabilities that such single-turn evaluations completely miss.

The Study Scope and Methodology

The Cisco team tested 15 closed or proprietary frontier models from OpenAI, Anthropic, Google, Amazon, and xAI using both single-turn and multi-turn adversarial evaluation methods. In total, the researchers ran 30,090 single-turn prompts and 6,986 multi-turn attacks. The findings show that the two evaluation regimes yield starkly different model rankings, failure patterns, and risk profiles. Every model tested failed a non-trivial share of multi-turn attacks, with attack success rates (ASR) ranging from 7.89% to 88.30% under iterative prompting, compared to a single-turn range of 2.19% to 64.91%.

Eight of the 15 models showed an absolute gap greater than 15 percentage points between single-turn and multi-turn performance. Notably, Anthropic's Claude family, which posted the lowest single-turn ASR in the cohort at 2.19% to 3.64%, still reached 11.16% to 16.20% under iterative attack. This indicates that even models with strong single-turn guardrails remain vulnerable when attackers can adapt their approach across multiple exchanges.
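
The gap figures above are differences, in percentage points, between two attack success rates. The following is a minimal sketch of that arithmetic, assuming ASR is simply successful attacks divided by attempts. The model names and counts are hypothetical placeholders, not figures from the Cisco report.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Return ASR as a percentage (assumed definition: successes / attempts)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


# Hypothetical (successes, attempts) pairs for illustration only.
results = {
    "model_a": {"single": (5, 200), "multi": (30, 100)},
    "model_b": {"single": (12, 200), "multi": (14, 100)},
}


def asr(model: str, regime: str) -> float:
    """Look up a model's ASR under one evaluation regime."""
    return attack_success_rate(*results[model][regime])


def gap_points(model: str) -> float:
    """Absolute gap in percentage points: multi-turn ASR minus single-turn ASR."""
    return asr(model, "multi") - asr(model, "single")


# Rank models from lowest ASR (most resistant) to highest under each regime.
rank_single = sorted(results, key=lambda m: asr(m, "single"))
rank_multi = sorted(results, key=lambda m: asr(m, "multi"))

for name in results:
    print(name, f"gap = {gap_points(name):.2f} points")
print("single-turn ranking:", rank_single)
print("multi-turn ranking:", rank_multi)
```

In this made-up data the model that looks safest on single-turn prompts ranks last under multi-turn attack, which is the kind of reordering the report describes.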

Mechanisms of Multi-Turn Attacks

Multi-turn attacks operate by gradually building harmful intent across a conversation. The adversary does not present a harmful request upfront. Instead, each prompt appears benign in isolation while steering toward an unsafe outcome. The model processes each turn without recognizing the pattern forming across the conversation. The Cisco research tested five major strategy families:

  • Crescendo escalation: The attacker incrementally escalates the ask, with each prompt appearing harmless until the full picture emerges. As Amy Chang, head of AI threat and security research at Cisco, explained, “It seems like, oh, benign prompt, benign prompt, benign prompt, but as it builds, you start to put the pieces together.”
  • Refusal reframe: When the model declines a request, the attacker reframes their identity or purpose to push past the refusal. Chang noted that the attacker might say, “No, no, you don’t understand, I’m not a bad person, this is what I need it for.”
  • Role-play and persona adoption: The attacker assumes a character or persona, shifting the conversational framing so the model perceives a different obligation to comply. This strategy family had the highest weighted ASR in the cohort at 29.89%.
  • Contextual ambiguity and misdirection: The attacker uses vague or misleading framing to obscure the true nature of the request, steering the conversation without stating harmful intent directly.
  • Information decomposition and reassembly: The attacker breaks a harmful request into component parts spread across multiple turns, each appearing innocuous in isolation. The model responds to each piece without recognizing the assembled outcome.
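
One way to picture why turn-by-turn checks miss these strategies is to compare a per-turn threshold with a score accumulated over the whole conversation. The sketch below is illustrative only: the per-turn risk scores, the 0.5 threshold, and the noisy-OR combination rule are all assumptions made for this example, not part of Cisco's methodology.

```python
import math


def per_turn_flags(risks, threshold=0.5):
    """Flag each turn on its own, the way a single-turn check would."""
    return [r >= threshold for r in risks]


def conversation_risk(risks):
    """Combine per-turn risks with a noisy-OR: 1 - product of (1 - r)."""
    return 1.0 - math.prod(1.0 - r for r in risks)


# Hypothetical risk scores for four turns that each look mild in isolation.
risks = [0.2, 0.25, 0.3, 0.3]

print("any single turn flagged:", any(per_turn_flags(risks)))
print("conversation-level risk:", round(conversation_risk(risks), 3))
```

No individual turn crosses the threshold here, yet the accumulated score does, which mirrors the "benign prompt, benign prompt" pattern Chang describes.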

Structural Vulnerabilities Across Models

Chang emphasized that the vulnerability revealed by multi-turn attacks is not a bug that can be patched but a fundamental characteristic of how generative AI models work. These systems are probabilistic, trained to predict the most likely next token. That mechanism inevitably produces unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, the problem is compounded because training data is not publicly disclosed, leaving defenders unable to fully audit what the model has learned.

The pattern holds for open-weight models as well. Cisco's earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier, regardless of whether model weights are public or proprietary and regardless of a lab's stated emphasis on safety or capability.

The exposure becomes significantly larger when these models power agentic workflows. Agents have broader access to systems and the ability to conduct actions on behalf of a human user, magnifying the potential damage from a successful attack. “These models are the ones that power agents, and agents have broader access, broader ability to conduct actions on behalf of the human,” Chang said.

Implications for Enterprise Security

For network security professionals, the findings challenge the instinct to simply proxy LLM traffic at the network layer and inspect inputs and outputs using signature-based controls, as with a WAF or IPS. Chang noted that while network-layer inspection remains a valid baseline, LLM security introduces an intent component that traditional approaches cannot address. Natural language does not reduce to known patterns or payload signatures. An agent responding to an instruction to delete a home directory cannot determine from the request alone whether the person asking is authorized or is attempting to manipulate the agent into destructive action.
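
Because the request text alone cannot establish authorization, one common application-layer pattern is to gate destructive agent actions on the authenticated caller's permissions rather than on anything said in the prompt. The following is a minimal sketch of that idea. The `ToolCall` and `Caller` types, the action names, and the permission model are hypothetical and do not come from the Cisco report.

```python
from dataclasses import dataclass

# Hypothetical set of actions that require explicit permission.
DESTRUCTIVE_ACTIONS = {"delete_directory", "drop_table"}


@dataclass(frozen=True)
class ToolCall:
    action: str
    target: str


@dataclass(frozen=True)
class Caller:
    user_id: str
    permissions: frozenset


def authorize(call: ToolCall, caller: Caller) -> bool:
    """Decide from the caller's verified permissions, never from prompt text."""
    if call.action not in DESTRUCTIVE_ACTIONS:
        return True
    return call.action in caller.permissions


admin = Caller("u1", frozenset({"delete_directory"}))
guest = Caller("u2", frozenset())
delete_home = ToolCall("delete_directory", "/home/alice")

print(authorize(delete_home, admin))  # permitted: caller holds the permission
print(authorize(delete_home, guest))  # denied, however the request is phrased
```

The point of the design is that persuasion in the conversation has no path to the decision: a reframed or role-played request still arrives as the same `ToolCall` from the same `Caller`.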

Chang advised that network-layer inspection should be one component of a defense-in-depth strategy. “I would say that that is one component of a core principle that should be applied in terms of making sure that at least as traffic gets passed through the network layer, whether they’re inputs or outputs, should have some sort of either guardrail or sanitation check to ensure that the prompts that are coming back and forth are safe.”
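
The check on both inputs and outputs that Chang describes can be pictured as a thin wrapper around the model call. This is a sketch under stated assumptions: `guarded_call`, the stub models, and the marker-based check are hypothetical stand-ins, and a real guardrail would more likely be a trained classifier or policy service than a string match.

```python
from typing import Callable

# A check returns True when the text is acceptable.
Check = Callable[[str], bool]


def guarded_call(
    model: Callable[[str], str],
    prompt: str,
    input_check: Check,
    output_check: Check,
    refusal: str = "Request blocked by policy.",
) -> str:
    """Run the model only if the prompt passes, and return its reply only if that passes too."""
    if not input_check(prompt):
        return refusal
    reply = model(prompt)
    if not output_check(reply):
        return refusal
    return reply


def echo_model(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return f"echo: {prompt}"


def leaky_model(prompt: str) -> str:
    """Stub that always produces disallowed output, to exercise the output check."""
    return "BLOCKED_TOPIC details"


def no_marker(text: str) -> bool:
    """Placeholder check: rejects text containing a made-up marker string."""
    return "BLOCKED_TOPIC" not in text


print(guarded_call(echo_model, "hello", no_marker, no_marker))
print(guarded_call(echo_model, "BLOCKED_TOPIC please", no_marker, no_marker))
print(guarded_call(leaky_model, "hello", no_marker, no_marker))
```

Checking the output as well as the input matters for the multi-turn case, since a prompt that looks benign on its own can still elicit a reply that should not be returned.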

Practical Guidance for Enterprise Teams

Security teams reading the report can take three concrete actions. First, use the Cisco report and the LLM Security Leaderboard to inform model selection. The leaderboard publishes adversarial evaluation signals against leading models on a rolling basis, offering a more current picture than static model cards or published benchmarks. Second, do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin, and procurement decisions based solely on those scores carry unquantified risk from multi-turn exposure. Third, layer additional defenses on top of the model. No base model in the cohort is safe under iterative attack, making runtime guardrails, application-layer controls, and pre-deployment testing necessary regardless of which model an organization selects.

Chang summarized the core message: “Out of the box, without any additional protections, these models, whether they’re closed or open, are insufficient on their own to kind of be used in a way that [has] potential ramifications.” The research underscores that AI safety is not a solved problem and that enterprise adopters must go beyond standard benchmarks to truly understand the risks they face.


Source: Network World News

