
Best-of-N Jailbreak Exploits AI Randomness at Scale

Best-of-N jailbreaking bypasses AI safety filters via probabilistic sampling, with no model access required and serious risks at production scale.


Any attacker with an API key and a loop can now extract unsafe outputs from production AI systems. The technique is called Best-of-N, or BoN, jailbreaking. It requires no access to model weights, no insider knowledge, and no specialized tooling beyond the ability to send the same request many times in slightly different forms.

It works by exploiting something fundamental to how language models operate. Because LLMs sample probabilistically, they return marginally different outputs for identical or near-identical inputs. BoN jailbreaks take advantage of this variance: an attacker submits thousands of rephrased or reformatted versions of a prohibited query and waits for one variation to slip past safety filters. Search Engine Land calls it a "simple black-box algorithm," which is accurate and alarming in equal measure.

Operating entirely from the outside, BoN requires no white-box access to model gradients or internal representations. The attacker interacts with the system exactly as a legitimate user would. This places BoN in a different risk category from most adversarial ML techniques, which typically demand specialized knowledge or privileged access to the system under attack.

Variation strategies cluster into two types: semantic rephrasing, where the attacker restates the same request in different words, and structural manipulation, where the attacker shifts formatting, capitalization, or punctuation. Both exploit the same gap: safety classifiers and generation processes are not perfectly coupled. A query flagged in one form may pass in another, and with enough iterations the attacker finds the uncoupled case.

Why probability becomes risk

The arithmetic here deserves attention. If a restricted output escapes filtering even 0.1% of the time under sampling variance, an attacker running 10,000 queries will extract it with high probability. At current inference costs, 10,000 API calls is not a meaningful barrier. Rate limits help, but limits keyed to exact-match strings are trivially evaded by the variation strategies BoN already uses.
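The arithmetic above follows directly from treating each query as an independent trial. A few lines make the point explicit:

```python
def extraction_probability(p_single: float, n: int) -> float:
    """Probability of at least one success in n independent tries."""
    return 1 - (1 - p_single) ** n

# A 0.1% per-query escape rate over 10,000 queries:
p = extraction_probability(0.001, 10_000)  # ≈ 0.99995
```

Even a filter that is 99.9% reliable per query offers essentially no protection when the attacker controls the number of attempts.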

Anthropic's recent disclosure about Claude Mythos Preview illustrates how the stakes have shifted. The company found the model so dangerous it restricted public release, instead granting access to a select group of technology firms and launching Project Glasswing, a vulnerability-patching initiative. According to The Hill, Mythos Preview had already identified thousands of previously unknown security flaws across every major operating system and web browser. The defensive case for powerful models is real, but so is the symmetric risk: a model capable of finding vulnerabilities at that scale, adversarially prompted via BoN, poses serious offensive potential.

Capability expansion is accelerating the problem. The past weeks have seen a wave of significant model releases across providers, documented by Price Per Token and Humanity Redefined. Each new capability frontier typically widens the BoN attack surface, because safety fine-tuning is always trained on a finite distribution of harmful queries that sampling-based attacks can move past.

Implications for practitioners

BoN is not conceptually new. Red-teamers have exploited LLM stochasticity informally since the earliest public deployments of GPT-3. What has changed is the operational scale: inference is cheap, automation frameworks are mature, and the AI safety literature has not converged on consensus mitigations.

For teams deploying models in production, the practical takeaway is that output-layer safety filtering is insufficient as a primary defense when the attacker controls N. Filtering happens after generation; BoN wins by generating enough times to outlast the filter. More robust approaches include input-side anomaly detection keyed to semantic similarity across a session, rate limits that track intent rather than string identity, and architectural choices that reduce response variance on sensitive topic clusters. None of these eliminates the risk entirely; they raise the cost of each successful attack.
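One of those mitigations, session-level anomaly detection keyed to semantic similarity, can be sketched in a few lines. This is a hypothetical illustration: the `SessionRateLimiter` class, its threshold values, and the toy bag-of-words cosine similarity (standing in for a real sentence-embedding model) are all assumptions for the sake of the example, not a production design.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SessionRateLimiter:
    """Hypothetical sketch: flag a session once too many requests are
    near-duplicates of earlier ones. A toy bag-of-words vector stands
    in for a real embedding; a production system would embed intent."""

    def __init__(self, threshold: float = 0.8, max_similar: int = 5):
        self.history: list[Counter] = []
        self.threshold = threshold
        self.max_similar = max_similar

    def allow(self, prompt: str) -> bool:
        # Lowercasing neutralizes capitalization-based perturbations.
        vec = Counter(prompt.lower().split())
        similar = sum(
            1 for h in self.history if cosine(vec, h) >= self.threshold
        )
        self.history.append(vec)
        return similar < self.max_similar
```

Because similarity is computed on normalized intent rather than the raw string, BoN-style capitalization and punctuation variants collapse back onto the same cluster and trip the limit, where an exact-match rate limit would see thousands of "different" queries.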

A new lab, Core Automation, founded by ex-OpenAI and DeepMind researchers, is betting that automating AI research itself is the next frontier. That thesis applies directly to safety: the teams most likely to close the gap between attack scale and defensive capability are those that can iterate on mitigations at the same rate attackers iterate on prompts.

BoN jailbreaking does not expose a flaw in any one model. It exposes a flaw in how safety certification is conducted across the industry: testing at N=1 while deployment happens at N=millions.

Frequently asked questions

What is Best-of-N jailbreaking?
BoN jailbreaking is a black-box attack that sends many slightly varied versions of a prohibited query to an AI model, exploiting probabilistic sampling to find a variation that bypasses safety filters. No access to model internals is required.

Does BoN require access to the model's weights or source code?
No. Attackers interact with the model exactly as a normal user would, through a standard API. This is what makes BoN broadly applicable: it works against any system that uses stochastic sampling, which covers virtually every production LLM deployed today.

Which artificial intelligence systems are vulnerable to BoN attacks?
Any model that produces varied outputs across repeated identical prompts is theoretically vulnerable, encompassing the full range of publicly deployed LLMs. Closed models with stricter rate limits are harder to attack but not immune.

How should organizations defend against BoN attacks?
Output filtering alone is insufficient. More effective mitigations include semantic-similarity rate limiting across sessions, input-side anomaly detection, and reducing model response variance for sensitive topic categories. None is a complete defense on its own.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn