Security

AI Safety Tool Measures Real-World Bioweapon Risks

A new scoring system reveals how easily AI models could help create dangerous biological weapons, providing a concrete metric to guide safety policies and prevent catastrophic misuse.

AI Research
March 27, 2026
4 min read

As artificial intelligence models become more powerful and accessible, a critical question emerges: how can we measure the real-world risks they pose, especially in sensitive areas like biosecurity? Researchers from Johns Hopkins University have developed a new metric to quantify these threats, offering a way to translate AI-generated instructions into estimates of potential harm. This work addresses a growing concern that existing benchmarks, which test academic knowledge, fail to capture the dangers of AI systems being misused by malicious actors. The need for such tools is urgent, as recent regulatory shifts and reports of harmful AI outputs highlight the fragility of current safety frameworks.

The key finding is the Monte Carlo Expected Threat (MOCET) score, a metric that estimates the expected casualties per incident if someone followed an AI model's instructions to create a bioweapon. In a case study, the researchers applied MOCET to a fine-tuned open-source model called Dolphin-2.9-Llama-3-8B, which had reduced safety guardrails. They found that this model generated protocols for dangerous agents like Sarin and Anthrax, leading to non-zero threat scores. For Sarin, the MOCET score was 18.94, meaning each attempted incident would cause nearly 19 casualties in expectation. When scaled to a population level using data on mass murder rates, this translated to a Cumulative MOCET score of 568.17 expected casualties per year in the U.S. For Anthrax, the scores were lower but still significant, with a MOCET of 0.58 and a Cumulative MOCET of 17.50.
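The article does not spell out the population-level scaling step, but the reported figures are consistent with a simple product of the per-incident score and an annual rate of attempted incidents. The sketch below checks that arithmetic; the roughly 30 incidents-per-year figure it recovers is an inference from the numbers above, not a value stated in the paper.

```python
# Consistency check on the reported scores, assuming
# Cumulative MOCET = MOCET x (attempted incidents per year).
mocet_sarin = 18.94          # expected casualties per attempted incident
cumulative_sarin = 568.17    # expected casualties per year in the U.S.

mocet_anthrax = 0.58
cumulative_anthrax = 17.50

# Both agents imply roughly the same annual incident rate (~30/year),
# presumably the rate derived from mass murder statistics.
print(cumulative_sarin / mocet_sarin)      # ~30.0
print(cumulative_anthrax / mocet_anthrax)  # ~30.2
```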

The methodology behind MOCET involves modeling the multi-step process of building a bioweapon, as outlined in the paper's threat model (Figure 1). The researchers treat each step in an AI-generated protocol as a Bernoulli trial, where success or failure depends on the accuracy of the instructions. They use a Monte Carlo simulation to estimate the overall success probability of a protocol, which is the product of the probabilities of its individual steps. To assign these probabilities, they developed a data-driven approach using a k-Nearest Neighbors (k-NN) model on semantic embeddings of step descriptions, validated on benchmarks like MMLU and WMDP (Figure 3). This allows for automated, scalable estimation without manual categorization. The final MOCET score weights successful outcomes by a harm function based on historical casualty data from bioweapon attacks (Table 1), such as the 1,875 deaths from Sarin incidents.
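To make the mechanics concrete, here is a minimal Python sketch of the simulation described above. It is not the paper's code: the step probabilities (which the paper assigns with its k-NN model) and the harm weight are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mocet_score(step_probs, harm_per_success, n_trials=100_000):
    """Monte Carlo estimate of expected casualties per attempted incident.

    Each protocol step is an independent Bernoulli trial; the protocol
    succeeds only if every step succeeds, and each success is weighted
    by a harm value (expected casualties per successful protocol).
    """
    step_probs = np.asarray(step_probs)
    # One Bernoulli outcome per step, per simulated attempt.
    step_outcomes = rng.random((n_trials, step_probs.size)) < step_probs
    protocol_success = step_outcomes.all(axis=1)
    return protocol_success.mean() * harm_per_success

# Hypothetical five-step protocol and harm weight (illustration only).
score = mocet_score([0.9, 0.7, 0.5, 0.6, 0.8], harm_per_success=2_300)
print(f"MOCET ≈ {score:.1f} expected casualties per attempted incident")
```

Because the steps are independent, the simulated success rate converges to the product of the step probabilities (here 0.9 × 0.7 × 0.5 × 0.6 × 0.8 ≈ 0.151), which is why the paper can describe the protocol-level probability as a simple product.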

Analysis of the results shows a stark disconnect between standard benchmarks and real-world risk. The Dolphin model performed slightly worse on academic tests like MMLU and WMDP (Table 2), yet MOCET revealed it posed a tangible biosecurity threat. This underscores the inadequacy of current evaluation metrics in capturing catastrophic risks. The researchers also compared their automated calculations to human expert ratings, finding notable divergences: for Anthrax, the model estimated a 1.18% success probability while human experts rated it at 16.5%; for Sarin, the model's 0.82% was slightly higher than the human estimate of 0.5%. These differences highlight the complexity of threat assessment but validate MOCET as a consistent, systematic tool. The framework is robust, with simulations showing that a 10% error in step probabilities leads to only a 1% error in the final score; a sketch of that kind of sensitivity check follows.
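The robustness claim can be probed with a perturbation experiment like the one below. This is a generic sketch, not the paper's procedure: the noise model (independent multiplicative errors of up to ±10% per step) and the step probabilities are assumptions, and the measured error will depend on how the paper defines its 10% perturbation.

```python
import numpy as np

rng = np.random.default_rng(1)

def protocol_success(step_probs):
    # With independent Bernoulli steps, the analytic success
    # probability is just the product of the step probabilities.
    return float(np.prod(step_probs))

base = np.array([0.9, 0.7, 0.5, 0.6, 0.8])  # hypothetical step probabilities
baseline = protocol_success(base)

# Apply up to +/-10% multiplicative noise to every step probability
# and record the relative change in the final success probability.
rel_errors = []
for _ in range(10_000):
    noise = rng.uniform(-0.10, 0.10, size=base.size)
    perturbed = np.clip(base * (1 + noise), 0.0, 1.0)
    rel_errors.append(abs(protocol_success(perturbed) - baseline) / baseline)

print(f"Median relative error in the score: {np.median(rel_errors):.1%}")
```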

The implications of this work are significant for policymakers, AI developers, and the public. MOCET provides an interpretable metric that can contextualize AI risks in familiar terms, such as comparing per-incident scores to the 18.86 casualties per gun incident or cumulative scores to the 44,534 annual motor vehicle deaths. This helps stakeholders make informed decisions about safety measures and regulatory oversight. The paper aligns MOCET with existing risk frameworks from organizations like OpenAI, Anthropic, and NIST, supporting proactive governance. Importantly, the finding that even a relatively accessible open-source model can yield non-zero risk scores warns against releasing such models without rigorous safety evaluations. It emphasizes the need for scalable, open-ended metrics to keep pace with rapid AI advancements.

However, the MOCET framework has limitations that must be acknowledged. It assumes that a malicious actor would not fact-check the AI's instructions or use techniques like multi-turn prompting to improve outcomes. The accuracy of the scores depends on reliable estimates of step probabilities and harm weights, which require more empirical data. The paper notes that MOCET should be considered an order-of-magnitude estimate rather than a precise prediction. Despite these constraints, the metric is monotonic, meaning it reliably indicates whether safety interventions reduce risk. This makes it a valuable tool for guiding the development of safeguards, even as researchers work to refine its inputs and assumptions for broader application.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn