Ethics

AI Safety Tests Miss How Models Amplify Human Harm

Researchers propose measuring 'harmful capability uplift'—how much AI increases users' ability to cause damage—arguing current safety evaluations fail to capture real-world risks where humans and AI collaborate on malicious tasks.

AI Research
April 01, 2026
4 min read

As frontier AI models grow more powerful, traditional safety evaluations are missing a critical dimension: how these systems amplify human capabilities for harmful purposes. Current approaches focus on whether models emit dangerous content in isolation, but researchers from MIT argue this overlooks the real-world scenario where determined users collaborate with AI to achieve malicious objectives. Their analysis reveals that existing safety protocols—static benchmarks, human evaluations, and red teaming—systematically fail to measure what they term 'harmful capability uplift,' the marginal increase in a user's ability to cause harm when assisted by frontier models beyond what conventional tools already enable. This gap leaves regulators and developers without crucial data about how AI might transform novice actors into viable threats.

Harmful capability uplift represents a fundamental shift in how we should assess AI safety. Instead of asking 'Does the model ever emit dangerous content?' researchers propose focusing on 'Does the model meaningfully increase the harmful actions users can perform?' This distinction matters because models that pass standard safety benchmarks can still dramatically enhance human capabilities for malicious purposes. The paper highlights how leading AI companies have acknowledged this concept—OpenAI vows to track whether models 'provide meaningful counterfactual assistance' to novice actors creating biological threats, Anthropic pledges to identify if models 'significantly help' individuals deploy chemical, biological, radiological and nuclear weapons, and Google promises to track assistance with 'high impact cyber attacks.' Yet these commitments remain operationalized through incompatible methodologies that prevent meaningful comparison and cumulative scientific progress.

To measure harmful capability uplift properly, the researchers propose a rigorous experimental framework requiring three conditions: human participants working alone with conventional tools like web search, AI systems working independently on the same tasks, and human-AI collaboration where participants use the model as they would in real-world scenarios. This design isolates what the model adds above what motivated humans can already accomplish. The approach addresses critical limitations in current evaluations where static benchmarks can be gamed through 'sandbagging'—models deliberately underperforming during public evaluations to conceal stronger capabilities—and where safety benchmark performance often tracks general capability improvements rather than genuine risk reduction, leading to 'safetywashing' where ordinary capability scaling masquerades as safety progress.
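To make the three-arm design concrete, here is a minimal sketch of how such an experiment might be organized in code. The condition labels, the Trial fields, and the randomization scheme are illustrative assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Arm(Enum):
    """The three experimental conditions the paper requires."""
    HUMAN_ALONE = "human with conventional tools (e.g., web search)"
    AI_ALONE = "model working autonomously on the same tasks"
    HUMAN_AI = "human collaborating with the model"


@dataclass
class Trial:
    participant_id: int
    arm: Arm
    score: float | None = None  # proxy-task score in [0, 1], filled in after the session


def assign_arms(participant_ids: list[int], seed: int = 0) -> list[Trial]:
    """Randomize participants across the three conditions so the
    human-alone baseline and the human-AI arm stay comparable."""
    rng = random.Random(seed)
    shuffled = participant_ids[:]
    rng.shuffle(shuffled)
    arms = list(Arm)
    return [Trial(pid, arms[i % 3]) for i, pid in enumerate(shuffled)]
```

Comparing the human-AI arm against the human-alone baseline isolates the model's marginal contribution, while the AI-alone arm reveals whether the model could act as a substitute rather than an amplifier.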

The paper analyzes existing studies that attempt to measure harmful capability uplift, revealing concerning methodological weaknesses. Evaluations by Anthropic, OpenAI, and Meta suffer from inadequate sample sizes, missing control conditions, and inconsistent evaluation frameworks that prevent meaningful cross-study comparison. For instance, one study on drafting bioweapons acquisition plans reported an uplift ratio of 2.1 but didn't provide sample size details, while another on planning biological attacks found no significant uplift but also lacked proper statistical reporting. The researchers propose the harmful capability uplift ratio (U = Human-AI performance divided by Human-alone performance) as the primary metric, where values greater than 1.0 represent proportional improvement and U → ∞ indicates a 'novel-capability flag' where AI transforms an otherwise incapable actor into a viable threat.
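The metric itself is simple to state. The sketch below assumes mean proxy-task scores per arm; the novel-capability flag corresponds to a zero human-alone baseline.

```python
def uplift_ratio(human_ai_scores: list[float],
                 human_alone_scores: list[float]) -> float:
    """U = mean(human + AI performance) / mean(human-alone performance).

    U > 1.0 indicates proportional improvement from AI assistance.
    A zero human-alone baseline returns float('inf'), the paper's
    'novel-capability flag': the model turns an otherwise incapable
    actor into a viable threat.
    """
    baseline = sum(human_alone_scores) / len(human_alone_scores)
    assisted = sum(human_ai_scores) / len(human_ai_scores)
    return float("inf") if baseline == 0 else assisted / baseline


# Illustrative scores (invented): mean 0.605 assisted vs. 0.29 alone
print(uplift_ratio([0.63, 0.58], [0.30, 0.28]))  # ~2.09, near the 2.1 reported above
```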

Implementing this framework requires addressing significant methodological challenges, particularly the 'proxy task problem' of using safe tasks to predict dangerous capabilities. Since directly measuring performance in genuinely harmful tasks raises ethical concerns, researchers must rely on proxy tasks that approximate capabilities of interest while remaining ethically acceptable. The paper proposes leveraging recent methodological advances from integrative experimental frameworks, such as the Task Space approach, which quantifies task similarities along multiple theoretically informed dimensions. This allows researchers to precisely characterize how proxy tasks relate to genuine tasks of concern and validate when proxy task performance reliably predicts capabilities on target tasks through empirical validation protocols requiring minimum correlation thresholds.
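One way this might look in practice, assuming the Task Space dimensions are available as numeric scores: the dimension names and the 0.7 correlation threshold below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical Task Space dimensions; the framework scores tasks along
# multiple theoretically informed axes (these names are illustrative).
DIMENSIONS = ["domain_knowledge", "procedural_complexity",
              "tool_dependence", "tacit_skill", "information_access"]


def task_similarity(proxy: dict[str, float], target: dict[str, float]) -> float:
    """Cosine similarity between a proxy task and the genuine task of
    concern, each represented as a vector over the Task Space dimensions."""
    a = np.array([proxy[d] for d in DIMENSIONS])
    b = np.array([target[d] for d in DIMENSIONS])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def proxy_validated(proxy_perf: np.ndarray, target_perf: np.ndarray,
                    min_r: float = 0.7) -> bool:
    """Empirical validation protocol: accept the proxy only if its
    performance correlates with performance on an ethically measurable
    target task above a preset minimum (0.7 is an assumed value)."""
    r = float(np.corrcoef(proxy_perf, target_perf)[0, 1])
    return r >= min_r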

Statistical approaches for these safety-critical assessments must invert the usual priorities, as a false negative—overlooking a model that nudges a malicious actor past a catastrophic threshold—is more costly than a false alarm. Sample-size planning should target the smallest effect size of safety concern and deliver at least 95% power, mirroring standards for Registered Reports and guarding against the under-powered designs that currently dominate the literature. Corrections for multiple hypotheses should be lenient, as stringent corrections like Bonferroni can hide real risks, and non-significant results warrant equivalence testing rather than claims of 'no significant difference.' The researchers emphasize that preregistration is essential because uplift experiments can steer deployment decisions and expose dual-use risks.
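A sketch of both steps using statsmodels, assuming a two-sample design; the effect size d = 0.3 and the ±0.1 equivalence margin are placeholders for whatever a preregistration would specify, and the scores are simulated.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttost_ind

# 1. Power the study for the smallest effect of safety concern (d = 0.3
#    is an assumed value) at 95% power, per the standard described above.
n_per_arm = TTestIndPower().solve_power(effect_size=0.3, power=0.95,
                                        alpha=0.05, alternative="larger")
print(f"participants needed per arm: {np.ceil(n_per_arm):.0f}")

# 2. A non-significant difference is not evidence of no uplift; run an
#    equivalence test (TOST) against a pre-specified margin instead.
rng = np.random.default_rng(0)
human_ai = rng.normal(0.52, 0.1, 60)      # simulated proxy-task scores
human_alone = rng.normal(0.50, 0.1, 60)
p_equiv, _, _ = ttost_ind(human_ai, human_alone, low=-0.1, upp=0.1)
print(f"TOST p-value: {p_equiv:.3f}")     # small p => scores are equivalent
```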

Looking forward, the paper proposes a forecasting approach to scale uplift assessment across rapidly evolving model generations. Since frontier models advance week to week while human-subjects studies take months, researchers need tools to forecast harmful capability uplift for new models by reusing existing experimental data about older models. They propose regression models that predict uplift from familiar public benchmark scores, reserving human trials for spot checks and for cases where forecasted uplift breaches predetermined thresholds. This approach also incentivizes creating benchmarks that are maximally predictive of harmful capability uplift, encouraging a more deliberate search for leading indicators of risk across different threat domains like biosecurity and cybersecurity.
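A minimal version of that forecasting loop, assuming uplift ratios measured in past human studies and public benchmark scores as features; all numbers and the 1.5 review threshold are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: past model generations. Columns: public benchmark scores used
# as leading indicators (hypothetical values).
X_past = np.array([[62.0, 41.5],
                   [70.3, 48.9],
                   [78.1, 55.2]])
u_past = np.array([1.1, 1.4, 2.0])   # uplift ratios from earlier human studies

forecaster = LinearRegression().fit(X_past, u_past)

x_new = np.array([[84.6, 61.0]])     # a new frontier model's benchmark scores
u_hat = float(forecaster.predict(x_new)[0])

REVIEW_THRESHOLD = 1.5               # assumed trigger for a human-subjects study
if u_hat >= REVIEW_THRESHOLD:
    print(f"forecast U = {u_hat:.2f}: schedule a human-subjects spot check")
else:
    print(f"forecast U = {u_hat:.2f}: below threshold; benchmark-only pass")
```

Reserving human trials for threshold breaches keeps the expensive, slow experiments pointed at the models most likely to matter.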

The researchers translate their methodology into concrete actions for four key stakeholder groups. Model developers should integrate uplift assessment into development cycles using both benchmark-based surrogates and rigorous human studies on the highest-priority scenarios. Researchers should build theoretical frameworks modeling how human cognition, task structure, and AI scaling interact to produce harmful capability uplift. Funders should establish dedicated streams for uplift methodology, proxy-task validation, and longitudinal studies tracking users across model generations. Regulators and AI Safety Institutes should establish explicit risk thresholds that trigger graduated oversight and provide coordination infrastructure for secure registries of evaluation results and shared forecasting models.

Ultimately, the paper argues that frontier AI systems now amplify human cognition at a scale that outpaces traditional safety protocols. While static benchmarks and red teaming remain essential, they miss the critical intersection where model capability meets human intent. By institutionalizing harmful capability uplift metrics alongside traditional benchmarks, the AI safety community can transform safety evaluation from episodic audits into continuous observation, ensuring frontier models' power to amplify malicious intent remains within socially governable bounds while preserving the benefits of rapid innovation. The framework represents a necessary evolution in how we assess AI risks as these systems become increasingly integrated into human decision-making and action.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

A former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
