
Google, Microsoft, xAI Hand Models to NIST for Pre-Deployment Review

TL;DR

NIST's Center for AI Standards and Innovation adds Google DeepMind, Microsoft, and xAI to its AI safety evaluation program, now covering five major frontier labs.

Three of the largest AI companies have agreed to let the U.S. federal government inspect their most advanced models before public release. Google DeepMind, Microsoft, and xAI signed agreements with the National Institute of Standards and Technology this week, giving NIST's Center for AI Standards and Innovation (CAISI) pre-deployment access to frontier systems for safety testing.

The Hill reported the announcement Tuesday. CAISI will conduct what it describes as "pre-deployment evaluations and targeted research" to assess frontier AI capabilities and advance AI security, with post-deployment assessments continuing after models go live. The office has now completed more than 40 such evaluations.

"Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications," CAISI Director Chris Fall said in a statement. These expanded collaborations, he added, help the agency scale its work "at a critical moment."

Not the first round

OpenAI and Anthropic signed similar agreements in 2024, the first of their kind under this framework. Both have since faced scrutiny over capability disclosures: Anthropic limited rollout of its Mythos model after it demonstrated an unusual ability to identify software vulnerabilities, and CNBC reported that OpenAI's GPT-5.5 met the company's "High" cybersecurity risk classification on release, meaning the model could "amplify existing pathways to severe harm." External government testing is increasingly viewed as a complement to labs' internal red-teaming, not a replacement for it.

Tuesday's announcement arrived one day after the New York Times reported the White House is weighing executive action to formalize pre-release AI oversight. The proposed measure would establish a working group of tech executives and government officials to examine oversight procedures, a notable shift for an administration that has consistently favored light-touch regulation. Whether CAISI evaluations would feed into any White House-mandated process remains unclear, as NIST has not confirmed how existing agreements might evolve.

What CAISI actually tests

CAISI assessments focus on frontier capabilities and their security implications, though the agency has not published a detailed methodology for its pre-deployment evaluations. The 40-plus completed evaluations span both pre- and post-deployment phases. For comparison, the European Union's Artificial Intelligence Act requires providers of general-purpose AI models with systemic risk to conduct adversarial testing and report serious incidents, a legal mandate that has pushed frontier labs toward more structured evaluation pipelines. CAISI's voluntary review program occupies similar conceptual ground without the regulatory weight.

The model landscape CAISI must now navigate has grown considerably more complex. Trackers such as LLM Stats and Price Per Token list dozens of new releases from recent weeks alone: DeepSeek V4 Pro and V4 Flash, GPT-5.5 and GPT-5.5 Pro, Claude Opus 4.7, and multiple Qwen3.6 variants. Evaluating a single frontier model is already resource-intensive; evaluating the field at this pace will require significantly more capacity or automation, and CAISI has said nothing about either.

Structural limits, structural value

For practitioners building on top of these models, the practical upside of government review is limited in the near term. CAISI does not publish detailed evaluation reports, so engineers cannot use its findings to inform deployment decisions the way they might use a benchmark paper or a public red-team report. The value is structural: it establishes a precedent that pre-deployment review of frontier AI is both feasible and accepted by major labs without a legal fight.

The harder problem is what happens when an evaluation surfaces something concerning. CAISI holds no regulatory authority to delay or block a model release. If a system clears internal testing, passes government review, and still causes harm at scale, the voluntary structure offers no recourse. Closing that gap appears to be the stated goal of the White House working group, though few operational details have been shared publicly.

Three more labs, same open question

If the executive order reported by the Times materializes, the CAISI framework could become a template for oversight with actual enforcement power. For now, three more frontier labs have joined a voluntary program that is growing faster than its public accountability mechanisms. The test is not whether government can review these models. It is whether review findings ever become actionable before something goes wrong.

Frequently asked questions

What is CAISI and how does it relate to NIST?

CAISI, the Center for AI Standards and Innovation, is the NIST division responsible for AI safety evaluations. It conducts both pre- and post-deployment assessments of frontier AI models under voluntary agreements with industry, and has now completed more than 40 such evaluations since the program began.

Are these NIST agreements mandatory?

No. All current arrangements are voluntary. Labs agree to share models ahead of release; NIST evaluates but holds no authority to delay or block deployment. The White House is reportedly considering an executive order that could formalize requirements and add enforcement mechanisms.

How do CAISI evaluations differ from a lab's internal red-teaming?

Internal red-teaming is controlled entirely by the lab and not independently audited. CAISI provides an external, government-run layer of review. However, because methodology and findings are not publicly released in detail, the program itself has limited outside scrutiny.

Which major labs have now signed NIST agreements?

As of May 2026, five companies have signed: OpenAI and Anthropic in 2024, followed by Google DeepMind, Microsoft, and xAI in this latest round announced Tuesday.

About the Author

Guilherme A.

Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
