Stanford CheXOne AI Matches Radiology Residents on Chest X-Rays
Stanford's 4B-parameter CheXOne model cuts radiology report writing time by 64% while matching residents on key diagnostic benchmarks.
Sixty-four percent. That's how much faster medical residents wrote radiology reports when editing AI-generated drafts from CheXOne, a new chest X-ray model from Stanford's AIMI laboratory. The system, built on a 4-billion-parameter architecture, matched or outperformed residents on direct diagnostic accuracy across 36 distinct interpretation tasks.
Accuracy That Breaks the Curve
On the ReXVQA benchmark, CheXOne reaches 94.7% accuracy on presence assessment, versus 60.3% for ChestX-Reasoner, the prior state-of-the-art model. Negation detection, the ability to correctly identify when a finding is absent, rises from 80.0% to 98.8%. Differential diagnosis accuracy jumps from 75.8% to 95.1%, and geometric reasoning climbs from 61.4% to 88.3%.
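To put those gaps in perspective, here is a minimal Python sketch that turns the four quoted accuracy pairs into percentage-point gains and into the share of the prior model's errors that CheXOne eliminates. The numbers come from the paragraph above; the `gains` helper and the error-reduction framing are illustrative additions, not something the paper is known to report.

```python
# Accuracy figures (percent) quoted above: (prior state of the art, CheXOne).
results = {
    "presence assessment": (60.3, 94.7),
    "negation detection": (80.0, 98.8),
    "differential diagnosis": (75.8, 95.1),
    "geometric reasoning": (61.4, 88.3),
}


def gains(prior, new):
    """Return (percentage-point gain, percent of prior errors eliminated).

    Illustrative helper, not taken from the paper.
    """
    point_gain = new - prior
    error_reduction = 100.0 * point_gain / (100.0 - prior)
    return round(point_gain, 1), round(error_reduction, 1)


summary = {task: gains(prior, new) for task, (prior, new) in results.items()}

for task, (points, errors_removed) in summary.items():
    print(f"{task}: +{points} points, {errors_removed}% of prior errors removed")
```

Read this way, the 18.8-point gain on negation detection looks smaller than the others in absolute terms, yet it removes about 94% of the errors the earlier model made on that task.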
These aren't marginal improvements. Gaps this wide over the prior state of the art suggest a qualitative shift in what smaller-scale models can do, not just incremental tuning. For report generation, CheXOne also achieves a state-of-the-art BERTScore of 0.483, a metric that measures semantic similarity to reference reports written by trained clinicians.
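For readers unfamiliar with the report-generation metric, the core idea can be sketched in a few lines of Python: each token in one report is greedily matched to its most similar token in the other, and the averaged similarities are combined into an F1 score. Everything below is a toy under stated assumptions. The three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical hand-picked values, whereas the real metric uses contextual embeddings from a pretrained language model and can add token weighting and baseline rescaling, so these scores are not comparable to the 0.483 figure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(candidate, reference):
    """Greedy-matching F1 over two lists of token vectors."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy embeddings; a real system would use a language model here.
TOY_EMBEDDINGS = {
    "no": [1.0, 0.0, 0.0],
    "without": [0.9, 0.1, 0.0],
    "pleural": [0.0, 1.0, 0.0],
    "effusion": [0.0, 0.9, 0.2],
    "cardiomegaly": [0.0, 0.0, 1.0],
}


def embed(text):
    """Look up a toy vector for each whitespace-separated token."""
    return [TOY_EMBEDDINGS[token] for token in text.split()]


reference = embed("no pleural effusion")
score_same = greedy_f1(reference, reference)
score_close = greedy_f1(embed("without pleural effusion"), reference)
score_far = greedy_f1(embed("cardiomegaly"), reference)

print(score_same, score_close, score_far)
```

The point of the toy is the ordering: a paraphrase ("without pleural effusion") scores close to the reference, while an unrelated finding scores far lower, which is why the metric rewards meaning rather than exact wording.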
Built on Public Data, Released Publicly
The model is built on Qwen2.5-3B-Instruct and was trained through a two-stage pipeline combining instruction tuning and reinforcement learning across 14.7 million samples from 30 public datasets. The team, which includes Stanford AIMI's Curtis Langlotz, used explicit clinical reasoning chains so the model articulates its diagnostic logic rather than issuing opaque labels. Stanford released the weights and code under CC-BY-NC-4.0 on both Hugging Face and GitHub.
That open release may be more consequential than any individual benchmark number. Proprietary clinical AI has repeatedly struggled to move from paper to practice because hospital systems cannot audit the models they deploy. CheXOne sidesteps that barrier on day one.
What This Means in Practice
The clinical reader study is where the work shifts from research to relevance. Physicians rated AI-drafted reports comparable to or better than resident-written reports 55% of the time, and residents who edited AI drafts cut their report-writing time by 64%. The bottom line: CheXOne doesn't replace a radiologist; it solves the blank-page problem, the overhead of structuring a report from scratch that delays turnaround in high-volume settings.
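A quick piece of arithmetic shows what the 64% figure could mean for throughput. The Python sketch below assumes the number is a reduction in time per report, as the summary at the top of this article phrases it; the 10-minute baseline is a hypothetical value chosen only for illustration.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional time reduction into a throughput multiplier."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


TIME_REDUCTION = 0.64    # from the reader study, read as time saved per report
BASELINE_MINUTES = 10.0  # hypothetical time to write one report from scratch

edited_minutes = BASELINE_MINUTES * (1.0 - TIME_REDUCTION)
throughput_factor = speedup_from_time_reduction(TIME_REDUCTION)

print(f"{edited_minutes:.1f} minutes per report, {throughput_factor:.2f}x throughput")
```

Under that reading, a report that took 10 minutes takes 3.6, and a resident could complete roughly 2.8 reports in the time one used to take, provided drafting rather than image review is the bottleneck.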
The global radiologist shortage is structural and accelerating. The WHO projects a shortage of over 11 million health workers by 2030, with lower-income countries bearing the greatest burden, and radiology is among the specialties where demand most outpaces supply. A model this capable at 4 billion parameters, small enough to run on modest hospital hardware, is precisely the profile that resource-limited health systems have been waiting for. If adoption moves quickly, the binding constraint won't be clinical accuracy. It will be whether regulatory and deployment infrastructure can keep pace.
Source: https://arxiv.org/abs/2604.00493