
AI Medical Tests Get a 98% Cost Cut

A new adaptive testing method evaluates large language models on medical knowledge using just 1.3% of the questions, slashing time and expense while preserving accuracy—transforming how AI is benchmarked for healthcare.

AI Research
March 27, 2026
4 min read

Evaluating the medical knowledge of large language models (LLMs) has become a critical yet costly bottleneck for their safe integration into healthcare. Traditional benchmarks, which require thousands of questions, are financially and computationally prohibitive, often costing over $1,000 per evaluation and taking several hours to complete. This expense hinders the frequent monitoring needed as models rapidly improve, creating a gap in reliable oversight for AI systems intended for clinical support. A new study addresses this by adapting a psychometric method long used in human testing, offering a way to assess LLMs with near-perfect accuracy at a fraction of the cost.

The researchers discovered that a computerized adaptive testing (CAT) framework can evaluate LLMs on standardized medical knowledge using only about 1.3% of the items from a full benchmark while maintaining exceptional fidelity. In an empirical evaluation of 38 LLMs, the CAT-derived proficiency estimates achieved a near-perfect correlation (r = 0.988) with scores from the full 2,815-question bank. More importantly, the adaptive test perfectly replicated the model rankings, with a Spearman's rank correlation of 1.0, meaning the order of performance remained identical despite the drastic reduction in questions. This precision was achieved while cutting evaluation time from approximately 6.85 hours to just 8.4 minutes per model on average.
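To make the two fidelity metrics concrete, the short sketch below contrasts them on synthetic scores (not the study's data, where both values were computed over the 38 models): Pearson's r captures linear agreement between the adaptive estimates and the full-bank scores, while Spearman's rho equals 1.0 exactly when the ranking is preserved.

```python
# Illustrative only: synthetic numbers standing in for full-bank scores and CAT
# ability estimates; the study's own figures are r = 0.988 and rho = 1.0 over 38 LLMs.
import numpy as np
from scipy.stats import pearsonr, spearmanr

full_bank = np.array([0.81, 0.74, 0.69, 0.66, 0.58])    # hypothetical full-benchmark scores
cat_theta = np.array([1.12, 0.65, 0.31, 0.18, -0.40])   # hypothetical adaptive-test estimates

r, _ = pearsonr(full_bank, cat_theta)      # linear agreement
rho, _ = spearmanr(full_bank, cat_theta)   # rank agreement: 1.0 iff the ordering matches
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```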

The methodology builds on item response theory (IRT), a psychometric model that calibrates item difficulty and discrimination based on human responses. The team utilized a secure, non-public item bank from China’s National Center for Health Professions Education Development, comprising 2,815 multiple-choice questions calibrated on over 40,000 senior medical students. They designed a CAT system that dynamically selects questions based on a model's real-time ability estimate, using a maximum Fisher information strategy to choose items that provide the most information about the model's proficiency. The test terminated when the standard error of the ability estimate fell below a predefined threshold of 0.316, corresponding to a reliability of 0.90, which was identified as optimal through Monte Carlo simulations comparing various stopping rules.
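The paper does not publish its implementation, but the core loop can be sketched under a two-parameter logistic (2PL) IRT model, which matches the difficulty-and-discrimination calibration described above. Everything in the sketch (the item-bank structure, the grid-based ability estimator, the answer_item callback that queries and grades the LLM) is an illustrative assumption, not the authors' code; only the maximum-Fisher-information selection and the SE < 0.316 stopping rule come from the article.

```python
import numpy as np

def prob_correct(theta, a, b):
    """2PL probability of a correct response at ability theta (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information an item contributes about theta under the 2PL model."""
    p = prob_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_ability(responses, grid=np.linspace(-4.0, 4.0, 401)):
    """Grid-search maximum-likelihood ability estimate plus its standard error."""
    loglik = np.zeros_like(grid)
    for a, b, correct in responses:
        p = prob_correct(grid, a, b)
        loglik += np.log(p if correct else 1.0 - p)
    theta = grid[np.argmax(loglik)]
    info = sum(fisher_information(theta, a, b) for a, b, _ in responses)
    return theta, (1.0 / np.sqrt(info) if info > 0 else float("inf"))

def run_cat(item_bank, answer_item, se_threshold=0.316, max_items=200):
    """Administer the most informative unused item until the SE falls below the threshold."""
    responses, used, theta, se = [], set(), 0.0, float("inf")
    for _ in range(max_items):
        remaining = [i for i in range(len(item_bank)) if i not in used]
        nxt = max(remaining, key=lambda i: fisher_information(
            theta, item_bank[i]["a"], item_bank[i]["b"]))
        used.add(nxt)
        correct = answer_item(item_bank[nxt])   # e.g., prompt the LLM and grade its choice
        responses.append((item_bank[nxt]["a"], item_bank[nxt]["b"], correct))
        theta, se = estimate_ability(responses)
        if se < se_threshold:                   # SE = 0.316 corresponds to reliability ~0.90
            break
    return theta, se, len(responses)
```

In this sketch, answer_item would wrap the prompting and answer-grading logic for the model under evaluation; those details are deliberately omitted here.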

The data shows dramatic cost-effectiveness gains. The CAT protocol required an average of only 37 items to meet the stopping criterion, compared to 2,815 in the full bank, representing a 98.7% reduction in test length. This translated into a 98.3% reduction in token consumption (from over 1.77 million tokens to about 0.03 million) and a 98.0% reduction in time (from 24,673 seconds to 505 seconds). In simulations, the adaptive strategy outperformed a random selection baseline, which required more items and introduced larger bias, incorrectly ranking top models. The researchers also validated construct validity by correlating CAT scores with performance on an independent open-ended benchmark (LLMEval-Med), finding strong correlations (r ≈ 0.83), indicating the adaptive scores reflect genuine medical knowledge beyond multiple-choice heuristics.
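The headline percentages follow directly from the reported raw figures; the snippet below simply reproduces that arithmetic.

```python
# Reported figures from the study: full-bank evaluation vs. the average adaptive run.
pairs = {
    "items":    (2_815, 37),
    "tokens":   (1_770_000, 30_000),
    "time (s)": (24_673, 505),
}
for label, (full, cat) in pairs.items():
    print(f"{label:8s}: {cat:>9,} vs {full:>9,} -> {100 * (1 - cat / full):.1f}% reduction")
# items   : 37 vs 2,815          -> 98.7% reduction (about 1.3% of the bank)
# tokens  : 30,000 vs 1,770,000  -> 98.3% reduction
# time (s): 505 vs 24,673        -> 98.0% reduction
```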

The implications are significant for scaling AI evaluation in healthcare. By reducing the marginal cost of benchmarking, this approach makes frequent, protocol-controlled monitoring feasible, enabling longitudinal tracking of model updates within governance workflows. The study estimates that evaluating a model on a large benchmark like Med-HALT (59,254 items) could drop from around $1,475 to under $5 using CAT, transforming evaluation from a resource-intensive bottleneck into a routine operational capability. This supports pre-deployment screening and continuous oversight and provides a standardized foundation for comparative assessment, though it does not replace clinical validation or safety testing.
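The Med-HALT figure is a straightforward per-item scaling. Assuming cost is roughly proportional to the number of items administered (an assumption of this back-of-the-envelope sketch, not a claim from the paper), the arithmetic looks like this:

```python
full_cost, full_items = 1_475.0, 59_254   # reported full Med-HALT evaluation
cat_items = 37                            # average adaptive test length in the study
per_item = full_cost / full_items         # roughly $0.025 per item
print(f"adaptive run: ~${per_item * cat_items:.2f}")   # ~$0.92, comfortably under $5
```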

However, the framework has limitations. Its estimates are anchored in a single nationally calibrated item bank; generalizability to other medical exam systems (e.g., USMLE or European formats) may require re-tuning of stopping rules and psychometric linking. The evaluation relies on multiple-choice questions, which primarily assess standardized knowledge and may not fully capture real-world clinical reasoning or safety-critical gaps. The researchers caution that CAT is an efficiency-oriented instrument for quantifying knowledge proficiency, not a safety clearance tool, and must be complemented by dedicated safety evaluations before any real-world deployment. Future work should extend adaptive measurement to workflow-aligned task formats and incorporate safety-aware constraints, such as mandatory core safety items.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn