AIResearch
Science

AI Learns to Score Medical Answers Like a Doctor

A new training method uses structured medical standards and geometric constraints to align AI with clinical reasoning, achieving top performance on a major health benchmark while cutting expert labor costs by over 90%.

AI Research
March 26, 2026
4 min read

Large language models have shown promise in medicine, but their real-world clinical use is often hampered by a critical misalignment: they struggle to match the nuanced, dynamic thinking of human doctors. Traditional AI training relies on static benchmarks that fail to capture the complex priorities of medical scenarios, such as emergency triage versus chronic disease management, and cannot adapt to evolving, multi-source guidelines. This disconnect limits trust and utility, as models may provide accurate facts but lack the reasoning, safety, and compliance required in practice. A new framework, developed by researchers from Shanghai Mingpin Medical Data Technology Co., Ltd., aims to bridge this gap by embedding authoritative medical standards directly into AI training, enabling models to evaluate and generate responses more like clinicians.

The researchers propose MR-RML (Multidimensional Rubric-oriented Reward Model Learning) via GPRC (Geometric Projection Reference Constraints), a framework that restructures how AI learns medical alignment. At its core, the framework introduces a three-dimensional medical standard system called "Dimensions-Scenarios-Disciplines," which organizes evaluation criteria into a matrix. This matrix includes core dimensions like Information Content Quality and Clinical Assessment, various scenarios such as Emergency Triage Guidance, and disciplines like Internal Medicine or Pediatrics. By integrating this structured system into the full training pipeline, the framework ensures AI responses are guided by real-world clinical needs rather than generic benchmarks. The approach shifts from real-time, costly rubric-based scoring to an internalized multi-dimensional reward model, which decomposes evaluation into specific criteria and uses geometric constraints to align scoring with medical logic.
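To make the "Dimensions-Scenarios-Disciplines" idea concrete, here is a minimal sketch of what such a 3D rubric matrix could look like as a data structure. The dimension, scenario, and discipline names are taken from the examples above, but the structure itself is an illustrative assumption, not the paper's actual implementation:

```python
# Hypothetical sketch of a "Dimensions-Scenarios-Disciplines" rubric matrix.
# The axis labels below come from the article; the dict-of-tuples layout
# is an illustrative assumption, not the authors' code.
from itertools import product

dimensions = ["Information Content Quality", "Clinical Assessment"]
scenarios = ["Emergency Triage Guidance", "Chronic Disease Management"]
disciplines = ["Internal Medicine", "Pediatrics"]

# Each cell of the 3D matrix would hold the verifiable rubric criteria
# that apply to questions at that intersection of the three axes.
rubric_matrix = {
    (dim, scen, disc): []  # criteria to be filled in per question type
    for dim, scen, disc in product(dimensions, scenarios, disciplines)
}

print(len(rubric_matrix))  # 2 * 2 * 2 = 8 cells in this toy matrix
```

In this toy form, every question can be routed to one cell of the matrix, and the criteria stored there define what a good, medium, or poor answer must contain.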

To implement MR-RML, the researchers first constructed training data based on the 3D medical standard matrix. They generated questions covering different dimensions, scenarios, and disciplines, then created multi-dimensional rubrics for each question—verifiable standards that define what makes an answer good, medium, or poor. For example, a question about drug interactions for a child might include a rubric requiring clear mention of overlapping medication risks. Using large language models like GPT-4, they scored answers on a 1-5 scale across multiple dimensions, aggregating these into score vectors. This process produced a dataset with samples of varying quality, which was used for supervised fine-tuning to give the base model an initial alignment with medical knowledge. The key innovation lies in the reward model training: instead of outputting a single score, it projects answer vectors onto dimension-specific descriptions and applies geometric constraints to ensure scoring gradients reflect medical reasoning, such as medium answers scoring between poor and good ones.
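The geometric constraint described above (medium answers must score between poor and good ones along each dimension axis) can be sketched as a projection plus a hinge-style ordering penalty. Everything here is an illustrative assumption with random toy vectors; the paper's actual embeddings, projection scheme, and loss are not shown in this article:

```python
# Toy sketch of a geometric projection ordering constraint: answers of
# known quality (poor / medium / good) are projected onto a dimension
# description vector, and a hinge penalty is zero only when the
# projections respect the medical ordering poor < medium < good.
# All vectors are random placeholders, not real model embeddings.
import numpy as np

rng = np.random.default_rng(0)

dim_desc = rng.normal(size=8)                      # dimension description vector
answers = {q: rng.normal(size=8) for q in ("poor", "medium", "good")}

def project(answer_vec, dim_vec):
    """Scalar projection of an answer embedding onto a dimension axis."""
    return float(answer_vec @ dim_vec / np.linalg.norm(dim_vec))

scores = {q: project(v, dim_desc) for q, v in answers.items()}

# Hinge penalties enforce the ordering with a small margin; a reward
# model trained under this loss learns gradients that respect the
# poor < medium < good structure on every dimension.
margin = 0.1
loss = (max(0.0, scores["poor"] - scores["medium"] + margin)
        + max(0.0, scores["medium"] - scores["good"] + margin))
```

With real training data, this kind of penalty would be summed over all dimensions and answer triples, pushing the reward model's per-dimension scores to reflect clinical quality orderings rather than a single opaque scalar.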

The results, evaluated on the authoritative medical benchmark HealthBench, demonstrate significant improvements. The model, based on Qwen-32B and named Shanzhi-M1, achieved a score of 62.7 on the full HealthBench subset, outperforming all open-source models and most closed-source ones, including OpenAI O3 (59.8) and Gemini 2.5 Pro. On the Hard subset, which includes high-complexity tasks like cross-lingual inputs, it scored 44.7, making it one of only two models globally to exceed 40 points, alongside GPT-5. As shown in Figure 3 and Figure 4, these gains represent a 45% improvement over the base model on the full subset and 85% on the Hard subset. The model also excelled in specific clinical scenarios, leading in areas like Emergency Referrals (74.3) and Communication (69.6), as detailed in Table 1. Importantly, the framework reduced expert labor costs by over 90% by minimizing manual annotation, relying instead on synthetic data and geometric constraints to maintain performance without compromising clinical effectiveness.

The implications of this research are substantial for healthcare AI, offering a pathway to more trustworthy and deployable medical assistants. By aligning AI with structured medical standards, MR-RML addresses key bottlenecks in professionalism, interpretability, and cost. It enables models to handle complex, real-world tasks with greater consistency, potentially improving patient care through better diagnostic support and communication. The framework's scalability, via synthetic data and reduced reliance on experts, could accelerate adoption in diverse medical settings, from emergency rooms to chronic disease management. However, the study acknowledges limitations: future work is needed to expand the 3D standard system to more specialized disciplines like radiology, integrate multi-modal data such as imaging, and adapt geometric constraints for real-time guideline updates. These steps will be crucial for ensuring the framework remains aligned with evolving medical practices and enhances clinical utility over time.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn