Medical imaging analysis is a complex, multi-step process where radiologists must navigate through entire 3D scans, adjust settings, and use specialized tools to reach a diagnosis. However, most AI evaluations for medical vision-language models have relied on simplified setups, testing models on pre-selected 2D images rather than full studies. This misses the core of real-world diagnostics, where evidence must be gathered interactively across multiple slices and modalities. A new study introduces a framework that shifts this paradigm, enabling AI agents to operate dynamically within standard medical viewers like 3D Slicer, but it uncovers a surprising limitation: while agents can navigate scans, they struggle to use professional tools effectively.
The researchers developed MedOpenClaw, an auditable runtime that links vision-language models to medical viewers, allowing agents to perform actions such as selecting series, scrolling through slices, and adjusting window settings. On top of this, they created MedFlow-Bench, a benchmark for full-study medical imaging analysis that evaluates agents across two clinical modules: multi-sequence brain MRI and lung CT/PET. The benchmark uses a three-track design to test different capabilities, including viewer-only navigation and tool-use with expert modules. Initial evaluations with state-of-the-art models like GPT-5.4 and Gemini-3.1-pro revealed that while agents can solve basic study-level tasks through viewer navigation, their performance paradoxically degrades when given access to advanced tools due to a lack of precise spatial grounding.
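To make this concrete, the sketch below shows, in Python, what a bounded and auditable viewer-action interface of this kind could look like. The class and method names (ViewerSession, act, and the action names select_series, scroll, set_window) are illustrative assumptions, not MedOpenClaw's actual API.

```python
# Hypothetical sketch of a bounded, auditable viewer-action interface.
# Names and parameters are assumptions for illustration, not the real runtime API.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ViewerAction:
    """One viewer action issued by the agent, kept for the audit trail."""
    name: str                       # e.g. "select_series", "scroll", "set_window"
    params: dict[str, Any] = field(default_factory=dict)


class ViewerSession:
    """Wrapper around a medical viewer: the agent can only work through act(),
    and every action is appended to an in-memory log."""

    def __init__(self) -> None:
        self.log: list[ViewerAction] = []

    def act(self, name: str, **params: Any) -> ViewerAction:
        action = ViewerAction(name, params)
        self.log.append(action)     # nothing happens off the record
        return action


# Example agent turn: pick a sequence, move to a slice, adjust windowing.
session = ViewerSession()
session.act("select_series", series="T1_post_contrast")
session.act("scroll", slice_index=87)
session.act("set_window", center=40, width=80)
print([a.name for a in session.log])
```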
The methodology centers on MedOpenClaw, which provides a bounded interface for agents to interact with medical viewers without executing arbitrary code, ensuring auditability by logging every action and evidence artifact. The runtime organizes actions into three layers: primitive viewer actions for navigation, evidence operations for capturing views, and optional expert tools for advanced analysis like segmentation. MedFlow-Bench builds on this by defining study-level episodes that include full volumetric exams, task prompts, and answer schemas, evaluated under multiple-choice and open-ended protocols. The benchmark's tracks separate viewer-native tasks from tool-augmented execution, allowing for controlled comparisons of model capabilities in realistic clinical workflows.
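As an illustration of what such a study-level episode might contain, the sketch below lays out one plausible structure in Python; the field names and values are assumptions made for this example, not the benchmark's published schema.

```python
# Assumed, simplified representation of one MedFlow-Bench-style episode.
episode = {
    "study_id": "brain_mri_0001",
    "series": ["T1", "T1_post_contrast", "T2", "FLAIR"],  # full volumetric exam
    "task_prompt": "Identify the lobe containing the enhancing lesion.",
    "protocol": "multiple_choice",                        # or "open_ended"
    "answer_schema": {
        "type": "choice",
        "options": ["frontal", "parietal", "temporal", "occipital"],
    },
    "track": "viewer_only",        # vs. "tool_use" with expert modules enabled
}
```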
Results from the experiments show that in the Viewer-Only track, models achieved moderate success, with Gemini-3.1-pro reaching a case-level accuracy of 0.63 in brain MRI and GPT-5.4 scoring 0.46 in tumor location tasks for lung CT/PET. However, performance dropped on fine-grained tasks like histopathological grade prediction, where models struggled to surpass random chance. More critically, in the Tool-Use track, equipping agents with segmentation toolpacks led to a decrease in accuracy; for example, GPT-5.4's brain MRI accuracy fell from 0.61 to 0.57, and lung CT/PET performance dropped from 0.32 to 0.27. This decline occurred because agents lacked the millimeter-level spatial precision needed to guide tools effectively, often generating misaligned masks that misled diagnostic reasoning.
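For reference, the snippet below shows one plausible way a case-level accuracy like the figures quoted above could be computed, assuming a single reference answer per case; the paper's exact aggregation rule may differ.

```python
# Illustrative metric only; the benchmark's actual scoring protocol may differ.
def case_level_accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of cases whose predicted answer matches the reference answer."""
    if not references:
        return 0.0
    correct = sum(1 for case_id, ref in references.items()
                  if predictions.get(case_id) == ref)
    return correct / len(references)


preds = {"case_01": "temporal", "case_02": "frontal", "case_03": "parietal"}
refs = {"case_01": "temporal", "case_02": "occipital", "case_03": "parietal"}
print(case_level_accuracy(preds, refs))  # 2 of 3 cases correct -> ~0.67
```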
The implications of these findings are significant for the future of medical AI. The "tool-use paradox" highlights a bottleneck in spatial grounding, where current vision-language models excel at logical reasoning but fail at the precise control required for clinical tools. This gap must be addressed for AI to reliably assist in real-world settings. Additionally, the auditable nature of MedOpenClaw addresses a key requirement for clinical trust, as it provides transparent traces of every action, making AI decisions reviewable and compliant with regulatory standards. By bridging benchmarks with practical applications, improvements in this framework could directly enhance human-in-the-loop systems like MedCopilot, which aims to reduce manual overhead for clinicians by automating viewer interactions.
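To suggest what such a reviewable trace might look like in practice, here is a hedged sketch that serializes each action as a JSON line; the record fields (timestamp, actor, action, params, evidence) are assumptions chosen for illustration, not the runtime's actual log format.

```python
# Hypothetical audit-trace writer; field names are illustrative assumptions.
import json
import time
from typing import Any, Optional


def log_action(trace_file, actor: str, action: str,
               params: dict[str, Any], evidence: Optional[str] = None) -> None:
    """Append one reviewable JSON record per agent action."""
    record = {
        "timestamp": time.time(),
        "actor": actor,        # e.g. the model driving the session
        "action": action,      # e.g. "scroll", "set_window", "run_segmentation"
        "params": params,
        "evidence": evidence,  # path to a captured view or mask, if any
    }
    trace_file.write(json.dumps(record) + "\n")


with open("episode_trace.jsonl", "w") as f:
    log_action(f, "gpt-5.4", "select_series", {"series": "FLAIR"})
    log_action(f, "gpt-5.4", "scroll", {"slice_index": 42},
               evidence="views/slice_42.png")
```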
Limitations of the current work include its focus on only two clinical modules (brain MRI and lung CT/PET), which may not represent the full diversity of medical imaging. The researchers note that future releases plan to expand to other modalities like ultrasound and mammography, introduce multi-turn conversational evaluations, and integrate more tools from ecosystems like MONAI. These steps will help test model capabilities as spatial grounding improves. Overall, this study establishes a foundation for evaluating medical agents in realistic, full-study contexts, showing that viewer-native reasoning is feasible but tool-augmented execution remains a compelling challenge for the AI community to solve.