Large language models (LLMs) are increasingly used in scientific research, but their trustworthiness in high-stakes applications remains a critical concern. A new study reveals that even science-specialized AI models exhibit significant vulnerabilities in truthfulness, safety, and ethics, raising alarms about their deployment in sensitive fields like biosecurity and weapons research. This evaluation provides a clear framework for assessing AI reliability, essential for researchers and policymakers aiming to prevent costly errors or harmful outcomes.
Researchers from Oak Ridge National Laboratory developed SciTrust 2.0, a comprehensive framework to evaluate LLM trustworthiness across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. They tested seven prominent LLMs, including general-purpose models like GPT-o4-Mini and Claude-Sonnet-3.7, and science-specialized ones such as SciGLM-6B and Galactica-120B. The study found that general-purpose models consistently outperformed science-specialized ones in nearly all areas, with GPT-o4-Mini showing the highest overall trustworthiness, including strong resistance to hallucinations and adversarial attacks.
The methodology involved creating novel benchmarks through a reflection-tuning pipeline, in which models generated and refined question-answer pairs from scientific literature; the pairs were then validated by experts in fields like chemistry and biology. Truthfulness was assessed with multiple-choice tests (e.g., SciQ, MMLU) and open-ended evaluations scored with metrics like ROUGE-L and BERTScore. Adversarial robustness was tested by perturbing inputs at the character, word, and sentence level to simulate real-world noise. Safety evaluations used benchmarks like WMDP for biosecurity and cybersecurity and HarmBench for high-risk behaviors, while ethics was assessed using scenarios in areas such as dual-use research and genetic modification.
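To make two of these ingredients concrete, here is a minimal Python sketch of a ROUGE-L style score and a character-level perturbation. It is an illustration under stated assumptions, not the study's code: the function names, the whitespace tokenization, the use of a plain F1 over the longest common subsequence, and the adjacent-letter swap are all choices made for this example.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 over whitespace tokens (tokenization is an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def swap_adjacent_chars(text, rate, seed=0):
    """Hypothetical character-level perturbation: swap neighboring letters.

    Each pair of adjacent letters is swapped with probability `rate`;
    a fixed seed keeps the noise reproducible.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)
```

In a robustness check of this kind, one would compare a model's score on the original question with its score on the perturbed version; a large drop suggests the model is sensitive to surface noise.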
Results showed stark disparities: science-specialized models had significant deficiencies in logical reasoning, with FORGE-L and Galactica-120B underperforming in tasks requiring inference. In safety tests, models like SciGLM-6B and Galactica-120B exhibited high attack success rates in chemical and biological weapons categories, indicating potential misuse risks. Ethics evaluations revealed that general-purpose models achieved near-perfect scores in identifying unethical scenarios, whereas science-specialized ones, such as Darwin1.5-7B, performed poorly, suggesting a lack of embedded ethical frameworks. Hallucination detection further highlighted vulnerabilities, with science-specialized models showing higher rates of generating false information.
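The attack success rate mentioned above is, in its usual form, the fraction of harmful prompts for which a model produced the harmful behavior. The short Python sketch below shows how such a rate could be tallied per category. The record layout and the function name are assumptions made for illustration, and the sample records are made-up placeholders, not results from the study; deciding whether a response counts as compliance would be done separately by a judge (human or classifier).

```python
from collections import defaultdict


def attack_success_rate(records):
    """Per-category share of prompts where the model complied.

    `records` is an iterable of (category, complied) pairs, where
    `complied` is True if the response was judged harmful.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, complied in records:
        totals[category] += 1
        successes[category] += int(complied)
    return {cat: successes[cat] / totals[cat] for cat in totals}


# Placeholder judgments for illustration only.
sample = [
    ("chemical", True),
    ("chemical", False),
    ("biological", True),
    ("biological", True),
]
rates = attack_success_rate(sample)
```

A lower rate is better here: it means the model refused or deflected more of the harmful requests in that category.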
These findings matter because they underscore the risks of relying on AI in scientific applications without rigorous checks. For instance, in fields like medicine or environmental science, inaccurate or unsafe AI outputs could lead to experimental failures or ethical breaches. The study's open-source framework, available on GitHub, offers a foundation for developing more trustworthy AI systems, helping researchers and developers prioritize safety and integrity in high-stakes scenarios.
Limitations of the research include its focus on text-based interactions, leaving multimodal data like images and molecular structures for future work. Additionally, the study did not explore correlations between benchmark performance and real-world outcomes, indicating a need for further validation with domain experts. By addressing these gaps, future research can enhance AI reliability, ensuring that technological advances do not compromise scientific integrity or public safety.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.