AI Chatbots Struggle with Health Questions in Nepali

A new study reveals that large language models often fail to provide adequate, safe, and culturally appropriate responses to sensitive sexual and reproductive health queries in low-resource languages like Nepali, highlighting critical gaps in AI reliability.

AI Research
March 26, 2026
4 min read

As artificial intelligence chatbots become more integrated into daily life, they are increasingly used for personal and sensitive queries, including sexual and reproductive health (SRH), where users seek anonymous, non-judgmental advice. However, a new study evaluating large language models (LLMs) in Nepali, a low-resource language, reveals significant shortcomings in their ability to provide reliable and safe responses. The research, which assessed over 14,000 SRH queries from more than 9,000 users across Nepal, found that only 35.1% of conversations met the criteria for a "proper" response, meaning they were accurate, adequate, and free from major usability or safety gaps. This highlights a pressing need for improvement in AI systems, especially as they are deployed in healthcare settings where errors can have serious consequences.

The study introduced the LLM Evaluation Framework (LEAF), a multi-criteria assessment tool that goes beyond traditional accuracy metrics to evaluate language, usability gaps, and safety gaps. Usability gaps include factors such as relevance, adequacy, and cultural appropriateness, while safety gaps cover issues such as safety, sensitivity, and confidentiality. Using this framework, researchers manually annotated responses generated by two variants of GPT-3.5: ChatGPT1 (vanilla GPT-3.5) and ChatGPT2 (GPT-3.5 with retrieval-augmented generation). The results showed that while 62.1% of responses were accurate, 43.8% of those accurate responses still had additional usability or safety gaps, indicating that accuracy alone is insufficient to ensure quality in a sensitive domain like SRH.
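To make the framework concrete, here is a minimal sketch of what a single LEAF annotation might look like, together with the study's "proper response" criterion (accurate, adequate, and free of major usability or safety gaps). The field names and the exact combination rule are illustrative assumptions; the paper defines the criteria, not this schema.

```python
from dataclasses import dataclass

@dataclass
class LeafAnnotation:
    """One manually annotated chatbot response under LEAF.

    Field names are hypothetical; the paper names the criteria
    (language, usability gaps, safety gaps) but not this layout.
    """
    accurate: bool            # factually correct content
    response_in_nepali: bool  # language criterion
    # Usability-gap criteria
    relevant: bool
    adequate: bool
    culturally_appropriate: bool
    # Safety-gap criteria
    safe: bool
    sensitive: bool           # i.e., not insensitive or offensive
    confidential: bool

    def is_proper(self) -> bool:
        """'Proper' = accurate, adequate, and free of major usability
        or safety gaps (assumed combination rule)."""
        return (
            self.accurate
            and self.relevant
            and self.adequate
            and self.culturally_appropriate
            and self.safe
            and self.sensitive
            and self.confidential
        )
```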

Methodologically, the study involved developing the LEAF framework through an iterative process with SRH experts, building a conversation-handling platform to manage user interactions, and applying the framework in community settings across Nepal. Users, including community participants and Female Community Health Volunteers, interacted with the chatbots for about 30 minutes each, asking questions in Nepali using either Devanagari or Romanized script. The platform integrated GPT-3.5 as the primary model after testing nine open-source LLMs, none of which matched its performance in Nepali. Data collection ran from October 2023 to February 2024, with field mobilizers conducting outreach sessions to guide users, and ethical clearance was obtained from the Nepal Health Research Council.
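The study's conversation-handling platform is not published, but as a rough illustration of the kind of call such a platform would make, the sketch below sends one Nepali query to GPT-3.5 through the OpenAI Python SDK. The system prompt, parameters, and example query are assumptions, not the authors' configuration.

```python
# Minimal sketch of sending a Nepali SRH query to GPT-3.5 via the
# OpenAI Python SDK. This is NOT the study's platform; the system
# prompt and example query are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_srh_question(user_query: str) -> str:
    """Send one user query (Devanagari or Romanized Nepali) and
    return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer sexual and reproductive health "
                        "questions accurately and respectfully, "
                        "replying in the user's own language."},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# Example query in Devanagari script ("What are the methods of
# family planning?")
print(ask_srh_question("परिवार नियोजनका साधनहरू के के हुन्?"))
```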

Analysis of the conversations revealed several key insights. In terms of language, 84.2% of user queries were in Nepali, but only 73.5% of responses were in the same language, with 7.5% being multilingual mixes of Nepali, English, and Hindi. Usability gaps were prevalent: 74.0% of responses were labeled inadequate, meaning they failed to fully address the question, and 14.9% were irrelevant. Safety gaps, though less frequent, included 98 unsafe responses (0.74% of the total), 51 insensitive or offensive responses (0.38%), and 7 non-confidential responses (0.05%). A small-scale comparison with GPT-4 showed improvement, with accuracy rising from 26% to 50% on a sample of 100 queries, but gaps remained, such as poor performance on Romanized Nepali inputs.
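As a follow-on to the annotation sketch above, the snippet below shows one way such aggregate rates (inadequate, irrelevant, unsafe, and so on) could be computed from a list of labeled responses. It reuses the hypothetical LeafAnnotation class and is not the authors' analysis code.

```python
# Aggregate annotation records into the kinds of rates the study
# reports. Assumes the LeafAnnotation class from the earlier sketch.
def summarize(annotations: list[LeafAnnotation]) -> dict[str, float]:
    n = len(annotations)
    if n == 0:
        return {}
    return {
        "proper":           sum(a.is_proper() for a in annotations) / n,
        "accurate":         sum(a.accurate for a in annotations) / n,
        "inadequate":       sum(not a.adequate for a in annotations) / n,
        "irrelevant":       sum(not a.relevant for a in annotations) / n,
        "unsafe":           sum(not a.safe for a in annotations) / n,
        "insensitive":      sum(not a.sensitive for a in annotations) / n,
        "non_confidential": sum(not a.confidential for a in annotations) / n,
    }
```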

The implications of these findings are significant for the deployment of AI in healthcare, particularly in low-resource settings. The study underscores that LLMs, while promising for scaling access to SRH information, require improvements in adequacy, conciseness, and user-friendliness before they can be considered reliable. The LEAF framework offers an adaptable tool for evaluating AI responses across languages and domains, but the research also highlights practical challenges, such as facilitation bias in workshop settings and the resource-intensive nature of manual annotation. For policymakers and developers, this work emphasizes the need for continuous assessment of, and investment in, AI systems that can safely address sensitive topics, ensuring they meet ethical standards and user needs in diverse cultural contexts.

Limitations of the study include potential biases from the facilitated workshop environment, which may not reflect real-world private use, and underrepresentation of vulnerable groups such as the elderly and very young adolescents. The manual labeling process was time-consuming, leading to some missing labels, and the reliance on GPT-3.5, chosen for its free availability, means the findings may not fully apply to newer models like GPT-4. Additionally, the study did not assess user experience or fairness in responses, pointing to areas for future research. Despite these constraints, the research provides critical evidence of the gaps in AI performance for low-resource languages and sensitive health topics, urging caution and further refinement before widespread adoption.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn