
AI Nutrition Chatbots Fail Real-World Test

A landmark study reveals that large language models, despite promising lab results, offer no consistent benefits for diet coaching or emotional support in a seven-week trial with 81 participants.

AI Research
March 27, 2026
4 min read

As large language models (LLMs) increasingly find their way into healthcare applications, from symptom checking to mental health support, a critical question remains: do they actually help people in real-world settings? In nutrition, a domain where daily decisions impact long-term health, the hype around AI-powered chatbots has been particularly pronounced, with claims of personalized meal advice and empathetic coaching. However, a new study published on arXiv provides the first rigorous, real-world evaluation of LLMs in this sensitive area, and the results are sobering. Despite performing well in controlled lab tests, LLM-enhanced features failed to deliver meaningful improvements in dietary adherence, emotional well-being, or user engagement over a seven-week randomized controlled trial. This research, conducted by a team from Charles University, Saarland University, TU Darmstadt, and the University of Aberdeen, challenges the assumption that advanced language models automatically translate to better health outcomes, highlighting a significant gap between intrinsic AI performance and extrinsic human impact.

The core finding of the study is stark: LLM-based features added to a diet-coaching chatbot had little to no consistent effect on any of the measured outcomes. Researchers integrated two AI capabilities into a rule-based chatbot: a rephrasing module to make templated responses more conversational and engaging, and a nutritional counselling model fine-tuned on expert-annotated data to provide tailored support for dietary struggles. In a trial with 81 participants, they compared three groups: one using only the base chatbot with templated insights, another with the rephrasing feature, and a third with both rephrasing and counselling. Over seven weeks, participants logged meals via MyFitnessPal and interacted with the chatbot, with outcomes tracked for dietary goals, emotional state via the PANAS questionnaire, and engagement metrics. The data showed no significant benefits from the LLM enhancements; for example, improvements in dietary adherence were minimal, typically between 1% and 7.5%, and statistical analysis revealed only one isolated significant result—for carbohydrate goals in the full-feature group compared to the rephrased group—which the authors deemed spurious in context.
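Dietary adherence in the trial was scored as distance from each participant's personal intake goals. As a minimal sketch, assuming a simple relative-distance definition (the paper's exact formula may differ), such a metric could look like this:

```python
def goal_distance(logged: float, goal: float) -> float:
    """Relative distance of logged intake from a personal goal (0.0 = exactly on target)."""
    return abs(logged - goal) / goal

# Hypothetical one-day food-diary totals vs. personal goals (kcal / grams).
goals = {"calories": 2000, "carbs": 250, "protein": 90}
logged = {"calories": 2150, "carbs": 240, "protein": 70}

# Per-nutrient distances; lower means closer adherence to the goal.
distances = {nutrient: goal_distance(logged[nutrient], goals[nutrient]) for nutrient in goals}
# e.g. distances["calories"] == 0.075, i.e. 7.5% off the calorie goal
```

Averaging such per-nutrient distances per group over the seven weeks would give the kind of adherence outcome the trial compared across its three arms.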

The methodology centered on a pragmatic, human-centered approach, moving beyond typical AI benchmarks to a randomized controlled trial (RCT), the gold standard in healthcare evaluation. The researchers started with an existing rule-based chatbot that provided personalized nutritional insights based on users' food diaries, such as calorie and nutrient recaps. They then augmented it with two LLM features: the rephrasing module used prompt engineering with models like Llama 3 8B to vary templated messages while maintaining clarity, and the nutritional counselling model was fine-tuned on the HAI-Coaching dataset, containing about 2.4K dietary struggles paired with expert responses across categories like reflection, comfort, and suggestion. The trial design involved random assignment to the three groups, with participants required to engage with the chatbot at least five times weekly and complete weekly emotional well-being assessments. This setup allowed for direct comparison of AI-enhanced interactions against a controlled baseline, with metrics derived from real user behavior rather than simulated tasks.
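The rephrasing module relied on prompt engineering rather than fine-tuning. The study's actual prompts are not reproduced here, so the following is a hypothetical sketch of how a templated insight might be wrapped in a rephrasing instruction before being sent to a model like Llama 3 8B:

```python
def build_rephrase_prompt(template_msg: str) -> str:
    """Wrap a templated coaching message in a rephrasing instruction.

    The instruction wording is illustrative, not the study's actual prompt.
    """
    return (
        "Rewrite the following diet-coaching message so it sounds natural and "
        "conversational. Keep every number and fact unchanged, and keep it concise.\n\n"
        f"Message: {template_msg}"
    )

prompt = build_rephrase_prompt(
    "Yesterday you logged 2150 kcal. Your daily goal is 2000 kcal."
)
```

Constraining the model to preserve numbers and facts matters here: the rephrasing feature was meant to change tone, not content, so the templated insight stays the single source of nutritional truth.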

Analysis of the results revealed a disconnect between AI performance in lab evaluations and real-world impact. In intrinsic tests, the rephrasing feature was preferred by 65% of human evaluators for its naturalness, and the counselling model achieved high scores on automatic metrics like BLEU-3 and BLEURT. However, in the RCT, these advantages did not translate. Dietary outcomes, measured as distance from personal intake goals for calories, carbs, protein, fat, sodium, and sugar, showed no consistent improvements; mixed-effects models found almost no significant differences between groups. Emotional well-being scores for positive and negative affect also displayed negligible changes, with no statistical significance. Engagement metrics, such as interactions and conversations with the chatbot, declined over time for all groups, though the full-feature group had slightly higher interaction days, likely due to the extra counselling prompts. User feedback further underscored limitations, with participants citing issues like generic advice from the counselling model and frustrations with the chatbot's natural language understanding, particularly around date formats and typos.
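The emotional well-being outcome came from the PANAS questionnaire, which produces separate positive- and negative-affect scores by summing item ratings on a 1-5 scale. A minimal scoring sketch, with abbreviated item lists (the full instrument has ten items per subscale, and how the study aggregated weekly responses is an assumption):

```python
# Abbreviated PANAS subscales for illustration; the real instrument has ten items each.
POSITIVE_ITEMS = ["interested", "excited", "enthusiastic"]
NEGATIVE_ITEMS = ["distressed", "upset", "nervous"]

def panas_scores(ratings: dict[str, int]) -> tuple[int, int]:
    """Sum 1-5 item ratings into (positive affect, negative affect) scores."""
    pos = sum(ratings[item] for item in POSITIVE_ITEMS)
    neg = sum(ratings[item] for item in NEGATIVE_ITEMS)
    return pos, neg

# One participant's hypothetical weekly responses.
weekly = {"interested": 4, "excited": 3, "enthusiastic": 4,
          "distressed": 2, "upset": 1, "nervous": 2}
pos, neg = panas_scores(weekly)  # (11, 5)
```

Comparing the weekly trajectories of these two scores across the three trial arms is what the mixed-effects analysis found to be statistically indistinguishable.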

The implications of this study are profound for the deployment of AI in healthcare and beyond. It demonstrates that even when LLMs excel at generating human-like text and perform well on standard benchmarks, they may not deliver tangible benefits in complex, real-world scenarios like sustained behavior change. In nutrition, where evidence-based practices are crucial, the lack of impact suggests that current AI approaches might be insufficient for addressing the psychological and behavioral nuances of dieting. The researchers caution against adopting LLMs in sensitive domains without rigorous extrinsic validation, emphasizing that interdisciplinary, human-centered design is essential to bridge the gap between technological promise and practical utility. This work serves as a critical reminder that AI evaluation must extend beyond intrinsic metrics to include real human outcomes, especially in fields where trust and safety are paramount.

Several limitations of the study point to areas for future research. The chatbot's natural language understanding component struggled with varied user inputs, such as different date formats, which affected user experience. Additionally, the nutritional counselling model was trained on a dataset that included overly generic advice, limiting its personalization, and it did not integrate with users' food diary data to tailor suggestions—a shortcoming noted by participants. The trial used relatively small models like Llama 3 8B due to hardware constraints, leaving open the possibility that larger models might perform better, though practical deployment needs like fast inference times were a priority. The seven-week duration, while pragmatic, may have been too short to observe effects, and the study's focus on a task-specific chatbot, necessary for safety, contrasts with the adaptability of open-domain models like ChatGPT, which users increasingly expect. These factors highlight the trade-offs between specialization and flexibility in AI applications for healthcare.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn