Ethics

AI Advising for Students Shows Hidden Risks

A new study reveals that large language models often give incomplete or unsupported answers to study-abroad questions, with distinct behavioral patterns that could mislead students if not properly monitored.

AI Research
March 26, 2026
4 min read

Large language models are increasingly being used to answer high-stakes questions about study-abroad processes, such as admissions, visas, scholarships, and eligibility. This shift from general chat applications to domain-specific advising raises critical concerns about reliability, as students rely on these tools for decisions that can impact their education and finances. The study by ApplyBoard researchers provides a clear, domain-grounded overview of how current LLMs behave in this setting, evaluating both accuracy and hallucination using realistic questions from their advising workflows. The findings highlight that while models can be helpful, they often drift into unsupported claims or incomplete answers, posing risks for deployment in educational contexts where trust and accuracy are paramount.

Researchers discovered that models achieve broadly similar aggregate performance but exhibit distinct behavioral profiles when answering multi-domain study-abroad questions. Using a dataset drawn from ApplyBoard's internal workflows, they evaluated models on accuracy—categorized as correct, partial, or wrong—and hallucination, which includes unsupported or off-topic content. For example, in Figure 1, accuracy comparisons show that models like GPT-5 and Claude 3.7 Sonnet perform strongly, but partial answers are common, indicating that many responses miss crucial details or add unnecessary domains. The study found that models prioritizing adherence to evidence, such as Gemini variants, minimize hallucinations, while those focused on topical relevance, like Claude models, often add unsupported atomic details, as shown in Table 1 with metrics like faithfulness and ANAH v2 scores.
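To make the three-level rubric concrete, here is a minimal sketch, with hypothetical labels and partial-credit weights, of how per-model accuracy figures like those in Figure 1 could be aggregated from correct/partial/wrong judgments. The paper does not publish its scoring code, so treat this as illustrative.

```python
from collections import Counter

# Hypothetical rubric labels for one model's answers:
# "correct" - main domain addressed and key facets covered
# "partial" - main domain addressed but facets missing or extra domains added
# "wrong"   - main domain missed or the answer is factually incorrect
labels = ["correct", "partial", "correct", "wrong", "partial", "correct"]

# Illustrative partial-credit weights; the paper's exact weighting may differ.
WEIGHTS = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}

def accuracy_breakdown(labels):
    """Share of correct / partial / wrong answers plus a partial-credit score."""
    counts = Counter(labels)
    total = len(labels)
    shares = {level: counts.get(level, 0) / total for level in WEIGHTS}
    shares["weighted_score"] = sum(WEIGHTS[label] for label in labels) / total
    return shares

print(accuracy_breakdown(labels))
# e.g. {'correct': 0.5, 'partial': 0.33, 'wrong': 0.17, 'weighted_score': 0.67}
```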

The methodology involved a domain-grounded evaluation protocol that jointly assesses accuracy and hallucination. Researchers used a three-level rubric for accuracy: correct, partial, or wrong, with partial answers capturing cases where the main domain is addressed but some facets are missing or extra domains are introduced. They measured factual correctness without references to test stable knowledge, and answer accuracy with references for time-sensitive rules, employing metrics from RAGAS such as factual correctness and NVIDIA answer accuracy. For hallucination, they operationalized intrinsic hallucination with faithfulness, answer relevancy, and ANAH v2, which analyzes segment-level unsupported claims. All models were tested under the same system prompt and question set for a fair comparison, as detailed in the paper's sections on accuracy and hallucination benchmarks.
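The faithfulness and answer relevancy metrics the paper cites come from the RAGAS toolkit. Below is a minimal sketch of how such a check might look; the question, answer, and context are illustrative, ragas' API details vary across versions, and an LLM judge (for example, an OpenAI key in the environment) is required for the metrics to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One illustrative advising question, a model's answer, and the reference
# passages the answer is expected to stay grounded in.
rows = {
    "question": ["Can I work off campus while studying in Canada?"],
    "answer": ["Yes, eligible full-time students may work off campus during terms."],
    "contexts": [[
        "Study permit holders enrolled full-time at a designated learning "
        "institution may be eligible to work off campus during academic sessions."
    ]],
}
dataset = Dataset.from_dict(rows)

# Faithfulness scores whether claims in the answer are supported by the
# contexts; answer relevancy scores whether the answer stays on the question.
# Both rely on an LLM judge under the hood.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```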

The analysis reveals specific patterns across models. In terms of accuracy, thinking-style models that use step-by-step reasoning often expand answer scope, which can help on multi-domain questions but also nudges responses from correct to partial by over-including less-relevant domains, as noted in the evaluation. Without references, this leads to more partial answers; with references, it can improve alignment but also increases the risk of extrapolation beyond the evidence. For hallucination, Figure 2 shows that Gemini models are the most faithful, adhering closely to references, while Figure 3 indicates that Claude models achieve the strongest answer relevancy. However, Figure 4 and Table 1 demonstrate that ANAH v2 scores vary: GPT thinking variants and Gemini Pro show low segment-level hallucination, whereas Claude models and open-source baselines accumulate more unsupported claims, highlighting a divergence between topical focus and grounding.

The implications of these findings are significant for real-world deployment in education and advising. The study shows that no single model excels in all areas; instead, practitioners must weigh task-specific needs. For policy-sensitive flows like visa eligibility, reference-bound models like Gemini are preferable due to their high faithfulness and low hallucination risk. For more exploratory, relevance-oriented queries, high-relevancy models like Claude can be useful but require safeguards such as citation requirements or abstention when evidence is insufficient. The evaluation protocol offers a reproducible framework for auditing LLMs before deployment, enabling better model selection and prompt design to mitigate risks like wasted time or financial loss for students.
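One way to implement the abstention safeguard mentioned above is to gate answers on whether each claim can be tied to a retrieved reference. The sketch below uses hypothetical helpers (`extract_claims`, `is_supported`) standing in for whatever claim extraction and entailment checks a team already runs; it is not the paper's method.

```python
# Hypothetical deployment guard: only show an answer when every claim is
# backed by a reference; otherwise escalate to a human advisor.

def answer_or_abstain(answer, references, extract_claims, is_supported):
    """Return the answer with citations, or an escalation when evidence is thin.

    extract_claims(answer) -> list of atomic claims (placeholder)
    is_supported(claim, reference) -> bool entailment check (placeholder)
    """
    unsupported = [
        claim
        for claim in extract_claims(answer)
        if not any(is_supported(claim, ref) for ref in references)
    ]
    if unsupported:
        # Policy-sensitive flows (visas, eligibility) should abstain rather
        # than ship unsupported claims to a student.
        return {"status": "escalate", "unsupported_claims": unsupported}
    return {"status": "answer", "text": answer, "citations": references}
```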

Limitations of the study include its focus on intrinsic hallucination, which measures deviations from provided references, rather than extrinsic hallucination that probes open-world factuality without access to references. The evaluation used a fixed snapshot of references, which may not capture real-time policy changes, and the HHEM-2.1 neural hallucination detector was excluded due to token limits that truncated longer answers. Additionally, the study's dataset is specific to ApplyBoard's workflows, which may not generalize to all study-abroad contexts, and the partial-credit rubric, while practical, relies on human judgment that could introduce variability. These constraints suggest that ongoing monitoring and adaptation are necessary for safe LLM deployment in dynamic advising environments.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn