Artificial intelligence systems designed to detect depression from clinical interviews may be learning more from the interviewer's script than from the patient's own words. A recent analysis of three major depression datasets—ANDROIDS, DAIC-WOZ, and E-DAIC—shows that models trained solely on interviewer prompts can match or even outperform those trained on participant responses, raising concerns about the validity of such automated assessments. This finding suggests that what appears to be advanced AI capability might instead be a shortcut exploiting the structured nature of semi-structured interviews, where fixed questions and their positions provide unintended cues. For non-technical readers, this means that AI tools in mental health could be making decisions based on artifacts of the interview process rather than genuine linguistic indicators of depression, potentially misleading clinicians and researchers who rely on these systems for support.
The researchers discovered that across all three datasets, interviewer-only models consistently achieved high classification scores for depression detection. In the ANDROIDS dataset, for example, an interviewer-only model using a Longformer architecture achieved a macro-F1 score of 0.98 on the development set, outperforming the participant-only model by 19%. Similarly, in DAIC-WOZ, interviewer-only models with graph convolutional networks (GCNs) scored 0.88 compared to 0.85 for participant-only models. This pattern indicates that the AI systems are leveraging systematic biases in the interview scripts, such as recurring prompts or specific question sequences, rather than analyzing the content of patient responses. The effect was observed with both transformer-based and graph-based models, showing it is not tied to a particular architecture but is a broader methodological issue.
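To make the comparison concrete, macro-F1 averages the F1 score of each class (depressed and not depressed) equally, so it is not dominated by the majority class in an imbalanced clinical dataset. The snippet below is a minimal sketch of how such a score is computed with scikit-learn; the label and prediction arrays are hypothetical and not taken from the paper.

```python
from sklearn.metrics import f1_score

# Hypothetical binary labels (1 = depressed, 0 = not depressed) and predictions
# from an interviewer-only and a participant-only classifier on the same split.
y_true           = [1, 0, 1, 1, 0, 0, 1, 0]
pred_interviewer = [1, 0, 1, 1, 0, 0, 1, 1]
pred_participant = [1, 0, 0, 1, 0, 1, 1, 0]

# Macro-F1 averages the per-class F1 scores, weighting both classes equally.
print("interviewer-only macro-F1:", f1_score(y_true, pred_interviewer, average="macro"))
print("participant-only macro-F1:", f1_score(y_true, pred_participant, average="macro"))
```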
To investigate this bias, the team employed a straightforward methodology: they trained and evaluated two variants of each model—one using only participant utterances and another using only interviewer prompts. The datasets included ANDROIDS, which features Italian interviews with minimal interviewer intervention; DAIC-WOZ, with North American English interviews conducted by a virtual interviewer named Ellie; and E-DAIC, an extension of DAIC-WOZ with fully automatic interviews. Since ANDROIDS and E-DAIC lacked complete ground-truth transcripts, the researchers used automatic speech recognition (WhisperX) to generate transcripts for both speakers, ensuring a fair comparison. This setup allowed them to isolate the contribution of interviewer prompts versus participant language in depression classification.
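The sketch below illustrates the general idea of that setup under stated assumptions: each transcript is assumed to be a list of (speaker, text) turns, and a simple TF-IDF plus logistic regression classifier stands in for the Longformer and GCN models actually used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def join_turns(transcript, role):
    """Concatenate all utterances spoken by one role ('participant' or 'interviewer').
    `transcript` is assumed to be a list of (speaker, text) turns - a hypothetical format."""
    return " ".join(text for speaker, text in transcript if speaker == role)

def train_single_speaker_model(transcripts, labels, role):
    """Train a depression classifier on one speaker's side of the interview only."""
    docs = [join_turns(t, role) for t in transcripts]
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(docs, labels)
    return model

# Two variants of the same pipeline, differing only in which speaker's text they see:
# participant_model = train_single_speaker_model(train_transcripts, train_labels, "participant")
# interviewer_model = train_single_speaker_model(train_transcripts, train_labels, "interviewer")
```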
The results, detailed in Table 1 and Table 2 of the paper, reveal that interviewer-only models often focus on narrow segments of interviews. Heatmaps in Figure 1 illustrate this: interviewer-only models show concentrated keyword evidence in specific bands, such as prompts about family or therapy, while participant-only models distribute evidence more broadly across the conversation. For instance, in ANDROIDS, interviewer models repeatedly highlighted prompts probing family context or work status, as shown in Figure 2a. In E-DAIC and DAIC-WOZ, models concentrated on questions like 'How do you cope with that?' or 'Do you still go to therapy?', ignoring other clinically relevant prompts. This selective focus suggests that the models are exploiting script artifacts rather than learning from diverse linguistic cues, which could inflate reported performance metrics in real-world applications.
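As a rough illustration of the kind of heatmap described (not the authors' code), one could bin per-utterance importance scores by their relative position in each interview; concentrated evidence then shows up as a bright band. The scoring method itself, whether attention weights, attributions, or keyword counts, is assumed to be supplied separately.

```python
import numpy as np
import matplotlib.pyplot as plt

def evidence_heatmap(per_turn_scores, n_bins=20):
    """Plot a heatmap of where a model's evidence falls within each interview.
    `per_turn_scores` is a list of 1-D arrays of importance scores, one array per
    interview; the source of the scores is whatever explanation method is in use."""
    grid = np.zeros((len(per_turn_scores), n_bins))
    for i, scores in enumerate(per_turn_scores):
        # Map each turn to a position bin based on where it occurs in the interview.
        positions = np.linspace(0, n_bins, num=len(scores), endpoint=False).astype(int)
        for pos, score in zip(positions, scores):
            grid[i, pos] += score
    plt.imshow(grid, aspect="auto", cmap="viridis")
    plt.xlabel("relative position in interview (binned)")
    plt.ylabel("interview")
    plt.colorbar(label="evidence mass")
    plt.show()
```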
These findings have significant implications for the development and evaluation of AI in mental health. If models rely on interviewer biases, they may not accurately capture depression symptoms from patient language, leading to unreliable tools for clinicians. The study emphasizes the need for bias-aware evaluation protocols, such as isolating participant-only turns or developing methods that account for prompt-induced shortcuts. For general audiences, this highlights the importance of scrutinizing AI systems in healthcare to ensure they are learning from meaningful data rather than spurious patterns. As AI becomes more integrated into clinical settings, understanding and mitigating such biases is crucial to maintaining trust and effectiveness in automated mental health assessments.
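A simple, hypothetical audit along these lines could compare the two model variants on the same held-out set and flag a dataset whenever the interviewer-only model keeps pace with the participant-only one; the margin used here is arbitrary, not a threshold from the paper.

```python
from sklearn.metrics import f1_score

def prompt_bias_audit(y_true, pred_participant_only, pred_interviewer_only, margin=0.02):
    """Flag possible prompt leakage when a model that never sees the patient's words
    performs on par with (or better than) one that sees only the patient's words."""
    f1_participant = f1_score(y_true, pred_participant_only, average="macro")
    f1_interviewer = f1_score(y_true, pred_interviewer_only, average="macro")
    return {
        "participant_macro_f1": f1_participant,
        "interviewer_macro_f1": f1_interviewer,
        "possible_prompt_leakage": f1_interviewer >= f1_participant - margin,
    }
```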
However, the study has limitations that warrant caution. The use of automatic speech recognition for ANDROIDS and E-DAIC introduces potential noise from transcription errors, which could affect the comparison between participant and interviewer models. Additionally, the analysis is restricted to text data, omitting acoustic and visual features that might provide complementary signals or alter the bias observed. The researchers note that ground-truth transcripts for these datasets would allow a more precise estimation of prompt-induced bias. Future work should explore multimodal aspects to see if the bias persists when other modalities are included, ensuring a comprehensive understanding of AI performance in clinical interview analysis.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.