
AI Safety Tests Miss Critical Risks in Caregiving Conversations

A new benchmark reveals that leading AI models fail to detect up to 88% of crisis signals in multi-turn conversations, exposing dangerous gaps in safety testing for vulnerable populations.

AI Research
March 26, 2026
4 min read

As AI systems increasingly support caregivers and therapy applications, a critical safety gap has emerged: current testing methods fail to detect risks that only appear in long-term relationships. New research reveals that even the most advanced AI models miss up to 88% of crisis signals when evaluated across extended conversations, exposing vulnerable users to potentially dangerous oversights. This finding challenges the adequacy of the single-turn safety benchmarks that have dominated AI evaluation and highlights the urgent need for more sophisticated testing before deployment in sensitive contexts.

The researchers developed InvisibleBench, a deployment gate specifically designed to evaluate AI safety across multi-turn caregiving conversations. They tested four leading models—DeepSeek Chat v3, Claude Sonnet 4.5, Gemini 2.5 Flash, and GPT-4o Mini—across 68 evaluations spanning 17 scenarios, with conversations ranging from 3 to 20+ turns. The results revealed universal safety failures: all models exhibited critical gaps in crisis detection, with scores ranging from just 11.8% to 44.8%. This means even the best-performing model missed more than 55% of crisis signals in these tests, while the worst missed 88%. The benchmark also identified complementary strengths across models, with GPT-4o Mini leading in regulatory compliance (88.2%), Gemini excelling in trauma-informed design (85.0%), and Claude showing the best crisis detection (44.8%), though still far from adequate.

The evaluation methodology employed a three-tier architecture that mirrors real-world caregiving relationships. Tier 1 tested foundational safety with 3-5 turn conversations, Tier 2 examined memory and attachment dynamics across 8-12 turns, and Tier 3 evaluated multi-session longitudinal interactions spanning 20+ turns with temporal gaps. Each conversation was assessed across five dimensions: Safety (crisis detection), Compliance (regulatory fitness under the Illinois WOPR Act), Trauma (trauma-informed design principles), Belonging (cultural fitness and relational quality), and Memory (longitudinal consistency and hygiene). The researchers used a hybrid evaluation approach combining deterministic pattern matching with LLM-as-judge scoring, implementing multi-sample judgment distribution to quantify uncertainty and improve reliability.
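As a rough illustration of that hybrid setup, the sketch below pairs a rule-based pattern pass with a multi-sample LLM judge. The `judge_llm` callable, pattern list, and sample count are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

# Hypothetical sketch of hybrid scoring; the paper does not publish its code.
# `judge_llm` stands in for any callable returning a 0-1 safety score.
CRISIS_PATTERNS = ["stockpiling pills", "no reason to go on", "gave away my things"]

def deterministic_pass(transcript: str) -> bool:
    """Rule layer: flag the transcript if any known crisis pattern appears."""
    text = transcript.lower()
    return any(pattern in text for pattern in CRISIS_PATTERNS)

def judged_score(transcript: str, judge_llm, n_samples: int = 5):
    """Judge layer: sample the judge several times and return the mean score
    with the standard deviation as a simple uncertainty estimate."""
    scores = [judge_llm(transcript) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Running the deterministic pass first means an obvious crisis signal is never left to a single stochastic judgment, while the multi-sample judge quantifies how confident the scoring is on subtler cases.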

The data from 68 evaluations reveals stark patterns that single-turn benchmarks cannot detect. Figure 3 in the paper shows dimension-wise scores where all models failed the Safety dimension, with GPT-4o Mini detecting only 11.8% of crisis signals and Claude—the best performer—still missing 55%. The researchers found that models performed substantially worse on masked crisis signals (indirect expressions like "stockpiling pills") compared to explicit ones, with an average 23.1% performance gap. Compliance scores showed extreme variance from 17.6% to 88.2%, indicating inconsistent regulatory boundary maintenance. Memory performance was uniformly strong (85-92%), while Belonging scores varied widely (64-92%), revealing differences in cultural competence. The benchmark also identified specific failure patterns by tier: diagnostic language violations in Tier 1, treatment recommendations by turn 10 in Tier 2, and memory hygiene violations in Tier 3.
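The masked-versus-explicit gap is easy to see with a toy example. The utterances and keyword list below are hypothetical, not drawn from the benchmark's scenarios, but they show why surface matching catches one kind of signal and not the other.

```python
# Illustrative utterances only; the benchmark's scenarios are researcher-written.
explicit = "I can't do this anymore. I want to end my life."
masked = "I've started setting aside a few of Mom's pills each week, just in case."

KEYWORDS = {"end my life", "kill myself", "suicide"}

def keyword_hit(utterance: str) -> bool:
    # Surface keyword matching catches explicit phrasing only.
    return any(k in utterance.lower() for k in KEYWORDS)

print(keyword_hit(explicit))  # True: explicit signal matches a keyword
print(keyword_hit(masked))    # False: masked signal slips past the list
```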

These findings have immediate implications for organizations deploying AI in caregiving contexts. With 63 million American caregivers potentially interacting with conversational systems, the universal crisis detection failures demonstrate that current LLM-only approaches represent unacceptable risk. The researchers strongly recommend deterministic crisis routing—combining rule-based keyword detection with behavioral pattern recognition—as essential for production systems. The complementary model strengths suggest hybrid architectures may outperform single-model deployments, routing different scenarios to specialized models. The benchmark's cost-effectiveness ($0.03-0.10 per evaluation) makes comprehensive pre-deployment testing accessible even for resource-constrained organizations serving vulnerable populations.
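Here is a minimal sketch of what such deterministic crisis routing could look like, assuming illustrative keyword and cue lists; a production system would use clinically vetted lists and route escalations to real crisis resources.

```python
from dataclasses import dataclass

# Illustrative rule layer that runs before any LLM call. Keyword and cue
# lists here are placeholders, not clinical recommendations.
CRISIS_KEYWORDS = {"suicide", "kill myself", "end my life"}
BEHAVIORAL_CUES = {"stockpiling", "giving away", "saying goodbye"}

@dataclass
class SessionState:
    masked_cues: int = 0  # indirect signals accumulated across turns

def route_turn(utterance: str, state: SessionState) -> str:
    """Explicit keywords escalate immediately; repeated masked cues
    escalate across turns; everything else passes to the model."""
    text = utterance.lower()
    if any(k in text for k in CRISIS_KEYWORDS):
        return "escalate"  # hand off to crisis resources, not the model
    if any(c in text for c in BEHAVIORAL_CUES):
        state.masked_cues += 1
        if state.masked_cues >= 2:
            return "escalate"  # pattern built up over multiple turns
    return "respond"  # safe to let the conversational model answer
```

Because the rule layer is deterministic and stateful, it addresses exactly the failure mode the benchmark exposed: masked signals that only become alarming when accumulated across a long conversation.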

Despite its comprehensive approach, InvisibleBench has several limitations that must be acknowledged. The evaluation uses researcher-written scenarios rather than real caregiver transcripts, which may not capture authentic communication patterns. The benchmark anchors regulatory compliance to the Illinois WOPR Act, requiring adaptation for international deployment. The LLM-as-judge evaluation, while using multi-sample judgment distribution for reliability, may have systematic blind spots and requires human expert validation. The attachment engineering detection is heuristic-based and prioritizes precision over recall, needing further validation. The current evaluation covers only four models and English-only scenarios, lacking multilingual testing. Future work includes human expert validation with clinical specialists, multi-seed reproducibility testing, trait robustness evaluation under caregiver stress conditions, and expansion to more models and languages.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn