In high-performance computing and AI systems, GPUs are critical components that can fail abruptly without obvious warning signs, leading to costly downtime and data loss. A new study reveals that many of these failures occur quietly, with GPUs becoming unavailable at the driver or interconnect level—often described as "fallen off the bus" events—while traditional monitoring metrics such as temperature and power consumption remain normal until the device disappears. The research, conducted on production telemetry from GPU nodes at the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), introduces an observability-aware early-warning framework that detects these silent failures by analyzing structural changes in the monitoring data itself, such as the disappearance of device metrics and the degradation of scrape payload integrity. The findings highlight a significant gap in conventional anomaly detection pipelines, which rely on numeric precursors and often miss detachment-class failures that manifest primarily through observability collapse.
The key finding from this work is that GPU detachment failures exhibit minimal or no numeric precursor in standard telemetry, making them invisible to value-based monitoring approaches. Instead, the dominant observable signal is structural: the sudden loss of GPU device metrics, increased scrape latency, sample loss, and time-series gaps. The researchers analyzed an operator-curated incident catalog spanning January 2025 to February 2026, focusing on seven GPU detachment incidents, five of which had complete telemetry archives for alignment. In all processed cases, the alignment time t0 was derived from scrape payload collapse, and GPU telemetry disappeared partially or completely at or immediately after this point, confirming that observability degradation is the primary manifestation of these failures. This challenges the assumption that gradual numeric drift always precedes hardware faults and underscores the need for new detection strategies.
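The structural signals described above (metric disappearance, scrape latency, time-series gaps) can be summarized per window before any detector is applied. The following is a minimal illustrative sketch, not the paper's pipeline: the record fields (`gpu_metrics_seen`, `scrape_latency_s`), the gap heuristic, and the scrape interval are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical per-scrape record; field names are illustrative,
# not the actual GWDG telemetry schema.
@dataclass
class Scrape:
    ts: int                  # scrape timestamp (seconds)
    gpu_metrics_seen: int    # GPU series present in this scrape payload
    scrape_latency_s: float  # time the scrape took

def structural_features(scrapes, expected_gpu_metrics, expected_interval_s=60):
    """Summarize one window of scrapes into structural observability features:
    metric coverage, mean scrape latency, and count of time-series gaps."""
    if not scrapes:
        # No scrapes at all is itself a gap: total observability loss.
        return {"coverage": 0.0, "mean_latency_s": 0.0, "gaps": 1}
    coverage = sum(s.gpu_metrics_seen for s in scrapes) / (
        expected_gpu_metrics * len(scrapes))
    mean_latency = sum(s.scrape_latency_s for s in scrapes) / len(scrapes)
    # Count a gap when scrape-to-scrape spacing is well beyond the
    # expected interval (assumed heuristic: more than twice the interval).
    gaps = sum(
        1 for a, b in zip(scrapes, scrapes[1:])
        if b.ts - a.ts > 2 * expected_interval_s)
    return {"coverage": coverage, "mean_latency_s": mean_latency, "gaps": gaps}
```

A detachment-class event would show up here as coverage collapsing toward zero together with rising latency and gap counts, even while the remaining numeric values look normal.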
The researchers employed a reproducible pipeline to extract and analyze GPU telemetry, node signals, and monitoring indicators from large-scale HPC telemetry. The framework jointly models utilization-aware thermal-drift signatures in GPU telemetry and monitoring-pipeline degradation indicators, such as increases in scrape latency and the disappearance of device metrics. Data were collected from production GPU nodes at GWDG, where GPU, node, monitoring, and scheduler logs could be correlated, allowing weak-event construction from an incident catalog with coarse failure categories and day-level timestamps. The evaluation used fixed windowing with a window length of 60 minutes and a stride of 10 minutes, and detectors such as robust z-score scoring, Isolation Forest, and One-Class SVM were compared under a fixed alert budget of 1%. This approach enabled assessment of early-warning lead time and cross-node generalization without precise component-level failure labels.
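As one concrete baseline from that list, robust z-score scoring with fixed windowing and a fixed alert budget can be sketched as follows. This is a generic illustration under assumptions of mine (one sample per minute, these helper names), not the paper's implementation; the Isolation Forest and One-Class SVM baselines would slot in where the scoring function is.

```python
import numpy as np

def robust_z(scores_train, x):
    """Robust z-score: deviation from the median, scaled by the median
    absolute deviation (x1.4826 to be comparable to a standard deviation)."""
    med = np.median(scores_train)
    mad = np.median(np.abs(scores_train - med)) * 1.4826 + 1e-12
    return np.abs(x - med) / mad

def window_starts(n_samples, win=60, stride=10):
    """Fixed windowing: start indices for 60-minute windows advanced by a
    10-minute stride, assuming a 1-sample-per-minute series."""
    return list(range(0, n_samples - win + 1, stride))

def alerts_under_budget(window_scores, budget=0.01):
    """Fixed alert budget: flag only the top `budget` fraction of windows,
    so every detector is compared at the same alert rate."""
    thr = np.quantile(window_scores, 1.0 - budget)
    return window_scores > thr
```

The fixed budget is what makes the lead-time comparison fair: a detector cannot buy earlier warnings simply by alerting more often.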
Results from the baseline experiments show that joint modeling of GPU and observability features increases early-warning lead time compared to GPU-only detection. Under a fixed 1% alert budget, the joint Isolation Forest achieved the highest average lead time of 7 windows (one window corresponds to 10 minutes) and a maximum lead of 29 windows, while GPU-only detectors often triggered close to or after event onset, with median lead times of zero. For example, in detachment incidents on nodes such as ggpu142 and ggpu149, forensic alignment revealed that GPU telemetry disappearance and scrape degradation occurred near t0, well before operator detection via health checks. The study also noted recurrence of detachment failures on specific physical nodes (ggpu142 experienced two incidents within a month, and ggpu149 three over ten months), indicating a host-level hazard that could inform proactive maintenance.
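The lead-time metric used above is straightforward to compute once per-window alert flags and the onset window are known. A minimal sketch, under my own assumption about how lead is counted (first alert at or before onset, floored at zero):

```python
def lead_time_windows(alert_flags, t0_idx):
    """Early-warning lead time in windows: distance from the first alert back
    to the event-onset window t0_idx. One window corresponds to 10 minutes
    under the paper's 60-minute/10-minute-stride windowing.

    Returns 0 if the first alert fires at or after onset (no early warning),
    and None if no alert fires at all (a miss)."""
    for i, flag in enumerate(alert_flags):
        if flag:
            return max(t0_idx - i, 0)
    return None
```

For instance, an alert two windows before onset yields a lead of 2 windows, i.e. roughly 20 minutes of warning; a detector that only fires after onset scores 0, matching the median-zero behavior reported for the GPU-only baselines.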
The implications of this research are substantial for operational reliability in AI and HPC infrastructures, where GPU failures can disrupt critical workloads. By treating observability degradation as a first-class anomaly signal, rather than dismissing it as monitoring noise, operators can detect silent failures earlier, potentially reducing downtime and improving resource management. The framework's dataset-agnostic design allows for generalization to other hardware-intensive environments, though the current analysis is limited to the GWDG dataset because of its unique correlation of telemetry with scheduler-level node state transitions. Future work could extend the methodology to multi-archive slices and broader cross-system validation as richer failure-context datasets become available, enhancing its applicability across diverse computing platforms.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.