
Llama Outperforms Mamba in Heart Failure Prediction Study


AI Research
March 26, 2026
3 min read

In a rigorous new study from Swedish researchers, the Llama architecture has demonstrated superior performance over the emerging Mamba model for predicting critical outcomes in heart failure patients using electronic health records. The research, conducted on a cohort of 42,820 patients from Sweden's Västra Götaland region, represents the first systematic ablation study comparing modern sequence models in a clinical setting with shorter context lengths. The results reveal that while both advanced architectures outperform traditional Transformers like BERT, Llama consistently achieves the highest predictive discrimination and best calibration across all tasks, challenging assumptions about which models work best with limited patient history data.

Researchers evaluated six sequence models across three architecture classes—Transformers (BERT, XLNet), Transformers++ (ModernBERT, Llama), and Mambas (Mamba, Mamba2)—on three clinically critical one-year prediction tasks: clinical instability after initial heart failure hospitalization, mortality after initial hospitalization, and mortality after the latest hospitalization. The study employed a comprehensive ablation framework examining four key research questions: how token granularity affects predictions, how architectural changes influence performance, how temporal preprocessing techniques impact results, and how models scale with varying data availability. All models were trained on patient sequences derived from EHRs containing diagnoses, vital signs, laboratory results, medications, and procedures, with context lengths limited to 512 tokens to reflect realistic clinical scenarios.
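The paper's exact tokenization scheme isn't reproduced here, but the general idea of flattening an EHR into a fixed-length token sequence can be sketched as follows. The event codes, the `vocab` mapping, and the keep-most-recent truncation policy are illustrative assumptions; only the 512-token context limit comes from the study:

```python
# Sketch: flatten a patient's chronologically ordered EHR events into a
# model-ready token-id sequence. Codes and vocabulary are illustrative,
# not the study's actual scheme.

MAX_CONTEXT = 512  # context length used in the study

def build_sequence(events, vocab, max_len=MAX_CONTEXT):
    """Map (timestamp, code) EHR events to token ids, keeping the most
    recent events when the history exceeds the context window."""
    ids = [vocab.get(code, vocab["[UNK]"]) for _, code in sorted(events)]
    return ids[-max_len:]  # simple truncation; the paper also tests aggregation

vocab = {"[UNK]": 0, "DX:I50.9": 1, "LAB:NT-proBNP:HIGH": 2, "MED:furosemide": 3}
events = [(1, "DX:I50.9"), (2, "LAB:NT-proBNP:HIGH"), (3, "MED:furosemide")]
print(build_sequence(events, vocab))  # → [1, 2, 3]
```

A real pipeline would also insert separator or visit-boundary tokens; the point here is only that each patient becomes one bounded token sequence.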

The results were striking: Llama achieved the highest area under the precision-recall curve (AUPRC) across nearly all ablation studies, followed by Mamba-based architectures, with both significantly outperforming the commonly used BERT model. For clinical instability prediction after initial hospitalization, Llama reached AUPRC values between 0.555 and 0.557 depending on vocabulary settings, compared with 0.535-0.543 for BERT. In mortality prediction tasks, the gap was even more pronounced, with Llama achieving an AUPRC of 0.574 for mortality after initial hospitalization versus 0.540 for BERT. The study also found that tiny configurations of Llama and Mamba often outperformed larger Transformer models, demonstrating more efficient representation learning with fewer parameters.
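AUPRC summarizes how well a model ranks the relatively rare positive outcomes above the negatives, which is why it is favored over accuracy when adverse events are uncommon. A minimal, dependency-free sketch of the metric (the labels and risk scores below are invented for illustration; in practice scikit-learn's `average_precision_score` computes the same quantity):

```python
def average_precision(y_true, y_score):
    """Area under the precision-recall curve via the average-precision
    formulation: mean precision at the rank of each true positive."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Invented example: 1 = adverse outcome within one year
y_true  = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.3, 0.4, 0.2, 0.7, 0.5]
print(round(average_precision(y_true, y_score), 3))  # → 0.7
```

A perfect ranking scores 1.0, while a random ranker scores roughly the positive-class prevalence, so the 0.54-0.57 values above sit well clear of a chance baseline on an imbalanced cohort.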

Perhaps most significantly, the research revealed that Llama and Mamba achieve superior performance using 25% less training data than other models, suggesting greater training efficiency through more effective representation learning. When trained on just 75% of the available data, Llama still outperformed other models trained on the complete dataset. The study also challenged assumptions about optimal data preprocessing, finding that aggregation techniques that compress high-frequency measurements into semantically meaningful summarizations often outperformed simple truncation approaches, particularly for the Llama and Mamba architectures.
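The aggregation idea can be illustrated in miniature: rather than letting repeated vital-sign readings crowd out the rest of the history and then truncating, collapse each vital into a single summary value. The reading names and the mean-based summary below are illustrative assumptions, not the paper's exact preprocessing:

```python
# Sketch: compress repeated measurements of the same vital into one
# summary value rather than truncating the sequence. The mean is one
# possible summarization; the paper's exact scheme may differ.

def aggregate(measurements):
    """Collapse (name, value) readings into a per-vital mean summary."""
    by_name = {}
    for name, value in measurements:
        by_name.setdefault(name, []).append(value)
    return {name: round(sum(vals) / len(vals), 1) for name, vals in by_name.items()}

readings = [("HR", 88), ("HR", 95), ("HR", 102), ("SBP", 118), ("SBP", 124)]
print(aggregate(readings))  # → {'HR': 95.0, 'SBP': 121.0}
```

Five raw tokens become two summary tokens, freeing context budget for diagnoses and medications that truncation would otherwise discard.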

The implications of these findings extend beyond heart failure prediction to the broader field of clinical AI development. The researchers recommend Llama as the preferred architecture for EHR sequence modeling with shorter context lengths (≤512 tokens), specifically suggesting small-sized configurations with context length C=512 as the optimal balance of performance and computational efficiency. They also advocate for aggregation techniques over truncation for temporal preprocessing and emphasize that fine-grained measurement resolutions work best for initial trajectories while detailed diagnosis codes prove more valuable for later trajectories. These evidence-based recommendations provide a crucial starting point for future clinical model development in an increasingly crowded landscape of sequence modeling architectures.
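The recommended setup could be captured in a configuration sketch like the following. Only the Llama architecture choice and the C=512 context length come from the study; the remaining sizes are placeholder values for a "small" model, not the paper's reported hyperparameters:

```python
# Hypothetical configuration for a small EHR sequence model. Only the
# architecture and context_length=512 reflect the study's recommendation;
# the other values are illustrative placeholders.
small_config = {
    "architecture": "llama",
    "context_length": 512,  # C=512, the recommended upper bound
    "hidden_size": 256,     # placeholder
    "num_layers": 6,        # placeholder
    "num_heads": 4,         # placeholder
}
print(small_config["context_length"])  # → 512
```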

Despite its comprehensive nature, the study acknowledges several limitations. The analysis is restricted to a Swedish heart failure cohort and lacks external validation, meaning the findings may not generalize to other disease populations or healthcare systems. The selective curation of laboratories and medications for heart failure may limit the models' ability to leverage out-of-domain knowledge, and the study didn't explore alternative formulations of patient embeddings or incorporate multi-modal data such as cardiac images or clinical notes. Future work should focus on fusing these additional data sources and including information from primary care settings to create more comprehensive patient trajectories and potentially enhance predictive performance further.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn