In the race to apply artificial intelligence to every corner of finance, foundation models like large language models (LLMs) have emerged as the shiny new tools promising to revolutionize everything from algorithmic trading to fraud detection. Their allure is undeniable: the ability to tackle complex tasks without extensive, domain-specific engineering, offering a one-size-fits-all solution to problems that have traditionally required specialized expertise. Yet, a groundbreaking new study from researchers at Wrocław University of Science and Technology, Tooploox, and Opera delivers a sobering reality check. When it comes to the critical, high-stakes task of predicting corporate bankruptcy—a linchpin of financial risk assessment for investors and regulators—these general-purpose models are being decisively outperformed by classical, specialized machine learning models. The research, presented at the NeurIPS 2025 Workshop on Generative AI in Finance, provides the first systematic comparison of foundation models against established baselines on large-scale, real-world financial data, revealing significant gaps in performance, reliability, and practicality.
The study's methodology was meticulously designed to mirror the harsh realities of financial forecasting. The researchers constructed five massive datasets from over one million financial statement records of companies across the Visegrád Group (Czech Republic, Hungary, Poland, Slovakia) spanning 2006 to 2021. Each dataset represented a different prediction horizon, from immediate risk (0 years ahead) up to four years into the future, with the target variable defined by stringent financial distress criteria including negative equity, negative EBITDA relative to assets, and a critically low current ratio. This created a severe class imbalance, with bankruptcy rates below 1%, accurately reflecting the challenging conditions of real-world finance. To ensure a fair fight, all models were evaluated on the same stratified test subset of 20,000 samples, with the foundation models pitted against five classical baselines: logistic regression, a multi-layer perceptron, and the gradient boosting powerhouses XGBoost, LightGBM, and CatBoost.
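To make the labeling concrete, here is a minimal sketch of how such a distress target could be derived from a financial statement. The criteria named in the study (negative equity, negative EBITDA relative to assets, critically low current ratio) are used directly; the `current_ratio_floor` of 1.0 and the OR-combination of the criteria are illustrative assumptions, not details published in this summary.

```python
def is_distressed(equity, ebitda, total_assets,
                  current_assets, current_liabilities,
                  current_ratio_floor=1.0):
    """Label one financial statement as distressed (True) or healthy (False).

    Mirrors the study's stated criteria. The 1.0 floor for the current
    ratio and the OR-combination are assumptions for illustration only.
    """
    negative_equity = equity < 0
    negative_ebitda_to_assets = (ebitda / total_assets) < 0
    low_current_ratio = (current_assets / current_liabilities) < current_ratio_floor
    return negative_equity or negative_ebitda_to_assets or low_current_ratio

# A firm with negative equity is flagged even if liquidity looks fine:
print(is_distressed(equity=-5, ebitda=10, total_assets=100,
                    current_assets=50, current_liabilities=20))   # True
# A solidly capitalized, liquid, profitable firm is not:
print(is_distressed(equity=100, ebitda=10, total_assets=100,
                    current_assets=50, current_liabilities=20))   # False
```

With distress rates below 1%, a rule like this produces the severe class imbalance the researchers deliberately preserved in their five datasets.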
The foundation models under scrutiny were two prominent representatives: Llama-3.3-70B-Instruct, accessed via API to process serialized financial data through natural-language prompts, and TabPFN, a transformer specifically designed for tabular data. The results, measured by ROC-AUC and the crucial F1-score for imbalanced data, were unequivocal. Across all five prediction horizons, the classical models, particularly gradient boosting, maintained a commanding lead. For instance, at the four-year horizon, XGBoost achieved a ROC-AUC of 0.891 and an F1-score of 0.024, while CatBoost posted 0.883 and 0.062. In stark contrast, TabPFN managed only 0.771 and 0.024, and Llama-3.3 languished at 0.782 and 0.012. The performance gap was not marginal; it was a chasm, with traditional models showing stable, superior accuracy even as prediction difficulty increased with time.
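Both metrics matter here for different reasons, and a short self-contained sketch (not taken from the paper's code) makes them concrete: ROC-AUC is the probability that a randomly chosen bankrupt firm receives a higher risk score than a randomly chosen healthy one, while F1 balances precision and recall on the rare positive class, where plain accuracy would be misleadingly high.

```python
def roc_auc(y_true, scores):
    """ROC-AUC via its rank interpretation (Mann-Whitney U):
    the fraction of positive/negative pairs where the positive
    scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall on the positive
    (bankrupt) class; unaffected by the mass of true negatives."""
    tp = sum(y == 1 and p == 1 for y, p in zip(y_true, y_pred))
    fp = sum(y == 0 and p == 1 for y, p in zip(y_true, y_pred))
    fn = sum(y == 1 and p == 0 for y, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy data: the lone bankrupt firm outranks every healthy one.
print(roc_auc([1, 0, 0, 0], [0.9, 0.1, 0.5, 0.2]))      # 1.0
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))             # 0.5
```

Note how a model can post a strong ROC-AUC (good ranking) while its F1 stays tiny, exactly the pattern in the reported numbers: with positives under 1%, even a well-ranked classifier makes many false positives at any fixed threshold.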
Beyond raw performance numbers, the study uncovered profound implications for the practical deployment of foundation models in risk-sensitive financial environments. The researchers identified a critical flaw in LLM-based approaches: their self-reported probability estimates are fundamentally unreliable for risk assessment. Analysis revealed that Llama-3.3's outputs were poorly calibrated and discretized, clustering stubbornly around fixed values like 0.1, 0.2, 0.7, and 0.9 instead of providing the smooth, nuanced probability distributions essential for informed financial decision-making. This 'degenerate distribution' undermines the very premise of using these models for calibrated risk scoring. Furthermore, the computational economics simply don't add up. The timing analysis was damning: XGBoost processed 20,000 samples on a standard CPU in 0.007 seconds, achieving a staggering throughput of over 2.8 million samples per second. TabPFN, requiring a high-end NVIDIA A100 GPU, took 23.78 seconds, and the API calls for Llama-3.3 dragged on for nearly 1.5 hours. This massive overhead, coupled with inferior performance, presents an insurmountable barrier to business justification.
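The degenerate-distribution problem is easy to screen for in practice. The sketch below (an illustration, not the paper's analysis code) tallies a model's predicted probabilities after rounding: a calibrated scorer over thousands of firms should spread across many values, while an LLM-style output collapses onto a handful of round numbers.

```python
from collections import Counter

def probability_support(probs, ndigits=2):
    """Histogram of rounded probability outputs. A support of only a
    few spikes (e.g. 0.1, 0.2, 0.7, 0.9) flags the discretized,
    poorly calibrated behavior the study observed in Llama-3.3."""
    return Counter(round(p, ndigits) for p in probs)

# Illustrative LLM-like output: only two distinct values survive rounding.
llm_like = [0.1] * 6 + [0.9] * 4
support = probability_support(llm_like)
print(len(support))      # 2 distinct values -> degenerate
print(support[0.1])      # 6
```

A smooth scorer run through the same check would show dozens of distinct bins, which is what downstream uses like expected-loss pricing or regulatory risk tiers actually require.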
The study acknowledges its limitations, primarily centered on the LLM approach. The researchers used the top-tier Llama-3.3-70B-Instruct available at the time but note that newer 'reasoning' models like GPT-5, Claude Opus, or DeepSeek-V3 might deliver stronger performance. Additionally, API-level access restricted them to the probability estimates returned by the model; direct access to internal weights and logits could potentially yield more reliable confidence measures. However, these limitations do not diminish the core finding: for the structured, numerically intensive task of bankruptcy prediction on large-scale datasets, foundation models in their current form are not yet viable replacements for purpose-built tools. The research suggests future directions could explore hybrid multimodal approaches or LLMs with enhanced reasoning capabilities, but for now, the path to reliable financial forecasting remains firmly in the domain of specialized, classical machine learning.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.