
AI Learns to Think Like a Scientist

A new training method teaches large language models to combine their world knowledge with data-driven evidence, enabling them to outperform traditional algorithms in discovering cause-and-effect relationships.

AI Research
March 26, 2026
4 min read

Large language models like GPT-4 have dazzled the world with their ability to write, code, and answer questions, but a fundamental flaw has lurked beneath the surface: they struggle to understand true cause and effect. When asked to analyze data and determine what causes what, these models often fall back on memorized facts and semantic associations rather than performing genuine scientific reasoning. This limitation has significant implications for applying AI to fields like medicine, economics, and climate science, where distinguishing correlation from causation is critical. A new study from Duke University introduces a framework called CARE that successfully teaches a relatively small AI model to overcome this hurdle, transforming it into a capable causal reasoning expert that can outperform both traditional algorithms and much larger language models.

The researchers discovered that when prompted to perform causal discovery (the task of inferring cause-and-effect relationships from observational data), large language models (LLMs) primarily rely on the semantic meaning of variable names, essentially reciting facts they memorized during training. For instance, when given variables like INCOME, EDUCATION, SMOKING, and LIFE EXPECTANCY, an LLM might correctly state that smoking causes lower life expectancy, but it does so based on its pre-existing knowledge, not by analyzing the provided dataset. This behavior, termed 'causal mimicry,' means the models fail to engage in genuine data-driven analysis. Surprisingly, the study found that simply providing these models with the outputs of established causal algorithms, which are designed to analyze data patterns, did not help and sometimes even decreased their performance. This highlighted a core inability: LLMs could not effectively integrate external algorithmic evidence with their internal knowledge through prompting alone.
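To make the contrast concrete, here is a toy illustration (not from the paper) of the kind of data-driven signal that constraint-based causal algorithms exploit and that a semantically reasoning LLM ignores: in a causal chain X → Y → Z, X and Z are correlated overall but become independent once Y is controlled for. The variable names carry no information; only the statistics do.

```python
import numpy as np

# Simulate a causal chain X -> Y -> Z with synthetic (illustrative) data.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation of a and b after controlling for c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))

marginal = corr(x, z)                 # clearly nonzero: X and Z covary
conditional = partial_corr(x, z, y)   # near zero: X is independent of Z given Y
```

A conditional-independence test like this is the building block of algorithms such as PC; note that renaming the variables changes nothing about the conclusion, which is exactly the robustness the LLMs in the study lacked.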

To address this, the researchers developed CARE (CAusal Reasoning Experts), a supervised fine-tuning framework. The core methodology involves training an LLM on a diverse set of causal problems where it must learn to synthesize two types of information: its own extensive world knowledge (a prior over causal structures) and the outputs of traditional causal algorithms like PC, GES, and LiNGAM, which serve as data-driven evidence. The training data was carefully augmented to counteract the model's biases. For example, variable names were permuted to remove semantic hints, columns in datasets were reordered to eliminate reliance on data presentation, and variables were omitted to simulate incomplete information. By training on thousands of such augmented scenarios, the model learned to reason from statistical patterns independently of misleading cues and to correct algorithmic biases with its knowledge.
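The three augmentations described above can be sketched in a few lines. This is a hedged illustration, not the paper's code; the function name, shapes, and sampling choices are assumptions made for the example.

```python
import random
import numpy as np

def augment(names, data, rng):
    """Return (new_names, new_data) with semantic and positional cues weakened.

    names: list of variable names; data: array of shape (samples, variables).
    """
    names = list(names)
    data = np.asarray(data)

    # 1. Permute variable names to strip semantic hints: labels no longer
    #    match the columns they originally described.
    shuffled = names[:]
    rng.shuffle(shuffled)

    # 2. Reorder columns so the model cannot rely on data presentation.
    order = list(range(data.shape[1]))
    rng.shuffle(order)
    new_names = [shuffled[i] for i in order]
    new_data = data[:, order]

    # 3. Omit one variable to simulate incomplete information.
    drop = rng.randrange(len(new_names))
    kept = [i for i in range(len(new_names)) if i != drop]
    return [new_names[i] for i in kept], new_data[:, kept]

rng = random.Random(42)
names = ["INCOME", "EDUCATION", "SMOKING", "LIFE_EXPECTANCY"]
data = np.arange(20).reshape(5, 4)  # 5 samples, 4 variables (toy data)
aug_names, aug_data = augment(names, data, rng)
```

Applied thousands of times across varied problems, transformations like these force the model to ground its answers in the statistical evidence rather than in the variable names.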

Detailed results on benchmark datasets like ASIA, SURVEY, EARTHQUAKE, and ALARM show that a CARE-finetuned Qwen2.5-1.5B model achieved state-of-the-art performance. On the ASIA dataset, it reached a perfect F1 score of 1.000 with original variable names, significantly outperforming baseline LLMs like GPT-4.1-mini (0.858) and traditional algorithms (0.362). More impressively, under challenging conditions where variable names were permuted to mislead semantic reasoning, CARE maintained an F1 score of 0.460 on ASIA, while baseline models like GPT-4.1-mini dropped to 0.137. On the larger ALARM network with 37 variables, CARE achieved 0.990 with original names and 0.618 with permuted names, demonstrating robust generalization. These results indicate that the fine-tuned model effectively synergizes algorithmic outputs with its knowledge, as shown in Figure 1 of the paper, where standalone approaches often falter but CARE integrates both sources for more accurate causal graphs.
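For readers unfamiliar with how an F1 score applies to causal graphs: a common approach (assumed here for illustration; the paper's exact evaluation protocol may differ) is to score the set of predicted directed edges against the ground-truth edges.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges: a reversed edge counts as both a miss and a false positive."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    tp = len(true_edges & pred_edges)  # correctly predicted directed edges
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

# Toy example with hypothetical edges (not the actual ASIA graph):
true = {("smoking", "cancer"), ("cancer", "xray")}
pred = {("smoking", "cancer"), ("xray", "cancer")}  # one edge reversed
score = edge_f1(true, pred)  # precision 0.5, recall 0.5 -> F1 0.5
```

An F1 of 1.000, as CARE achieved on ASIA with original names, therefore means every directed edge was recovered exactly, with no extras.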

The implications of this research are profound for both AI development and real-world applications. By enabling LLMs to perform reliable causal discovery, CARE opens the door to using AI as a tool for scientific exploration in domains where understanding causality is essential, such as genomics, public health, and social sciences. For instance, in gene network analysis, CARE could receive a graph proposed by a traditional algorithm based on expression data and refine it using its world knowledge, potentially leading to more accurate biological insights. This approach democratizes advanced causal analysis, allowing researchers without deep expertise in causal inference to leverage AI for complex reasoning tasks. The framework's success with a 1.5-billion-parameter model also suggests that scaling to larger models could yield even greater capabilities, making AI a more trustworthy partner in evidence-based decision-making.

Despite its successes, the CARE framework has limitations that point to future research directions. The study used the Qwen2.5-1.5B model due to computational constraints; applying the same fine-tuning methodology to larger LLMs (e.g., 7B or 70B parameters) may further enhance performance but requires significant resources. Additionally, while CARE performed well on networks with up to 37 variables, its scalability to extremely large and complex graphs with hundreds or thousands of nodes remains untested. The process of generating diverse training data through augmentation, though effective, is computationally intensive, which could pose challenges for adapting CARE to new, specialized domains. These limitations, noted in the paper's discussion, highlight the need for continued work on efficiency and generalization in AI-driven causal discovery.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn