
AI Reasoning Falters Outside Science Questions

A new study reveals that AI models trained for scientific reasoning struggle with everyday tasks, highlighting a major gap in artificial intelligence's ability to generalize across different domains.

AI Research
March 27, 2026
3 min read

Artificial intelligence systems that excel at answering science questions often stumble when faced with more common tasks like interpreting charts or understanding everyday scenes. This limitation, uncovered in a recent study, points to a fundamental weakness in how AI models learn to reason across different types of problems. Researchers from the Indian Institute of Technology Bombay tested a popular multimodal reasoning framework on three diverse datasets and found that performance dropped dramatically compared to its strong results on scientific benchmarks. The findings suggest that current AI reasoning systems may be too specialized, raising questions about their real-world applicability.

The researchers discovered that the Multimodal Chain-of-Thought framework, which had achieved 90.45% accuracy on the ScienceQA benchmark, performed significantly worse on other types of reasoning tasks. On ChartQA, which requires numerical reasoning from charts and graphs, accuracy plummeted to just 14.30%. For A-OKVQA, which involves commonsense reasoning about everyday images, accuracy reached 32.00%, while OK-VQA, requiring external world knowledge, achieved 21.31% accuracy. These results, detailed in Table 2 of the paper, demonstrate that the model's reasoning capabilities don't transfer well beyond the structured scientific questions it was originally designed to handle.
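The size of the gap is easiest to see side by side. The figures below are taken directly from the study; the per-benchmark drop relative to ScienceQA is computed here for reference:

```python
# Accuracy (%) reported by the study on each benchmark (Table 2).
results = {
    "ScienceQA": 90.45,  # original in-domain benchmark
    "A-OKVQA": 32.00,    # commonsense reasoning about everyday images
    "OK-VQA": 21.31,     # external world knowledge
    "ChartQA": 14.30,    # numerical reasoning over charts
}

baseline = results["ScienceQA"]
for name, acc in results.items():
    drop = baseline - acc
    print(f"{name:10s} {acc:6.2f}%  (drop vs. ScienceQA: {drop:5.2f} pts)")
```

Even the best out-of-domain result, A-OKVQA, sits more than 58 points below the in-domain baseline.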

The study implemented the two-stage Multimodal-CoT framework proposed by Zhang et al., which separates rationale generation from answer inference. This approach uses a T5-based language model combined with vision features through a gated fusion mechanism. To test cross-domain generalization, the researchers adapted this framework to work with three very different datasets: ChartQA for numerical reasoning from visual charts, OK-VQA for commonsense reasoning requiring external knowledge, and A-OKVQA for ambiguous open-ended reasoning with multiple valid answers. Each dataset required substantial modifications to the original pipeline, including new data loaders, prompt engineering for open-ended answers, and different evaluation metrics since these datasets don't follow ScienceQA's multiple-choice format.
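The gated fusion step described above can be sketched in a few lines. This is a minimal NumPy illustration of the general mechanism (a learned sigmoid gate blending text and vision features), not the authors' exact implementation; the shapes, weight names, and toy inputs are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_text, h_vision, w_gate, b_gate):
    """Blend text and vision features with a learned sigmoid gate.

    h_text, h_vision: (seq_len, d) feature matrices.
    w_gate: (2*d, d) gate weights; b_gate: (d,) bias.
    Shapes and parameterization are illustrative; the paper's
    implementation may differ in detail.
    """
    gate = sigmoid(np.concatenate([h_text, h_vision], axis=-1) @ w_gate + b_gate)
    # Element-wise convex combination: gate decides how much vision to mix in.
    return (1.0 - gate) * h_text + gate * h_vision

# Toy usage with random features.
rng = np.random.default_rng(0)
d = 8
h_t = rng.normal(size=(4, d))
h_v = rng.normal(size=(4, d))
fused = gated_fusion(h_t, h_v, rng.normal(size=(2 * d, d)), np.zeros(d))
print(fused.shape)
```

Because the gate output lies in (0, 1), each fused value is always between the corresponding text and vision feature values, which is what makes the mechanism a soft selector rather than a hard switch.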

The data reveals clear patterns in where the model succeeds and fails. ChartQA proved particularly challenging because it requires numerical reasoning and interpretation of structured visual layouts like bar charts and line plots—tasks the model wasn't optimized for. The researchers noted that vision encoders trained on natural scenes struggle with chart elements like thin lines, axis ticks, and small text labels. A-OKVQA performed better because its format more closely resembles ScienceQA, with richer annotations and questions that rely more on textual cues than visual signals. However, even this 32% accuracy falls far short of the original 90.45% on ScienceQA, indicating that vision encoder mismatches and weak rationale quality contribute to the performance gap.
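ChartQA's scoring reflects this numerical emphasis: the benchmark's standard "relaxed accuracy" accepts a numeric answer that falls within 5% of the gold value. A minimal sketch of that metric (illustrative, not the official evaluation script; the fallback to exact string matching for non-numeric answers is a simplification):

```python
def relaxed_match(pred: str, gold: str, tolerance: float = 0.05) -> bool:
    """Return True if pred matches gold, allowing ~5% numeric tolerance."""
    try:
        p, g = float(pred), float(gold)
        if g == 0.0:
            return p == 0.0
        return abs(p - g) / abs(g) <= tolerance
    except ValueError:
        # Non-numeric answers fall back to (case-insensitive) exact match.
        return pred.strip().lower() == gold.strip().lower()

print(relaxed_match("98", "100"))      # within 5%
print(relaxed_match("94", "100"))      # 6% off
print(relaxed_match("Paris", "paris"))
```

Even with this tolerance, the model reached only 14.30%, underlining that the failures are in reading the charts, not in borderline rounding.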

These findings matter because they reveal a critical limitation in current AI reasoning systems: they often perform well on narrow, specialized tasks but fail to generalize to the diverse tasks humans handle daily. The study identifies specific failure modes, such as difficulty with numerical computation, poor handling of open-ended answers, and struggles with visual elements outside natural scenes. For practical applications, this means AI systems designed for scientific reasoning may need substantial retraining or architectural changes to work effectively in other domains like business analytics, everyday assistance, or educational tools that require interpreting various visual formats.
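One concrete source of difficulty with open-ended answers is the scoring itself. OK-VQA and the direct-answer setting of A-OKVQA use the VQA-style soft accuracy, which credits a prediction by how many human annotators gave the same answer, capped at full credit for three matches. A minimal sketch, assuming answers have already been normalized:

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """VQA-style soft accuracy: min(#annotators agreeing / 3, 1.0)."""
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Ten annotators: seven said "umbrella", three said "parasol".
answers = ["umbrella"] * 7 + ["parasol"] * 3
print(vqa_soft_accuracy("umbrella", answers))
print(vqa_soft_accuracy("parasol", answers))
print(vqa_soft_accuracy("hat", answers))
```

A model tuned to emit one of a few multiple-choice letters, as Multimodal-CoT was for ScienceQA, gets no partial credit here unless its free-form answer lands exactly on an annotator's phrasing, which compounds the reported accuracy drop.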

The research acknowledges several limitations that affect how the results should be interpreted. The experiments were conducted with limited computational resources, using CPU-only training that prevented full-scale fine-tuning. The vision encoders used weren't optimized for the specific image types in the new datasets, particularly charts with geometric layouts. Additionally, the model's original design for multiple-choice questions made adaptation to open-ended formats challenging. These constraints mean the performance gaps might be partially addressable with more resources, but they still highlight fundamental challenges in cross-domain generalization that researchers need to address in future work.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn