AIResearch
Science

AI's Reasoning Shortcuts Exposed in Multilingual Math Tests

A new study reveals that AI models often rely on the final step of their reasoning process, especially when they get answers wrong, raising concerns about their trustworthiness across different languages.

AI Research
March 27, 2026
4 min read

A new investigation into how artificial intelligence models reason through problems has uncovered significant flaws in their transparency and reliability, particularly when operating in languages beyond English. Researchers from the University of Groningen applied specialized analysis techniques to a popular AI model, Qwen2.5 1.5B-Instruct, testing it on multilingual math word problems from the MGSM benchmark. They found that while structured reasoning prompts improve accuracy for high-resource languages like English and French, the benefits diminish dramatically for low-resource languages such as Bengali, where accuracy remained below 4%. More critically, the study revealed that the AI often assigns disproportionate importance to the final reasoning step, especially when its answer is incorrect, suggesting that the model's step-by-step explanations may not faithfully reflect its actual decision-making process.

The key finding centers on how the AI attributes importance to different parts of its reasoning chain. Using a method called ContextCite to analyze step-level contributions, the researchers discovered that across all five tested languages—English, French, German, Chinese, and Bengali—the final reasoning step consistently received the highest attribution score. This pattern was most pronounced in incorrect predictions. For example, in Bengali, the slope of importance increase from first to last step was 12.73 for wrong answers compared to 9.31 for correct ones, indicating a steeper reliance on the final step when the model errs. In English, the slope rose from 5.24 for correct answers to 6.54 for incorrect ones. These patterns suggest that the AI may be using the reasoning chain as a superficial justification rather than a genuine guide to its conclusions, with errors often tied to overemphasis on flawed final calculations.
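The slope metric described above can be reproduced with a simple least-squares line fit over per-step attribution scores. A minimal sketch, assuming the per-step scores have already been extracted; the values in the example are hypothetical, not taken from the paper:

```python
import numpy as np

def attribution_slope(step_scores):
    """Slope of a least-squares line fit to per-step attribution scores.

    A steeper positive slope means importance is increasingly
    concentrated on the later reasoning steps.
    """
    steps = np.arange(1, len(step_scores) + 1)
    slope, _intercept = np.polyfit(steps, step_scores, deg=1)
    return slope

# Hypothetical attribution scores for a 4-step chain, rising toward the end
slope = attribution_slope([0.5, 1.0, 2.0, 4.5])
print(f"{slope:.2f}")  # → 1.30
```

Comparing this slope between correct and incorrect generations is what yields figures like Bengali's 12.73 versus 9.31.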

To conduct this analysis, the team employed a combination of prompting techniques and attribution methods. They used structured generation to force the Qwen model to produce Chain-of-Thought (CoT) reasoning—a step-by-step explanation—before giving a final answer, based on eight few-shot examples per language from the MGSM training set. For step-level attribution, they applied ContextCite, which systematically ablates reasoning steps and measures their impact on the final answer probability, fitting a linear model to assign importance scores. For token-level analysis, they used the Inseq toolkit with saliency attribution, computing gradients to see how sensitive predictions are to individual tokens. The experiments also included controlled perturbations: negating key verbs in questions or adding irrelevant distractor sentences to test robustness, for example changing "He buys 2 more cans" to "He does not buy 2 more cans" in English problems.
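The ablate-and-refit idea behind ContextCite can be illustrated in a few lines. This is a simplified sketch, not the actual ContextCite implementation: `answer_logprob` is a hypothetical stand-in for a model call that scores the final answer given only the retained reasoning steps.

```python
import random
import numpy as np

def ablation_attribution(steps, answer_logprob, n_samples=64, seed=0):
    """ContextCite-style step attribution (simplified sketch).

    Randomly ablates subsets of reasoning steps, records the model's
    score for the final answer on each ablated context, then fits a
    linear surrogate whose coefficients approximate each step's
    contribution to the answer.
    """
    rng = random.Random(seed)
    n = len(steps)
    masks, targets = [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n)]     # 1 = keep step
        kept = [s for s, keep in zip(steps, mask) if keep]
        masks.append(mask)
        targets.append(answer_logprob(kept))
    # Least-squares fit of score ~ mask; coefficient i ≈ importance of step i
    X = np.c_[np.array(masks, dtype=float), np.ones(n_samples)]
    y = np.array(targets, dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[:n]
```

With a toy scorer that is exactly additive in the steps, the fitted coefficients recover each step's weight; on a real model the linear fit is only an approximation of the ablation effects.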

The results showed clear disparities in performance and attribution patterns across languages. In terms of accuracy, structured CoT prompting boosted scores significantly for high-resource languages: English reached 59.2%, French 48.8%, German 37.6%, and Chinese 35.2%, but Bengali only improved to 3.6% from a baseline below 3%. Token-level analysis via Inseq heat maps confirmed that step importance generally increased toward the end of the reasoning chain, regardless of experimental condition. However, under perturbations, accuracy dropped—for instance, in the negation condition, it fell to 25% across languages, with the model often failing to adjust its reasoning despite the changed questions. The study also noted that longer reasoning chains, particularly those with five or more steps, frequently correlated with incorrect answers, especially in challenging conditions like negation, hinting that verbosity may signal unreliability.

These findings have important implications for the real-world use of AI in multilingual contexts. The reduced effectiveness of CoT prompting for low-resource languages like Bengali highlights ongoing inequities in AI development, where models trained predominantly on data from languages like English struggle with others due to tokenization inefficiencies and limited vocabulary. For everyday users, this means AI tools may provide less reliable explanations or solutions in non-Latin script languages, potentially leading to mistrust or errors in applications like educational assistants or customer service bots. The overemphasis on final steps in incorrect generations raises concerns about interpretability: if users rely on AI reasoning to understand decisions, they might be misled by explanations that don't accurately represent the model's internal process, undermining transparency in critical areas like healthcare or finance.

Despite these insights, the study has several limitations that caution against overgeneralization. The MGSM dataset consists of relatively simple grade-school math problems, which may not fully test the reasoning abilities of more advanced models. The conclusions are based on a single 1.5-billion-parameter model, and patterns might differ in larger architectures. Token-level attribution was conducted on a small case study with only eight items per language, limiting its statistical power. Additionally, structured generation success rates varied by language—from 54% in English to 98% in Chinese—affecting the comparability of analyzed samples. The reliance on specific attribution tools like ContextCite and Inseq also means the findings should be validated with alternative techniques to ensure robustness.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn