AIResearch

AI Struggles with Math and Science in Indian Languages

Vision-language models drop up to 25 percentage points in accuracy when reasoning in Hindi, Tamil, and other Indian languages, raising concerns for educational equity.

AI Research
April 01, 2026
3 min read

A new study reveals that artificial intelligence systems designed to understand and reason about visual information perform significantly worse when asked questions in Indian languages than in English. This finding has immediate implications for India's education system, where over 260 million school-age children study predominantly in regional languages like Hindi, Tamil, and Bengali. As educational technology platforms begin integrating these AI models for tutoring and assessment, the research suggests that non-English-speaking students could face systematic disadvantages if current models are deployed without improvement.

The researchers conducted a comprehensive audit of eight vision-language models, including GPT-4o and several open-source alternatives, testing their ability to solve mathematical, scientific, and spatial reasoning problems across seven languages. They translated 980 questions from established English benchmarks into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi, keeping the images identical across languages to isolate language effects. The results showed accuracy drops ranging from 9.8 to 25 percentage points when switching from English to an Indian language, with Dravidian languages like Tamil and Kannada suffering up to 13.2 percentage points more than Indo-Aryan languages like Hindi and Bengali.

To create this benchmark, the researchers used IndicTrans2, an open-source translation system, to convert questions from MathVista, ScienceQA, and MMMU datasets into the six Indian languages. They verified translation quality by cross-checking 50 samples per language with Gemini 2.0 Flash, achieving agreement scores between 0.79 and 0.84. The models were evaluated through 68,600 inference records, with additional tests that removed images or added chain-of-thought prompting to understand how the AI processes information differently across languages.
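The core measurement behind these results is simple: score each inference record as correct or not, aggregate per language, and subtract from the English baseline. As a minimal sketch (the record format and sample numbers below are illustrative, not the paper's released data):

```python
# Hypothetical sketch of the aggregation step: per-language accuracy from
# (language, is_correct) inference records, and the percentage-point drop
# relative to the English baseline. Sample data is made up for illustration.
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of (language_code, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, ok in records:
        total[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / total[lang] for lang in total}

def drop_vs_english(acc, baseline="en"):
    """Percentage-point drop of each language relative to the baseline."""
    base = acc[baseline]
    return {lang: round((base - a) * 100, 1)
            for lang, a in acc.items() if lang != baseline}

records = [("en", True), ("en", True), ("en", False), ("en", True),
           ("hi", True), ("hi", False), ("hi", True), ("hi", False),
           ("ta", False), ("ta", True), ("ta", False), ("ta", False)]
acc = accuracy_by_language(records)
print(drop_vs_english(acc))  # → {'hi': 25.0, 'ta': 50.0}
```

Run over the study's 68,600 records, this kind of aggregation is what yields the 9.8 to 25 percentage-point drops reported above.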

The data reveals clear patterns in model performance. GPT-4o achieved the highest average accuracy on Indian languages at 43.2%, while Gemma 3-27B showed the smallest average drop from English at 9.8 percentage points. However, even these top performers still exhibited significant gaps, particularly on mathematical reasoning tasks where Qwen3-VL-30B lost 29.3 percentage points on MathVista problems. The study also found that chain-of-thought prompting, which typically helps models reason step-by-step in English, actually degraded performance in most Indian languages, with Bengali accuracy falling 14.4 percentage points and Kannada dropping 11.4 percentage points.
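The chain-of-thought effect can be expressed the same way: the percentage-point change from direct answering to step-by-step prompting, computed per language. A small sketch, where the accuracy values are invented and only the Bengali and Kannada deltas are chosen to echo the figures reported above:

```python
# Hypothetical sketch: per-language chain-of-thought (CoT) effect, measured
# as the percentage-point change from direct prompting to CoT prompting.
# The accuracy values here are illustrative, not from the study.
def cot_delta(direct_acc, cot_acc):
    """Positive = CoT helps; negative = CoT hurts."""
    return {lang: round((cot_acc[lang] - direct_acc[lang]) * 100, 1)
            for lang in direct_acc}

direct = {"en": 0.62, "bn": 0.48, "kn": 0.41}
cot = {"en": 0.66, "bn": 0.336, "kn": 0.296}
print(cot_delta(direct, cot))  # → {'en': 4.0, 'bn': -14.4, 'kn': -11.4}
```

A positive English delta alongside negative Indian-language deltas, as in this toy example, is exactly the asymmetric pattern the study reports.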

These findings have practical implications for educational equity in India, where the National Education Policy 2020 mandates mother-tongue instruction through Grade 5. The research indicates that deploying current AI models as tutoring tools in regional-medium schools could disadvantage non-English students, especially in Dravidian-language schools where accuracy drops reach 12 to 28 percentage points depending on the model. The study also challenges assumptions about multilingual AI development, showing that models like Aya-Vision-8B, which was specifically trained on 23 languages including several Indian languages, still dropped 28.5 percentage points on Dravidian languages and had the lowest cross-lingual consistency at 67.1%.

The research has several limitations that future work should address. The translations rely on machine translation; native-speaker validation on larger samples is still needed to ensure accuracy, particularly for technical terms. Only six of India's 22 scheduled languages were tested, leaving low-resource languages like Odia and Assamese unexamined. Additionally, high answer-extraction failure rates for some models mean their true reasoning capability might be higher than the reported numbers suggest. The study serves as a call for mandatory multilingual reasoning evaluation before vision-language models are deployed in educational settings, with the translated benchmark and all model outputs made publicly available for further research.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn