Large language models (LLMs) like GPT-4o and Qwen3-235B are transforming artificial intelligence, but their performance falters when faced with India's diverse cultural and linguistic contexts. A new benchmark, BhashaBench V1, reveals that these models excel in English-centric tasks but struggle with domains critical to Indian society, such as Ayurveda, where top models achieve only 59.74% accuracy compared to 76.49% in legal domains. This gap highlights a pressing need for AI systems that understand local knowledge, as over 100 million people in India rely on agriculture alone, and billions of transactions occur annually in sectors like finance and healthcare, where misunderstandings could have real-world consequences.
The researchers developed BhashaBench V1 to evaluate LLMs on India-specific knowledge across four key domains: Agriculture (BBK), Legal (BBL), Finance (BBF), and Ayurveda (BBA). They compiled 74,166 question-answer pairs from authentic sources, including government exams and certifications, with 70.8% in English and 29.2% in Hindi. The benchmark spans over 90 subdomains and 500 topics, such as Panchakarma in Ayurveda and Cyber Law in legal studies, ensuring it reflects ground-level challenges faced by practitioners. For example, in agriculture, questions cover crop science and soil health, while finance includes India-specific topics like UPI transactions and rural economics.
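To make the benchmark's structure concrete, here is a minimal sketch of what a single evaluation item might look like as a data record. The field names and the sample question are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one BhashaBench V1 item. Field names are
# assumptions for illustration; the real dataset's columns may differ.
@dataclass
class BenchmarkItem:
    question: str
    options: list[str]   # multiple-choice options
    answer: str          # correct option label, e.g. "A"
    domain: str          # "Agriculture" | "Legal" | "Finance" | "Ayurveda"
    subdomain: str       # e.g. "Panchakarma", "Cyber Law"
    language: str        # "en" or "hi"
    difficulty: str      # "easy" | "medium" | "hard"

# Example item (content invented for illustration only).
item = BenchmarkItem(
    question="Which procedure is part of Panchakarma?",
    options=["Vamana", "Pranayama", "Asana", "Dhyana"],
    answer="A",
    domain="Ayurveda",
    subdomain="Panchakarma",
    language="en",
    difficulty="medium",
)
```

A schema like this makes the reported breakdowns (by domain, language, and difficulty) straightforward to compute by grouping records on those fields.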
To create the benchmark, the team used a meticulous data collection and processing pipeline. They sourced questions from publicly available exams, applied optical character recognition (OCR) with the Surya engine, chosen for its accuracy on Indic scripts, and structured the data into multiple question formats, including multiple-choice, fill-in-the-blank, and reading comprehension. Each question was validated for authenticity and categorized by subdomain and difficulty (easy, medium, or hard). This approach ensured the dataset captures nuanced, culturally relevant knowledge without fabricating or altering original content.
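The structuring-and-validation step can be sketched as follows. This is a simplified illustration, assuming a plain-text question format after OCR; the OCR stage itself (Surya, in the paper) is out of scope here, and the parsing rules are assumptions rather than the authors' actual tooling.

```python
import re

# Sample of the kind of plain text that might come out of OCR on an exam
# paper (the question content is invented for illustration).
SAMPLE = """Q. Which soil type is best suited for cotton cultivation?
(A) Alluvial soil
(B) Black soil
(C) Laterite soil
(D) Sandy soil
Answer: B"""

def parse_question(block: str) -> dict:
    """Structure one raw question block into a dict."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    question = lines[0].removeprefix("Q.").strip()
    options, answer = [], None
    for line in lines[1:]:
        m = re.match(r"\(([A-D])\)\s*(.+)", line)
        if m:
            options.append(m.group(2))
        elif line.startswith("Answer:"):
            answer = line.split(":", 1)[1].strip()
    return {"question": question, "options": options, "answer": answer}

def is_valid(q: dict) -> bool:
    # Reject malformed items: require a question, exactly four options,
    # and an answer key that points at one of them.
    return bool(q["question"]) and len(q["options"]) == 4 and q["answer"] in "ABCD"

parsed = parse_question(SAMPLE)
assert is_valid(parsed)
```

Validation filters of this kind keep OCR artifacts (dropped options, garbled answer keys) out of the final dataset without rewriting any original content.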
Results from testing various LLMs show significant disparities in performance. In zero-shot evaluations, the top-performing model, Qwen3-235B-A22B-Instruct, achieved an overall accuracy of 67.25%, but scores dropped in specific areas; for instance, it scored 91.43% in Agricultural Research but only 44.19% on hard Ayurveda questions. Smaller models, such as those with under 4 billion parameters, performed poorly, with accuracies as low as 28% in some subdomains. The study also found that model size alone does not guarantee success: architectural choices and training data shape outcomes, and instruction-tuned models generally performed better on bilingual tasks.
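The per-subdomain breakdowns above come from a standard zero-shot evaluation loop, which can be sketched in a few lines. The item fields and the predictor interface here are assumptions for illustration, not the paper's actual harness; in a real run, the predictor would prompt an LLM and parse its chosen option letter.

```python
from collections import defaultdict

def evaluate(items, predict):
    """Score a predictor and report accuracy grouped by subdomain."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["subdomain"]] += 1
        if predict(item["question"], item["options"]) == item["answer"]:
            correct[item["subdomain"]] += 1
    # e.g. {"Agricultural Research": 0.91, "Panchakarma": 0.44, ...}
    return {sub: correct[sub] / total[sub] for sub in total}

# Tiny demo with invented items and a dummy predictor that always
# answers "A" (stands in for a real zero-shot LLM call).
items = [
    {"question": "q1", "options": ["w", "x", "y", "z"],
     "answer": "A", "subdomain": "Soil Science"},
    {"question": "q2", "options": ["w", "x", "y", "z"],
     "answer": "B", "subdomain": "Soil Science"},
]
scores = evaluate(items, lambda question, options: "A")  # {"Soil Science": 0.5}
```

Because accuracy is aggregated per subdomain rather than overall, this kind of loop surfaces exactly the gaps the study reports, such as strong Agricultural Research scores alongside weak hard-Ayurveda scores from the same model.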
This research matters because it addresses a global issue of AI inclusivity. As LLMs are deployed in education, healthcare, and legal services, their inability to handle non-Western contexts could exacerbate inequalities. For example, farmers using AI for crop advice or patients seeking Ayurvedic treatments may receive inaccurate information if models lack relevant knowledge. The benchmark enables developers to identify and fix these gaps, fostering AI that serves diverse populations effectively.
Limitations of the study include the benchmark's focus on only Hindi and English, overlooking India's hundreds of other languages and dialects. Additionally, it does not cover all Indian knowledge systems, such as indigenous crafts or regional governance, and may reflect biases in exam-based sources. Future work could expand language coverage and incorporate more grassroots perspectives to improve fairness and relevance.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn