Artificial intelligence models that process Arabic often struggle with the language's diverse dialects, which are central to everyday communication on social media, in customer service, and in regional interactions. A new study introduces DIALECTAL MMLU, a benchmark designed to evaluate these dialectal capabilities, and finds that even advanced models perform poorly outside Modern Standard Arabic (MSA). This gap affects millions of Arabic speakers who rely on AI for accurate, context-aware responses in their local dialects.
The researchers found that large language models (LLMs) lose significant accuracy on dialectal Arabic compared with MSA or English. Accuracy on dialects such as Syrian, Saudi, and Moroccan was consistently lower, and some models scored close to random guessing. In the default testing setting, where models receive no hints, average scores were lower across the dialects, underscoring a lack of generalization. This finding matters because dialects dominate real-world usage, yet most AI training focuses on MSA, leaving models ill-equipped for practical applications.
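To make "close to random guessing" concrete, the sketch below computes accuracy per language variety and compares it with the chance level. It assumes four answer options per question, as in the original MMLU format, so chance accuracy is 25%. The function name and the toy predictions are hypothetical and are not results from the study.

```python
from collections import defaultdict

# Assumption: four answer options per question, so random guessing scores 25%.
CHANCE_ACCURACY = 1 / 4


def accuracy_by_variety(records):
    """Return accuracy per variety from (variety, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, predicted, gold in records:
        total[variety] += 1
        correct[variety] += int(predicted == gold)
    return {variety: correct[variety] / total[variety] for variety in total}


# Made-up toy predictions for illustration only (not data from the study).
toy_records = [
    ("MSA", "A", "A"), ("MSA", "B", "B"), ("MSA", "C", "D"), ("MSA", "D", "D"),
    ("Syrian", "A", "B"), ("Syrian", "B", "B"), ("Syrian", "C", "A"), ("Syrian", "D", "C"),
]

scores = accuracy_by_variety(toy_records)
for variety, acc in scores.items():
    margin = acc - CHANCE_ACCURACY
    print(f"{variety}: accuracy={acc:.2f}, margin over chance={margin:+.2f}")
```

A margin near zero means the model is doing no better than picking an option at random, which is the situation the study reports for some models on some dialects.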
To build DIALECTAL MMLU, the researchers manually translated 3,135 English multiple-choice questions from the MMLU-Redux benchmark into five Arabic dialects: Egyptian, Moroccan, Saudi, Syrian, and Emirati. This produced 15,675 dialectal question-answer pairs which, together with the English and MSA versions, total 21,945 instances across 32 domains such as history, science, and law. Native speakers handled the translation and review to ensure naturalness and accuracy, and quality checks showed high fidelity: most translations scored 4 or 5 on a 5-point scale. The benchmark was then used to test 19 open-weight LLMs, ranging from 1 billion to 13 billion parameters, under different conditions to assess comprehension and reasoning.
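The instance counts follow directly from the number of base questions and language varieties. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Figures taken from the article; variable names are illustrative.
BASE_QUESTIONS = 3_135
DIALECTS = ["Egyptian", "Moroccan", "Saudi", "Syrian", "Emirati"]
OTHER_VARIETIES = ["English", "MSA"]

# One translated copy of every base question per dialect.
dialectal_pairs = BASE_QUESTIONS * len(DIALECTS)

# Dialectal copies plus the English and MSA versions.
total_instances = BASE_QUESTIONS * (len(DIALECTS) + len(OTHER_VARIETIES))

print(f"Dialectal pairs: {dialectal_pairs:,}")
print(f"Total instances: {total_instances:,}")
```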
Analysis of the results revealed that dialectal performance lagged behind MSA and English, with models like jais-13b-chat and gemma-3-12b-it showing variations but no consistent superiority. In dialect identification tasks, models often failed to recognize the dialect correctly, with Tools-DID achieving the best recall but many models performing worse than random. The oracle setting, where dialect labels were provided, did not significantly improve accuracy, indicating that models cannot leverage this information effectively. Additionally, translating dialectal questions to English via tools like Google Translate sometimes boosted performance, but not to the level of original English questions, suggesting translation introduces errors that hinder understanding.
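The difference between the default and oracle settings can be pictured as a difference in how the prompt is built. The sketch below is a hypothetical illustration: the function name, the wording of the hint, and the prompt layout are assumptions and not the study's actual templates.

```python
def build_prompt(question, options, dialect=None):
    """Format a multiple-choice prompt; add a dialect hint in the oracle setting.

    Hypothetical layout: the study's real prompt templates may differ.
    """
    lines = []
    if dialect is not None:
        # Oracle setting: the model is told which dialect it is reading.
        lines.append(f"The following question is written in {dialect} Arabic.")
    lines.append(question)
    for label, option in zip("ABCD", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Placeholder question text, for illustration only.
question = "<question text in dialect>"
options = ["option 1", "option 2", "option 3", "option 4"]

default_prompt = build_prompt(question, options)
oracle_prompt = build_prompt(question, options, dialect="Syrian")

print(default_prompt)
print("---")
print(oracle_prompt)
```

If a model answered more accurately with the second prompt than with the first, the label would be helping. The study reports no significant improvement of this kind.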
The implications are significant for AI deployment in Arabic-speaking regions, where dialects are essential for education, healthcare, and business. For example, an AI tutor or chatbot that only understands MSA might misinterpret user queries in local dialects, leading to errors in responses. This limitation could exacerbate digital divides, as people in dialect-rich areas may not benefit fully from AI advancements. The study calls for more dialect-aware training data and strategies to make AI more inclusive and effective in diverse linguistic contexts.
Limitations of the research include the benchmark's focus on only five dialects, not covering the full spectrum of Arabic variations, and the use of open-weight models up to 13 billion parameters, which may not reflect the capabilities of larger, proprietary systems. Future work aims to expand to more dialects and tasks, addressing these gaps to foster better AI adaptability.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.