AIResearch
Ethics

AI Struggles to Understand Tunisian Arabic

A new study reveals that large language models often fail to grasp the Tunisian dialect, risking cultural exclusion and pushing millions to use foreign languages for basic AI interactions.

AI Research
March 26, 2026
4 min read

As artificial intelligence becomes increasingly integrated into daily life through devices like smartphones and smartwatches, its ability to understand human language is crucial for natural interaction. However, a new study highlights a significant gap: many large language models (LLMs) struggle to comprehend low-resource languages such as Tunisian Arabic, spoken by over twelve million people. This oversight threatens to exclude Tunisians from fully engaging with AI in their native tongue, potentially forcing them to use French or English instead. Such a shift could undermine the preservation of the Tunisian dialect, impact literacy, and influence younger generations to favor foreign languages, raising concerns about cultural identity and technological equity.

Researchers evaluated several popular LLMs on three key tasks using a novel dataset of 100 manually curated examples from social media, each including Tunisian Arabic in Latin script (Tunizi), its Arabic script equivalent, an English translation, and a sentiment label. For transliteration, which converts Tunizi to Arabic script, Gemini 2.5 Flash performed best with a Character Error Rate (CER) of 0.15, a Levenshtein distance of 2.89, and a Longest Common Subsequence (LCS) similarity of 0.88, indicating strong phonetic and orthographic accuracy. In contrast, models like Qwen 3 and Mistral 8×22B lagged significantly, with CERs of 3.16 and 1.29, respectively, and high Levenshtein distances, revealing difficulties with informal dialectal variations and code-switching. These shortcomings underscore the challenges posed by the lack of a standardized written form and the regional diversity of Tunisian Arabic.
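To make the transliteration metrics concrete, here is a minimal, illustrative Python sketch of CER, Levenshtein distance, and LCS similarity. This is not the study's evaluation code; the normalization choices (dividing CER by reference length, LCS by the longer string) are common conventions and are assumptions here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character edits (insert, delete, substitute)
    to turn ref into hyp, via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference
    length. Can exceed 1.0 when the hypothesis is much longer."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def lcs_similarity(ref: str, hyp: str) -> float:
    """Length of the longest common subsequence, normalized by the
    longer of the two strings, so 1.0 means identical."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == hyp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n, 1)
```

A CER above 1.0 (as reported for Qwen 3) simply means the model produced more character errors than the reference has characters, typically from hallucinated or greatly lengthened output.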

The translation task, which assessed how well models convert Tunisian Arabic to English, further exposed performance disparities. Gemini 2.5 Flash again led with a METEOR score of 0.45 and a BERTScore F1 of 0.91, showing it could preserve semantic meaning beyond surface-level word matches. Claude Sonnet 4.5 and GPT-4o Mini followed closely, with METEOR scores of 0.37 and BERTScore F1 values of 0.90, though they sometimes normalized dialectal expressions into Modern Standard Arabic. On the lower end, Grok 3, Mistral 8×22B, and Qwen 3 had METEOR scores around 0.09–0.10 and BERTScore F1 below 0.86, often producing mistranslations or overly literal outputs due to limited exposure to North-African dialect data in their training.
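For intuition about what the METEOR score measures, the sketch below implements a simplified, exact-match variant: unigram precision and recall combined into a recall-weighted F-mean, discounted by a fragmentation penalty over contiguous matched chunks. Real METEOR also uses stemming and synonym matching, which are omitted here, so this is an approximation for illustration only.

```python
def simple_meteor(reference: str, hypothesis: str) -> float:
    """Simplified METEOR: exact unigram matches only, recall-weighted
    F-mean (10PR / (R + 9P)), and a chunk-based fragmentation penalty."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Greedily align each hypothesis token to the earliest unused
    # identical reference token.
    used = [False] * len(ref)
    alignment = []  # (hyp_index, ref_index) pairs, in hypothesis order
    for i, tok in enumerate(hyp):
        for j, rtok in enumerate(ref):
            if not used[j] and tok == rtok:
                used[j] = True
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs where adjacent hypothesis tokens map to
    # adjacent reference tokens; more chunks means more fragmentation.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

The recall weighting explains why overly literal, word-dropping translations (like those from the lowest-scoring models) are punished harshly: missing reference words hurt more than extra hypothesis words.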

Sentiment analysis, which classified comments as positive, negative, or neutral, revealed that GPT-4o Mini achieved the highest accuracy at 0.60 and a weighted F1 score of 0.60, with strong performance in identifying positive sentiment (F1 of 0.69). Claude Sonnet 4.5 also performed competitively, excelling in neutral sentiment classification with an F1 of 0.64. However, models like Gemini 2.5 Flash and Qwen 3 showed lower overall scores, with Cohen's Kappa values as low as 0.20 and 0.14, indicating poor agreement with ground truth labels and biases toward certain classes. This variability highlights how model architecture and pretraining data impact the ability to handle low-resource dialects, even among top-performing systems.
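The two headline classification metrics here, weighted F1 and Cohen's Kappa, can be computed in a few lines of plain Python. This is an illustrative reimplementation (in practice one would use a library such as scikit-learn), shown to clarify why a low Kappa can coexist with moderate accuracy: Kappa subtracts the agreement expected by chance.

```python
from collections import Counter

def weighted_f1(y_true: list, y_pred: list) -> float:
    """Per-class F1 averaged with weights equal to each class's
    support (count in y_true)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += support[c] * (2 * tp / denom if denom else 0.0)
    return total / len(y_true)

def cohens_kappa(y_true: list, y_pred: list) -> float:
    """Observed agreement corrected for chance: (p_o - p_e) / (1 - p_e).
    A model biased toward one class gets little credit for lucky hits."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    tc, pc = Counter(y_true), Counter(y_pred)
    p_e = sum(tc[c] * pc[c] for c in tc) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

Kappa values of 0.14 to 0.20, as reported for the weakest models, indicate only slight agreement beyond chance, consistent with the class biases the study observed.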

The implications of these findings extend beyond technical performance to cultural and social dimensions. If AI systems cannot adequately understand Tunisian Arabic, users may be pushed toward foreign languages for digital interactions, potentially eroding linguistic heritage and accessibility. The study calls for increased efforts to include low-resource languages in AI development, such as creating larger datasets and engaging in national collaborations to ensure linguistic inclusivity. While the dataset used is modest, serving as a pilot-scale initiative, it provides a reproducible baseline for future work, emphasizing the need for dialect-specific benchmarks to guide model improvements and foster technological equity.

Limitations of the study include the dataset's small size of 100 samples, which means results are indicative rather than statistically conclusive, and the reliance on a single annotator, which ensures internal consistency but limits inter-annotator validation. Additionally, the evaluation was conducted on model versions as of November 9, 2025, and performance may evolve with updates, highlighting the importance of ongoing benchmarking. Future research should expand the dataset to support robust comparisons and explore factors like model architecture and training data diversity to better address the needs of underrepresented dialects in AI systems.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn