Language extinction threatens cultural diversity: over 40% of the world's languages are at risk. Toto, spoken by fewer than 1,700 people in West Bengal, India, is one such endangered language, under pressure from shifting demographics and limited documentation. This research addresses the documentation gap by combining linguistics and AI to build a sustainable preservation tool, which makes it relevant to anyone concerned with cultural heritage or with applications of technology in the social sciences.
The main contribution is a trilingual Toto-Bengali-English application that uses AI to document, analyze, and teach the language. Researchers built a morpheme-tagged corpus from fieldwork and used it to train a transformer-based model for translation and learning tasks. This approach captures Toto's distinctive grammatical features, such as inflectional morphology for person-number-gender agreement and tense-aspect-mood distinctions, and yields a digital archive that supports language revitalization.
The methodology involved extensive fieldwork in Totopara, where researchers collected audio and textual data through structured interviews and spontaneous dialogues. They used Zoom recorders and GoPro cameras to capture utterances, which were then translated and annotated for morphemes. The corpus, stored in JSON and TSV formats, was preprocessed with sentence tokenization and data augmentation and used to train a small language model (SLM): a distilled transformer with 2–4 layers and about 5 million parameters, tailored for low-resource settings.
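To make the corpus format concrete, here is a minimal sketch of how one morpheme-tagged entry could be written to JSON and TSV and read back. The field names, the record ID, and the gloss "child" for the stem are assumptions made for illustration; the source does not describe the actual schema, and the Bengali field is left empty rather than guessed.

```python
import csv
import io
import json

# Hypothetical column layout; the paper's real schema is not given in the text.
FIELDS = ["id", "toto", "morphemes", "glosses", "bengali", "english"]

# Illustrative record built from the one plural example quoted in the article.
record = {
    "id": "toto-000001",          # assumed ID scheme
    "toto": "ceŋ-bɪ",
    "morphemes": ["ceŋ", "-bɪ"],
    "glosses": ["child", "PL"],   # "child" is inferred from the translation "children"
    "bengali": "",                # left empty here; to be filled in by annotators
    "english": "children",
}

def to_json_line(rec):
    """Serialize one record as a JSON line, keeping IPA characters readable."""
    return json.dumps(rec, ensure_ascii=False)

def to_tsv(records):
    """Write records as TSV; list fields are joined with single spaces."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    for r in records:
        writer.writerow([
            r["id"], r["toto"],
            " ".join(r["morphemes"]), " ".join(r["glosses"]),
            r["bengali"], r["english"],
        ])
    return buf.getvalue()

def from_tsv(text):
    """Read TSV back into records, splitting the list fields again."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = dict(row)
        row["morphemes"] = row["morphemes"].split()
        row["glosses"] = row["glosses"].split()
        rows.append(row)
    return rows

print(to_json_line(record))
print(to_tsv([record]))
```

Keeping morphemes and glosses as parallel lists means each morpheme stays aligned with its tag in both formats, which is the property a morpheme-tagged corpus needs for later training or analysis.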
Results show that the model, trained on approximately 20,000 sentences, translates accurately between Toto, Bengali, and English, as evaluated by BLEU and human acceptability scores. The corpus also includes detailed morphological analyses that illustrate Toto's grammatical structure: for example, the plural morpheme -bɪ in ceŋ-bɪ (children), and the tense markers -mi (present) and -na (past). The application additionally supports script standardization using Unicode, which improves accessibility for non-native learners and helps ensure the language's digital survival.
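The morphological examples above can be sketched as a small suffix lookup. This is only an illustration under stated assumptions: the inventory holds just the three suffixes quoted in the text, the gloss labels (PL, PRS, PST) are conventional abbreviations chosen here, and the input is assumed to be already hyphen-segmented by an annotator. It is not the paper's analyzer.

```python
# Suffix inventory limited to the three morphemes mentioned in the article.
# Gloss labels are assumed abbreviations: PL = plural, PRS = present, PST = past.
SUFFIX_GLOSSES = {"-bɪ": "PL", "-mi": "PRS", "-na": "PST"}

def gloss(segmented):
    """Split a hyphen-segmented form into its stem and a list of suffix glosses.

    Suffixes that are not in the inventory are tagged "?" instead of guessed.
    """
    stem, *suffixes = segmented.split("-")
    tags = [SUFFIX_GLOSSES.get("-" + s, "?") for s in suffixes]
    return stem, tags

print(gloss("ceŋ-bɪ"))  # → ('ceŋ', ['PL'])
```

Returning "?" for unknown suffixes keeps gaps in the inventory visible, which matters for a language where documentation is still incomplete.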
This work matters because it offers a replicable model for preserving other endangered languages, combining AI with community engagement to foster education and cultural identity. It allows Toto speakers, especially youth, to learn their mother tongue while gaining proficiency in Bengali and English, promoting economic mobility without cultural loss. For the broader public, it highlights how technology can safeguard intangible heritage, supporting global efforts like UNESCO's language initiatives.
Limitations include the small speaker population of about 1,600, which complicates data collection and verification, and the scarcity of orthographic materials, which affects model accuracy. The paper notes that Toto's grammar varies across generations, so standardization is an ongoing task, and that the AI model's performance depends on further expansion of the data to handle nuances effectively.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.