
New AI Tool Brings Shona Language Into the Digital Age

A rule-based morphological analyzer achieves over 90% accuracy in processing Shona, a Bantu language spoken by millions, bridging a critical gap in NLP for under-resourced African languages.

AI Research
March 27, 2026
4 min read

A new computational tool is making strides in bringing the Shona language, spoken by over 10 million people in Zimbabwe and neighboring regions, into the realm of modern artificial intelligence. Despite rapid advances in multilingual natural language processing (NLP), many African languages, including Shona, have remained severely under-served with respect to morphological analysis and language-aware tooling. This gap contributes to digital exclusion, where native speakers are underrepresented in conversational AI, educational platforms, and translation systems, highlighting the urgent need for accessible computational resources.

Researchers have developed Shona spaCy, an open-source, rule-based morphological pipeline that achieves over 90% accuracy in part-of-speech tagging and 88% accuracy in morphological feature analysis for Shona. The system combines a curated JSON lexicon with linguistically driven morphological rules to model key grammatical elements such as noun-class prefixes, verbal subject-concords, tense and aspect markers, ideophones, and clitics. By integrating these into token-level annotations, including lemma, part-of-speech, and morph_features, the tool provides full transparency in its linguistic decisions, enabling robust analysis of both formal and informal Shona text.

The methodology behind Shona spaCy is a hybrid approach that merges a manually curated JSON lexicon with computationally defined grammatical rules derived from classical Shona linguistic studies. The lexicon, containing approximately 2,500 entries, covers core nouns and verbs, closed-class items like pronouns and conjunctions, and ideophones (expressive forms conveying sensory intensity). For tokens not found in the lexicon, the system applies rule-based modules that detect morphological features such as noun-class prefixes (e.g., mu- for Class 1 humans, va- for Class 2 plurals), verb subject concords (e.g., ndi- for first person singular), and derivational suffixes like -is- for causative forms. The pipeline processes text through tokenization, lexicon lookup, rule-based analysis, and feature encoding, outputting structured annotations that align with established Shona grammatical conventions.
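The lexicon-first, rule-fallback flow described above can be sketched in a few lines of Python. This is an illustrative simplification, not the package's actual code: the tiny lexicon, the prefix tables, and the analyze function are invented for the example, and real Shona morphology involves far more rules and ambiguity handling.

```python
# Illustrative sketch of a lexicon-first, rule-fallback morphological
# analyzer in the spirit of Shona spaCy. The lexicon entries and rule
# tables here are toy examples, not the package's actual data.

# Curated lexicon: surface form -> (lemma, POS, morphological features)
LEXICON = {
    "munhu": ("munhu", "NOUN", {"NounClass": "1"}),   # "person"
    "vanhu": ("munhu", "NOUN", {"NounClass": "2"}),   # "people" (plural)
}

# Rule tables consulted only when a token is absent from the lexicon.
NOUN_CLASS_PREFIXES = {"mu": "1", "va": "2"}   # mu- Class 1, va- Class 2
SUBJECT_CONCORDS = {"ndi": "1Sg"}              # ndi- first person singular

def analyze(token: str) -> tuple[str, str, dict]:
    """Return (lemma, pos, features): lexicon lookup first, then rules."""
    if token in LEXICON:
        return LEXICON[token]
    feats: dict = {}
    # Noun-class prefix detection for unseen nominal forms.
    for prefix, noun_class in NOUN_CLASS_PREFIXES.items():
        if token.startswith(prefix):
            feats["NounClass"] = noun_class
            return (token[len(prefix):], "NOUN", feats)
    # Subject-concord detection for unseen verbal forms.
    for concord, person in SUBJECT_CONCORDS.items():
        if token.startswith(concord):
            feats["Person"] = person
            if "is" in token:          # crude check for causative -is-
                feats["Derivation"] = "Caus"
            return (token[len(concord):], "VERB", feats)
    return (token, "X", feats)         # unanalyzable: counted as unknown
```

A real pipeline would order these checks carefully and resolve prefix ambiguity (e.g., the Class 9/10 overlap the evaluation discusses), but the lexicon-then-rules control flow is the core idea.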

Evaluation of Shona spaCy on a dataset of 1,500 sentences (approximately 14,200 tokens) from sources like Shona Wikipedia and local storytelling corpora revealed strong performance metrics. The system achieved a lexical coverage of 62.4%, meaning tokens directly matched in the JSON lexicon, and a rule coverage of 94.1%, indicating successful analysis of unseen or derived forms via rule-based logic. Overall part-of-speech accuracy reached 90.7%, while morphological accuracy, covering noun class, tense, and derivational features, was 88.3%. Noun class identification specifically attained 92.5% accuracy, with errors primarily involving ambiguity between Class 9 and Class 10 prefixes due to overlapping phonological realizations, as noted in linguistic literature.
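The coverage and accuracy figures above follow from simple token-level bookkeeping: each token is resolved by the lexicon, by a rule, or not at all, and predictions are compared against gold labels. The sketch below shows one way to compute such metrics; the outcome labels and toy lists are invented for illustration, not the paper's evaluation data.

```python
# Token-level evaluation bookkeeping, as typically used for coverage
# and accuracy figures like those reported for Shona spaCy.

def coverage_metrics(outcomes: list[str]) -> dict[str, float]:
    """outcomes: per-token analysis source, one of 'lexicon' | 'rule' | 'unknown'.

    Lexical coverage counts direct lexicon matches; rule coverage counts
    all successfully analyzed tokens (lexicon or rule); the unknown rate
    is the remainder.
    """
    n = len(outcomes)
    lexicon_hits = outcomes.count("lexicon")
    rule_hits = outcomes.count("rule")
    return {
        "lexical_coverage": lexicon_hits / n,
        "rule_coverage": (lexicon_hits + rule_hits) / n,
        "unknown_rate": outcomes.count("unknown") / n,
    }

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Token-level accuracy, e.g. for POS tags or noun-class labels."""
    assert len(predicted) == len(gold), "prediction/gold length mismatch"
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Note that under this accounting the rule coverage (94.1%) and the unknown token rate (5.9%) sum to 100%, consistent with the reported figures.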

The implications of this work extend beyond technical accuracy: it advances digital inclusion by enabling Shona speakers to access language-aware AI systems, and it supports linguistic preservation through explicit encoding of Shona grammar in machine-readable form. By providing a template for morphological analysis of other under-resourced Bantu languages, such as Ndebele, Kalanga, and Swahili, the project contributes to computational decolonization: creating AI systems that reflect and respect African linguistic diversity. The tool is available as an open-source Python package via pip install shona_spacy, with code on GitHub and distribution on PyPI, facilitating integration into downstream NLP applications like named entity recognition and machine translation.

Despite its successes, Shona spaCy faces limitations, including an unknown-token rate of 5.9%: some tokens could not be analyzed due to gaps in the lexicon or rule set. The system's reliance on a modest lexicon of about 2,500 entries may limit coverage of highly specialized or emerging vocabulary, and errors in noun-class detection, particularly with ambiguous prefixes, highlight the difficulty of fully capturing Shona's phonological nuances. Additionally, while the tool handles informal text and code-mixing to some extent, further development is needed to improve robustness across diverse digital communication styles, as everyday speech often blends Shona with English, producing expressions that conventional models misclassify. These limitations underscore the ongoing need for expanded datasets and iterative refinement to enhance the analyzer's applicability in real-world scenarios.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn