AI Finds Hidden Word Patterns in Ancient Languages

TL;DR

A machine learning tool scans Sulawesi languages for unusual vocabulary, upending origin theories and exposing a distinct sound-based fingerprint.

A new study uses machine learning to uncover hidden patterns in the basic vocabulary of Sulawesi languages, offering fresh insights into linguistic history without relying on traditional comparative methods. Researchers analyzed 1,357 lexical forms from six Austronesian languages in Sulawesi, identifying a subset of words that resist classification as inherited from Proto-Austronesian. This approach addresses a long-standing question in linguistics: whether non-conforming vocabulary represents remnants of pre-Austronesian substrate languages or independent innovations. By applying computational techniques to publicly available data from the Austronesian Basic Vocabulary Database, the study provides a scalable tool that complements expert analysis, potentially reshaping how linguists detect and interpret lexical anomalies across language families.

The key finding is that machine learning can distinguish non-mainstream vocabulary from inherited Austronesian words based solely on phonological features, achieving an area under the curve (AUC) of 0.763 after removing confounding factors. This performance indicates moderate reliability, with the model identifying a "phonological fingerprint" characterized by longer forms, more consonant clusters, higher rates of glottal stops, and fewer canonical Austronesian prefixes. For example, non-mainstream candidates average 2.57 syllables compared to 2.29 for inherited forms, and they show glottal stops in 32.0% of cases versus 10.5% in consensus Austronesian vocabulary. The fingerprint is robust across languages, as confirmed by leave-one-language-out validation where five of six languages achieved AUCs of 0.65 or higher, and it generalizes to additional Sulawesi languages with a mean predicted substrate rate of 0.606.
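To make the fingerprint concrete, here is a minimal Python sketch of how form length, syllable count, consonant clusters, and glottal stops might be computed from an orthographic form. It is not the study's feature extractor: the function name, the vowel and consonant inventories, the treatment of "ng" as a single sound, and the use of ' or ʔ as glottal-stop spellings are all assumptions made for illustration, and the example words are made up.

```python
import re

# Assumed inventories for a rough orthographic analysis (not from the study).
VOWEL_RUN = re.compile(r"[aeiou]+")
CONSONANT_CLUSTER = re.compile(r"[bcdfghjklmnpqrstvwxyzŋ]{2,}")
GLOTTAL_MARKS = ("'", "ʔ")  # assumed spellings of the glottal stop


def phonological_features(form: str) -> dict:
    """Return rough phonological features for one orthographic word form."""
    # Treat the digraph "ng" as a single sound so it is not counted as a cluster.
    word = form.lower().replace("ng", "ŋ")
    return {
        "length": len(word),
        # Each unbroken run of vowels is counted as one syllable nucleus.
        "syllables": len(VOWEL_RUN.findall(word)),
        # A cluster is two or more adjacent consonant letters.
        "clusters": len(CONSONANT_CLUSTER.findall(word)),
        "has_glottal": any(mark in word for mark in GLOTTAL_MARKS),
    }


# Hypothetical forms, for illustration only.
for example in ("mata", "tandra'", "bunga"):
    print(example, phonological_features(example))
```

Counting vowel runs treats diphthongs as one syllable, and working from spelling rather than phonemic transcription will miscount some forms, so a sketch like this is only a rough proxy for the features described above.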

The methodology combines a rule-based cognate subtraction approach with a machine learning classifier trained exclusively on phonological features, avoiding circular reasoning. First, the rule-based method identified 438 candidate substrate forms (26.5% of the corpus) by subtracting known Austronesian cognates and loanwords, then cross-checking with Proto-Austronesian reconstructions. Next, an XGBoost classifier was trained on 26 features, including form length, consonant cluster count, glottal stop presence, and semantic domain, while excluding all cognacy data to ensure independence. Validation involved stratified 5-fold cross-validation and leave-one-language-out testing, with SHAP analysis used to interpret feature importance. This two-model design separates a circular baseline that includes cognacy features from the genuine experiment, establishing that phonological properties alone carry detectable signal about a form's status.
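Leave-one-language-out testing can be illustrated with a short, dependency-free Python sketch. Everything here is an assumption for illustration: the rows are made up, the language names are placeholders, and a toy scorer that only learns whether candidates tend to be longer stands in for the XGBoost classifier. The AUC is computed as the share of positive-negative pairs ranked correctly, with ties counting half.

```python
def leave_one_language_out(rows):
    """Yield (language, train_rows, test_rows), holding out one language at a time."""
    for held_out in sorted({row["language"] for row in rows}):
        train = [row for row in rows if row["language"] != held_out]
        test = [row for row in rows if row["language"] == held_out]
        yield held_out, train, test


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_length_scorer(train):
    """Toy stand-in for a classifier: learns only whether candidates run longer."""
    pos = [row["length"] for row in train if row["label"] == 1]
    neg = [row["length"] for row in train if row["label"] == 0]
    sign = 1 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1
    return lambda row: sign * row["length"]


# Made-up rows: (language, form length, label); 1 = candidate, 0 = inherited.
ROWS = [
    {"language": lang, "length": length, "label": label}
    for lang, length, label in [
        ("lang_a", 7, 1), ("lang_a", 4, 0), ("lang_a", 5, 0),
        ("lang_b", 6, 1), ("lang_b", 8, 1), ("lang_b", 4, 0),
        ("lang_c", 5, 1), ("lang_c", 6, 0), ("lang_c", 3, 0),
    ]
]

results = {}
for held_out, train, test in leave_one_language_out(ROWS):
    scorer = fit_length_scorer(train)  # the held-out language is never seen here
    results[held_out] = auc([r["label"] for r in test], [scorer(r) for r in test])
print(results)
```

The point of the design is that the scorer for each fold is fitted without any forms from the held-out language, so a good score there suggests the signal is not specific to one language's spelling or vocabulary.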

Results from the analysis reveal that 266 high-confidence non-mainstream candidates were identified through cross-consensus between rule-based and machine learning predictions, with substantial agreement (Cohen's κ = 0.611). These candidates are over-represented in action verbs, comprising 44.0% of the consensus substrate forms, suggesting this semantic domain may be more vulnerable to substrate retention or innovation. However, phonological clustering of these candidates showed no evidence of shared etymological descent, with silhouette scores near zero and a cross-linguistic cognate test yielding a non-significant p-value of 0.569. This indicates that the non-mainstream vocabulary likely results from parallel independent innovations rather than remnants of a single pre-Austronesian language layer, challenging traditional substrate interpretations.
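For readers unfamiliar with Cohen's κ, the following Python sketch shows how agreement between two labelings is corrected for chance, using κ = (p_o - p_e) / (1 - p_e), and how a consensus set can be taken as the forms both approaches flag. The label lists are made up and the variable names are hypothetical; this is not the study's data or code.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labeled independently at their own base rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up predictions for eight forms: 1 = flagged as candidate, 0 = not flagged.
rule_based = [1, 1, 0, 0, 1, 0, 0, 1]
classifier = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(rule_based, classifier)
# Consensus candidates: indices of forms flagged by both approaches.
consensus = [
    i for i, (r, m) in enumerate(zip(rule_based, classifier)) if r == 1 and m == 1
]
print(kappa, consensus)
```

In this toy case the two labelings agree on six of eight forms, but half of that agreement is expected by chance, which is why κ comes out lower than the raw agreement rate.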

The implications of this research extend beyond Sulawesi, offering a template for computational non-conformity detection in other language families and contact situations. The study demonstrates that machine learning can serve as a scalable screening tool to flag phonologically anomalous forms for specialist attention, complementing but not replacing the comparative method. For instance, the analysis identified potential false positives like numeral compounds (e.g., "Fifty" and "Twenty") that mimic the substrate fingerprint due to morphological complexity, highlighting areas for refinement. Geographic expansion to 16 additional languages confirmed regional patterning, with Sulawesi languages showing higher predicted substrate rates than Western Indonesian languages, though outliers like Acehnese with its known Chamic heritage were also detected. This approach could accelerate research in historical linguistics by automating initial analyses of large datasets.

Limitations of the study include reliance on orthographic rather than standardized phonemic transcriptions, though approximate IPA conversion tests showed negligible performance impact. The small sample size of 1,357 forms and label noise from the Positive-Unlabeled learning problem mean the results should be interpreted as suggestive, with some non-mainstream candidates possibly being Austronesian forms missing cognacy data. Additionally, the lack of morphological decomposition means features like consonant clusters may arise from productive morphology rather than root-internal patterns, and the geographic scope is limited to Sulawesi, though expansion tests provide encouraging generalization. Future work could involve full IPA conversion, expansion beyond Swadesh lists, and targeted fieldwork on high-confidence candidates to distinguish genuine pre-Austronesian remnants from innovations.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
