
Small AI Models Excel at Turkish Search Tasks

A new benchmark shows that compact late-interaction AI models can outperform larger systems in Turkish information retrieval, offering faster and more accurate search for a morphologically complex language.

AI Research
March 26, 2026
3 min read

Artificial intelligence systems that power search engines and question-answering tools have advanced rapidly for languages like English, but they often struggle with languages that have rich grammatical structures and fewer digital resources. Turkish, whose complex morphology allows a single word to carry many suffixes that each convey meaning, presents a particular challenge. Researchers from NewMind AI have introduced TurkColBERT, the first comprehensive benchmark comparing different AI approaches for Turkish information retrieval, revealing that smaller, more efficient models can deliver superior performance in this context.

The key finding from the study is that late-interaction models, which analyze text at the token level rather than compressing it into a single vector, consistently outperform dense encoder models across various Turkish retrieval tasks. For instance, ColmmBERT-base-TR, a late-interaction model with 310 million parameters, achieved the highest mean Average Precision (mAP) on four out of five benchmark datasets. On SciFact-TR, a scientific fact-checking dataset, it reached 70.0% Recall@10, surpassing dense baselines like TurkEmbed4Retrieval (60.5%) and turkish-e5-large (63.3%) by significant margins. This demonstrates that token-level matching is especially effective for capturing the nuanced semantics of Turkish, where word forms change extensively through inflection.
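To make the distinction concrete, here is a minimal Python sketch contrasting dense single-vector scoring with ColBERT-style MaxSim late interaction. The embeddings are random placeholders, not outputs of the models in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize rows so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy token embeddings: a 3-token query and a 5-token document in 8 dims
query_tokens = normalize(rng.normal(size=(3, 8)))
doc_tokens = normalize(rng.normal(size=(5, 8)))

def dense_score(q, d):
    """Dense retrieval: pool each side into a single vector, then compare."""
    return float(normalize(q.mean(axis=0, keepdims=True))
                 @ normalize(d.mean(axis=0, keepdims=True)).T)

def late_interaction_score(q, d):
    """ColBERT-style MaxSim: every query token keeps its own embedding,
    matches its most similar document token, and the maxima are summed."""
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens) similarity matrix
    return float(sim.max(axis=1).sum())

print(dense_score(query_tokens, doc_tokens))
print(late_interaction_score(query_tokens, doc_tokens))
```

Because each query token is matched independently, late interaction can reward a document that contains a differently inflected form of just one query word, which is exactly the situation Turkish morphology creates.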

The methodology involved a two-stage adaptation pipeline to tailor models for Turkish. In the first stage, pretrained encoders such as mmBERT, Ettin, and BERT-Hash variants were fine-tuned on Turkish Natural Language Inference (all-nli-tr) and semantic similarity (STSb-tr) tasks to improve their understanding of Turkish sentence-level meaning. For example, mmBERT-small showed a Spearman correlation of 0.78 on STSb-tr after this phase. In the second stage, these adapted models were converted into ColBERT-style retrievers using the PyLate framework, trained on the Turkish adaptation of MS MARCO-TR. This process enabled the models to perform late-interaction retrieval, where token embeddings are preserved for detailed matching during search.
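The Spearman correlation used to evaluate the first stage is simply the Pearson correlation of ranks: it asks whether the model orders sentence pairs the same way human similarity ratings do. A small self-contained sketch, with made-up scores rather than STSb-tr data:

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical model similarity scores vs. human ratings for 5 sentence pairs;
# the ordering agrees perfectly, so the correlation is 1.0.
model_scores = [0.9, 0.2, 0.7, 0.4, 0.8]
human_ratings = [4.5, 1.0, 3.5, 2.0, 4.0]
print(spearman(model_scores, human_ratings))  # -> 1.0
```

A score of 0.78, as reported for mmBERT-small, means the model's similarity ordering agrees strongly but not perfectly with human judgments.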

Analysis across five Turkish BEIR datasets—SciFact-TR, Arguana-TR, Fiqa-TR, Scidocs-TR, and NFCorpus-TR—highlighted the efficiency and effectiveness of late-interaction models. ColmmBERT-small-TR, with 140 million parameters, achieved 70.3% Recall@10 and 55.4% mAP on SciFact-TR, nearly matching the performance of its larger counterpart while using less than half the computational resources. The study also evaluated indexing algorithms for production readiness: MUVERA+Rerank was 3.33 times faster than PLAID on average, with query latency as low as 0.54 ms for ColmmBERT-base-TR, and it offered a +1.7% relative mAP gain. Figure 1 illustrates the trade-offs, showing that higher encoding dimensions in MUVERA lead to faster retrieval but slightly lower NDCG@100, with MUVERA+Rerank recovering near-PLAID quality at 4–5 times the speed.
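The speed of MUVERA+Rerank comes from a two-stage design: a fast scan over fixed-size single vectors produces a shortlist, and exact MaxSim scoring reranks only that shortlist. The toy sketch below illustrates the idea; note that a plain mean-pooled vector stands in for MUVERA's fixed-dimensional encoding, which the real method constructs quite differently:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(q, d):
    """Exact late-interaction score: sum of per-query-token best matches."""
    return float((q @ d.T).max(axis=1).sum())

# Toy corpus: 100 documents, each with 6 token embeddings in 16 dims
docs = [normalize(rng.normal(size=(6, 16))) for _ in range(100)]
query = normalize(rng.normal(size=(4, 16)))

# Stage 1: collapse each document to one vector for a cheap full-corpus scan
# (a simplified stand-in for MUVERA's fixed-dimensional encoding)
doc_fde = normalize(np.stack([d.mean(axis=0) for d in docs]))
query_fde = normalize(query.mean(axis=0))
candidates = np.argsort(doc_fde @ query_fde)[::-1][:10]  # top-10 shortlist

# Stage 2: exact MaxSim rerank, applied only to the shortlist
reranked = sorted(candidates, key=lambda i: maxsim(query, docs[i]), reverse=True)
print(reranked[:3])
```

The expensive token-level scoring touches only 10 of 100 documents here, which is why the rerank stage can recover near-PLAID quality at a fraction of the latency.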

The implications of this research are significant for real-world applications: it enables more accurate and faster search systems for Turkish speakers, particularly in specialized domains like finance, science, and nutrition. For example, on Fiqa-TR, a financial question-answering dataset, ColmmBERT-base-TR achieved a 19.5% mAP, a substantial improvement over dense models. The ultra-compact colbert-hash-nano-tr model, with only 1.0 million parameters, retained over 71% of the average mAP of the 600-million parameter turkish-e5-large dense encoder, demonstrating that high-quality Turkish retrieval is feasible even on resource-constrained devices. This could benefit industries relying on Turkish-language data analysis, from academic research to customer support.
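For readers unfamiliar with the metrics quoted throughout, Recall@10 and average precision (the per-query quantity behind mAP) are straightforward to compute. A short sketch over a hypothetical ranking:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def average_precision(retrieved, relevant):
    """Average of precision@i over the ranks i where a relevant doc appears,
    divided by the total number of relevant docs; mAP is the mean over queries."""
    hits, precisions = 0, []
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Hypothetical ranking: relevant docs "a" and "b" sit at ranks 1 and 3
retrieved = ["a", "x", "b", "y", "z"]
relevant = {"a", "b"}
print(recall_at_k(retrieved, relevant, k=10))   # -> 1.0
print(average_precision(retrieved, relevant))   # (1/1 + 2/3) / 2 = 0.833...
```

Recall@10 rewards simply finding the relevant documents anywhere in the top ten, while average precision additionally rewards ranking them early.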

However, the study has limitations, as noted in the paper. It relies on moderately sized datasets with up to 50,000 documents and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions. Larger-scale evaluations of MUVERA indexing are necessary to assess scalability in production systems. Additionally, the benchmarks are adapted from English, potentially missing nuances of native Turkish content. Future work should explore web-scale testing, morphology-aware tokenization, and the development of native Turkish benchmarks to further advance the field.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn