AIResearch

AI Models Show How Training Data Shapes Language Understanding

A new study reveals that multilingual AI models encode languages in distinct patterns from the very first layer, with training data composition leaving a permanent structural imprint that affects fairness and performance.

AI Research
March 26, 2026
3 min read

Multilingual large language models like Llama-3.1 and Qwen2.5 have become essential tools for global communication, capable of handling over 50 languages with human-level performance. However, a persistent performance gap exists: models often perform 10–30% worse in non-English languages than in English, a disparity commonly blamed on the English-centric nature of most training data. A new study investigates whether this data imbalance merely affects output accuracy or fundamentally reshapes how these models internally represent languages, with implications for designing fairer and more effective AI systems.

The researchers discovered that language information becomes sharply separated in the very first transformer block of these models, with a dramatic 76.4% increase in classification accuracy from the initial embedding layer to the first layer. The languages then remain almost fully linearly separable throughout all 268 layers analyzed, with linear probes achieving 99.8% accuracy on average and only a 0.58% gap compared to more complex nonlinear probes. This indicates that languages are encoded as distinct, accessible directions in the model's internal space rather than as complex patterns requiring deep processing.

To uncover how this structure emerges, the team conducted a comprehensive probing study across six multilingual LLMs, including English-centric models like Llama-3.1-8B and Chinese-inclusive models like Qwen2.5-7B. They trained linear and nonlinear classifiers on hidden representations from each layer, using sentences in five languages: English, Spanish, Chinese, French, and German. They introduced a new analysis called Token–Language Alignment, which measures how closely the learned language directions align with vocabulary embeddings, providing insights into the geometric imprint of training data.
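The probing approach described above can be sketched in a few lines. The snippet below is a synthetic illustration, not the authors' code: the "hidden states" are made-up Gaussian clusters standing in for one layer's representations of two languages, and the probe is a minimal logistic regression trained by gradient descent. The dimensions, offsets, and class setup are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one layer's hidden states: two language clusters
# offset along a single direction, mimicking the finding that languages
# occupy distinct linear directions in representation space.
d, n = 64, 200
en = rng.normal(0.0, 1.0, (n, d)) + np.eye(d)[0] * 3.0  # "English" shifted +3 on axis 0
zh = rng.normal(0.0, 1.0, (n, d)) - np.eye(d)[0] * 3.0  # "Chinese" shifted -3 on axis 0
X = np.vstack([en, zh])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Minimal logistic-regression probe trained with plain gradient descent
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)        # gradient step on weights
    b -= lr * np.mean(p - y)                  # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y)
print(f"linear probe accuracy: {acc:.3f}")
```

A real study would fit one such probe per layer on hidden states extracted from the model, and compare it against a small nonlinear (e.g. MLP) probe to measure how much accuracy the extra capacity buys.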

The results show a clear structural imprinting effect tied to pretraining data composition. Chinese-inclusive models, which include about 20% Chinese data, achieved a Match@Peak alignment of 16.43% for Chinese, while English-centric models with over 80% English data achieved only 3.90% for Chinese—a 4.21 times difference. This demonstrates that the geometry of the representation space is shaped by the training distribution, with languages like Chinese forming clearer directions when sufficiently represented. Additionally, languages like Spanish and German reached peak separability earlier in the network, while Chinese required deeper layers, highlighting typological effects.
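One plausible reading of a Match@Peak-style metric is: take a learned language direction and ask what fraction of the vocabulary embeddings most aligned with it actually belong to that language. The sketch below is a guess at that computation on fully synthetic data; the embedding matrix, language labels, and the choice of top-50 are all assumptions, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic vocabulary embeddings with binary language labels.
d, vocab = 32, 1000
E = rng.normal(size=(vocab, d))
lang = rng.integers(0, 2, vocab)        # 0 = other, 1 = target language
E[lang == 1] += np.eye(d)[0] * 2.0      # target-language tokens drift along axis 0

# Pretend this is the probe's learned direction for the target language.
direction = np.eye(d)[0]

# Cosine similarity of every token embedding with the language direction,
# then check how many of the top-aligned tokens carry the target label.
sims = (E @ direction) / (np.linalg.norm(E, axis=1) * np.linalg.norm(direction))
top_k = np.argsort(-sims)[:50]
match_at_peak = np.mean(lang[top_k] == 1)
print(f"fraction of top-50 aligned tokens in target language: {match_at_peak:.2f}")
```

Under this framing, the paper's low average alignment (15.0%) would mean the learned directions point mostly at abstract features rather than at a clean cluster of same-language tokens.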

These findings have significant implications for achieving fairness in multilingual AI. The structural imprinting suggests that post-hoc adjustments, such as fine-tuning, cannot fundamentally alter the underlying geometry established during pretraining. Instead, careful design of data composition from the start is crucial for balanced representation. The study also provides practical tools, like the Match@Peak metric, for diagnosing biases and evaluating the effects of data rebalancing, offering a pathway to more equitable AI systems.

However, the study has limitations. The analysis focused on only five languages and six models, leaving questions about how these patterns extend to other languages and scripts. The low overall Match@Peak alignment, averaging 15.0%, indicates that language directions capture abstract features beyond simple token matching, including script and lexical frequency, which complicates interpretation. Future research could expand to more languages, investigate tokenizer design influences, and conduct interventional studies to better understand causal relationships in language representation.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn