Adapting large language models to new languages or specialized fields like math and coding is a costly and time-consuming process, often requiring weeks of trial and error to find the right mix of training data. Traditionally, researchers must decide on data mixture ratios before training begins, and a poor choice can waste immense computational resources before its effects become apparent. This bottleneck has limited the efficiency of continual pre-training, a common technique for tailoring models to specific needs, making it a high-stakes gamble for organizations with limited budgets.
Researchers Haiyue Song and Masao Utiyama from Japan's National Institute of Information and Communications Technology have developed a method called OPTIMER that decouples data ratio selection from model training, allowing optimization after the fact. Instead of fixing data mixtures upfront, OPTIMER trains separate models on each dataset independently, extracts what they call 'distribution vectors'—representing the parameter shifts induced by each dataset—and then searches for optimal combination weights using Bayesian optimization. This approach consistently outperformed traditional data mixing and model averaging baselines in experiments, with improvements of 2.1 to 6.7 points on average scores across tasks.
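The core idea reduces to simple parameter arithmetic. The sketch below is illustrative rather than the authors' code: a model is represented as a dict of NumPy arrays, and all names (`distribution_vector`, `merge`, the toy "models") are hypothetical.

```python
import numpy as np

def distribution_vector(trained, base):
    """Parameter shift induced by continual pre-training on one dataset."""
    return {name: trained[name] - base[name] for name in base}

def merge(base, vectors, weights):
    """Add each distribution vector to the base model, scaled by its weight."""
    merged = {name: p.copy() for name, p in base.items()}
    for vec, w in zip(vectors, weights):
        for name in merged:
            merged[name] += w * vec[name]
    return merged

# Toy demo with a single two-parameter "layer".
base = {"w": np.array([1.0, 1.0])}
ja_model = {"w": np.array([1.5, 1.0])}    # stand-in for a Japanese-CPT model
math_model = {"w": np.array([1.0, 2.0])}  # stand-in for a math-CPT model

vectors = [distribution_vector(m, base) for m in (ja_model, math_model)]
merged = merge(base, vectors, weights=[0.5, 0.25])
print(merged["w"])  # [1.25 1.25]
```

Because merging is cheap compared to training, many candidate weight settings can be scored after the per-dataset models are trained once.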
The methodology involves training one continual pre-training model per dataset, such as Japanese, Chinese, math, or code data, each on 1 billion tokens, using the Gemma 3 27B model as a base. After training, a distribution vector is extracted for each dataset by subtracting the base model's parameters from the trained model's parameters. These vectors are then combined with an instruction-tuning vector to restore general capabilities, and Bayesian optimization via the Tree-structured Parzen Estimator searches for the best merge weights in minutes per trial, compared to the days or weeks required to retrain with traditional methods. The search process, detailed in Algorithm 1 of the paper, uses a development set to score candidate combinations efficiently, converging within about 100 trials.
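The search loop can be sketched in a few lines. The paper uses Bayesian optimization with the Tree-structured Parzen Estimator; the dependency-free stand-in below uses plain random search, and a TPE sampler (for example, Optuna's default) would slot into the same loop. The `score_fn` and the toy dev-set objective here are assumptions for illustration.

```python
import random

def search_weights(score_fn, n_vectors, n_trials=100, seed=0):
    """Search merge weights that maximize a development-set score.

    Each trial is cheap: merge the fixed distribution vectors with the
    candidate weights and evaluate, with no retraining. Random search is
    used here as a stand-in for the paper's TPE-based Bayesian optimization.
    """
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.uniform(0.0, 1.0) for _ in range(n_vectors)]  # candidate weights
        s = score_fn(w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Hypothetical dev-set objective that peaks at weights (0.7, 0.3).
dev_score = lambda w: -((w[0] - 0.7) ** 2 + (w[1] - 0.3) ** 2)
weights, score = search_weights(dev_score, n_vectors=2)
```

In practice `score_fn` would merge the vectors into the base model and run the development-set benchmarks, which is what makes the roughly 100-trial budget feasible.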
Results from 16 benchmarks covering English, Japanese, Chinese, math, and code tasks show that OPTIMER achieved the highest average scores in every dataset combination tested, such as Japanese plus math and Japanese plus Chinese plus math. For example, in the Japanese plus math combination, OPTIMER scored 84.46 on GSM8K math problems, outperforming data mixture baselines. It also maintained strong TruthfulQA scores between 51 and 55, where other methods degraded to 30–49, indicating better preservation of the base model's calibration. Additionally, the optimized weights can be interpreted as data mixture ratios; retraining with these ratios improved data mixture continual pre-training, as shown by DataMixOptiMer ratio models outperforming uniform ratio baselines.
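Reinterpreting merge weights as data mixture ratios amounts to normalizing the weights so they sum to one. A minimal sketch (the weight values here are made up, not the paper's):

```python
# Hypothetical merge weights found by the search, one per dataset.
merge_weights = {"japanese": 0.6, "math": 0.2}

# Normalize to obtain mixture ratios usable for data-mixing retraining.
total = sum(merge_weights.values())
mixture_ratios = {name: w / total for name, w in merge_weights.items()}
# e.g. japanese -> 0.75, math -> 0.25
```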
The implications of this research are significant for both academic and industrial settings, as it reduces the computational cost and time required to adapt AI models to new languages or domains. OPTIMER's flexibility allows the same set of distribution vectors to be re-optimized for different objectives without retraining, enabling on-demand creation of target-tailored models. For instance, re-optimizing for Japanese tasks yielded the best overall performance, suggesting cross-lingual benefits. This could accelerate the development of multilingual AI tools and specialized applications in areas like education or coding assistance, making AI adaptation more accessible and efficient.
However, the study acknowledges limitations, including that OPTIMER was tested on models trained with 1 billion tokens, and larger-scale training might require iterative approaches to prevent divergence from the base model. The experiments were conducted only on Gemma 3 27B and Gemma-SEA-LION-v4-27B models, leaving generalization to other architectures like Llama-3 or Qwen-3 for future work. Additionally, while OPTIMER outperformed uniform data mixing, it was not compared to advanced ratio optimization methods like DoReMi or RegMix, which could narrow the performance gap. The evaluation used 1-shot prompting across all benchmarks, so absolute scores may differ from leaderboard results obtained with task-specific settings, though relative rankings remain valid.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.