AIResearch

AI Finds Smarter Way to Train Smaller Models

A new method uses AI's own internal 'map' to generate training data that targets weak spots, boosting performance by up to 39% without needing more real examples.

AI Research
March 27, 2026
4 min read

As artificial intelligence models grow larger, they become more powerful but also more resource-intensive, raising concerns about sustainability and accessibility. Researchers have turned to synthetic data generation (SDG) as a way to improve smaller, more efficient models by using a larger 'teacher' model to create training examples. However, a key challenge has been ensuring this synthetic data is diverse and high-quality, as traditional methods often produce repetitive or unbalanced examples that fail to address a model's specific weaknesses. A new study from IBM Research introduces a targeted approach that analyzes a model's internal representation of data to generate synthetic examples precisely where the model struggles, leading to significant performance gains.

The researchers discovered a strong correlation between the density of training examples in a model's embedding space—a kind of internal map where similar data points cluster together—and the model's accuracy in those regions. In simpler terms, areas with fewer training examples correspond to where the model performs poorly, while denser areas show better performance. This insight, supported by a Pearson's correlation coefficient of 0.813 and a Spearman's coefficient of 0.806, indicates that data sparsity is a major factor in accuracy disparities. By targeting sparse regions, the method aims to fill these gaps with synthetic data, effectively teaching the model what it doesn't know.
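The density–accuracy relationship above can be sketched with a toy experiment. This is not the paper's analysis—the region counts and accuracies here are simulated—but it shows the kind of per-region correlation test (Pearson and Spearman) that could support the finding:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Toy setup: 50 regions of the embedding space, each with a
# training-example count (density) and a held-out accuracy.
density = rng.integers(1, 200, size=50).astype(float)

# Simulate the reported trend: accuracy rises with local density, plus noise.
accuracy = np.clip(
    0.3 + 0.5 * (density / density.max()) + rng.normal(0, 0.05, size=50),
    0.0, 1.0,
)

r_pearson, _ = pearsonr(density, accuracy)
r_spearman, _ = spearmanr(density, accuracy)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

On real data, strongly positive coefficients like the paper's 0.813 and 0.806 would indicate that sparse regions are reliable markers of weak spots worth targeting.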

To implement this, the team developed a pipeline that first computes embeddings for each example in a labeled dataset using the target model, reducing dimensionality to 2D or 3D for visualization. They then identify sparse regions in this embedding space by applying a grid and thresholding on density, as illustrated in Figure 2. For each sparse region, two seed examples are selected from the existing data on opposing sides of the region, and their embeddings are interpolated to create a new point inside the sparse area. This interpolation averages weighted token embeddings and attention weights, then decodes the result back into natural language using a prompt that instructs the model to copy the input. Finally, a teacher model generates a new synthetic example based on these seeds and the decoded text, as shown in Figure 1.
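The sparse-region detection and seed-interpolation steps can be illustrated with a simplified sketch. The paper interpolates token embeddings and attention weights and decodes them back to text; the version below operates directly on 2D points (as in the reduced visualization space) to show the grid-thresholding and averaging logic only. Function names and thresholds are illustrative, not from the paper:

```python
import numpy as np

def sparse_cells(points, grid=8, min_count=3):
    """Bin 2-D embedding points into a grid x grid histogram and return
    the (row, col) indices of cells with fewer than min_count examples."""
    counts, x_edges, y_edges = np.histogram2d(
        points[:, 0], points[:, 1], bins=grid
    )
    rows, cols = np.where(counts < min_count)
    return list(zip(rows, cols)), (x_edges, y_edges)

def interpolate_seeds(emb_a, emb_b, alpha=0.5):
    """Weighted average of two seed embeddings, yielding a new point
    intended to land in the sparse region between them."""
    return alpha * emb_a + (1 - alpha) * emb_b

rng = np.random.default_rng(1)

# Toy 2-D embeddings clustered in one corner, leaving the rest sparse.
pts = rng.normal(loc=[0.2, 0.2], scale=0.1, size=(200, 2))

cells, _ = sparse_cells(pts, grid=8, min_count=3)
print(f"{len(cells)} sparse cells out of 64")

# Interpolate two seeds; in the full pipeline this point would be
# decoded back to text and handed to the teacher model.
mid = interpolate_seeds(pts[0], pts[1])
```

In the actual method, the interpolated embedding is decoded into natural language and passed, along with the two seed examples, to the teacher model that writes the new synthetic training example.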

The experimental evaluation focused on math reasoning tasks, using three target models—Granite 3 8B code instruct, Granite 3.1 8B instruct, and Mistral 7B—and datasets including MetaMathQA, GSM8K, and MATH. Results in Table 1 demonstrate that the embedding-based method, dubbed EmbedSDG, consistently outperforms random seed selection across all models and benchmarks. For instance, Mistral 7B on GSM8K saw accuracy jump from 0.354 with random selection to 0.62 with EmbedSDG using 500 examples, a 39% improvement. Similarly, Granite 3.1 instruct on MATH improved by up to 16% over the base model. The method was most effective with fewer examples, as it targets the sparsest regions first, with performance gains diminishing as more data is added and sparsity lessens.

This approach has practical implications for making AI more efficient and accessible, as it allows smaller models to achieve performance closer to larger ones without requiring massive datasets or computational resources. By focusing on a model's specific weaknesses, it could lead to more tailored AI systems in fields like education or specialized domains where data is scarce. The researchers note that their method builds on prior work but differs by operating in the embedding space and accounting for the target model's shortcomings, offering a more nuanced way to enhance synthetic data quality.

However, the study has limitations. It was evaluated on only three models and two math datasets, which may limit generalizability to other domains, and the approach relies on models being fine-tuned on disclosed datasets. Additionally, while the method aims to improve smaller models, the computational resources required for larger variants remain a barrier in resource-constrained environments. Future work could explore multi-task embedding spaces to generate more complex instructions, potentially expanding its applicability beyond math reasoning.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn