
Synthetic Data Hits a Ceiling: New Scaling Laws Reveal the 30% Threshold for LLM Pre-Training

A 100,000 GPU-hour study across 1,000+ language models finds that mixing one-third synthetic data with two-thirds human text accelerates training up to tenfold, but pushing past that ratio triggers the onset of model collapse.



The question of how much synthetic data a large language model can absorb before its outputs begin to degrade has moved from theoretical speculation to empirical measurement. A team from Meta, Virginia Tech, and other institutions ran more than 1,000 pre-training experiments consuming over 100,000 GPU hours to map the scaling laws of synthetic data in LLM pre-training. Their findings, published in Demystifying Synthetic Data in LLM Pre-training, draw a clear line: rephrased synthetic data mixed at roughly 30% with natural web text can accelerate training by 5 to 10 times, but pure synthetic corpora, particularly textbook-style generated text, show degradation patterns consistent with model collapse.

The distinction between synthetic data types matters more than most practitioners assume. Rephrased data, where a capable model rewrites existing human text while preserving its factual structure, behaves fundamentally differently from generated data, where a model produces novel text from a prompt. The study found that rephrased synthetic data alone still underperforms natural web text. The acceleration only appears in the mixture: one-third rephrased synthetic, two-thirds organic human writing. At larger data budgets, this blend reaches the same validation loss that pure human data achieves, but arrives there 5 to 10 times faster.
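The reported mixture is simple to operationalize. As a minimal sketch (the function and variable names are illustrative, not from the paper), a data pipeline can cap the synthetic share of the training corpus at the studied one-third ratio:

```python
import random

def blend_corpus(human_docs, synthetic_docs, n_total, synthetic_ratio=1/3, seed=0):
    """Sample a training mix whose synthetic share is capped at synthetic_ratio.

    The 1/3 rephrased-synthetic / 2/3 human split mirrors the ratio the
    study reports; everything else here is an illustrative assumption.
    """
    rng = random.Random(seed)
    n_syn = min(len(synthetic_docs), round(n_total * synthetic_ratio))
    n_hum = min(len(human_docs), n_total - n_syn)
    mix = rng.sample(synthetic_docs, n_syn) + rng.sample(human_docs, n_hum)
    rng.shuffle(mix)  # interleave so every batch sees both sources
    return mix
```

In practice the same cap would be applied at the shard or sampling-weight level of a real pre-training loader, but the invariant is the one above: the synthetic fraction never exceeds roughly one-third of the tokens seen.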

The scaling dynamics contain a counterintuitive finding. Larger generator models do not necessarily produce better pre-training data than models in the 8-billion-parameter range. A frontier model synthesizing training text for a smaller student offers diminishing returns past a certain capability gap. The practical implication is that organizations do not need their most expensive models running data-generation pipelines. A well-tuned mid-range model produces synthetic data of comparable training value.

The model collapse question

A separate line of research from Stanford, led by Matthias Gerstgrasser, Rylan Schaeffer, and collaborators including David Donoho and Sanmi Koyejo, directly addressed whether model collapse is inevitable when models train on their own outputs. Their paper, Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data, tested causal transformers for language modeling, diffusion models for molecule generation, and variational autoencoders on image data.

The answer depends entirely on how the data pipeline is managed. When synthetic data replaces real data across successive training generations, collapse is reliable and progressive. The tails of the original distribution vanish first. Rare patterns, minority viewpoints, unusual phrasings, and specialized knowledge disappear from the model's output space. Each generation produces a slightly more generic, slightly less capable model.

When synthetic data accumulates alongside the original human corpus rather than replacing it, collapse does not occur. The error rate converges to a finite upper bound independent of the number of synthetic generations. This finding held across every architecture and hyperparameter configuration tested. The practical takeaway is architectural: data pipelines must append synthetic generations to the training set, never substitute them.
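The accumulate-versus-replace difference shows up even in a toy resampling loop. The sketch below is not the paper's code, just an assumed minimal analogue: each "generation" fits the empirical distribution of its training data and samples fresh synthetic points from it. Under replacement, a rare token that gets unlucky in one generation vanishes permanently; under accumulation, the original data stays in the pool, so the support never shrinks:

```python
import random

def resample(data, n, rng):
    # "Train" on data by fitting its empirical distribution,
    # then draw n fresh synthetic samples from that fit.
    return rng.choices(data, k=n)

rng = random.Random(42)
# Ten common tokens (100 occurrences each) plus ten rare tail tokens.
real = [t for t in "abcdefghij" for _ in range(100)] + list("KLMNOPQRST")

replaced, accumulated = list(real), list(real)
for _ in range(30):
    replaced = resample(replaced, len(real), rng)          # substitute: old data discarded
    accumulated += resample(accumulated, len(real), rng)   # append: real data kept

# Replacement loses tail tokens for good; accumulation preserves all 20.
print(len(set(replaced)), len(set(accumulated)))
```

The asymmetry is the point: once a rare token draws zero samples in the replacement chain, no later generation can recover it, which is exactly the tail-vanishing behavior the Stanford paper describes.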

The human data constraint

The urgency behind this research connects to a resource constraint the industry has been approaching for years. Projections from Epoch AI estimate that if current scaling trends continue, language models will require training datasets roughly equal to the entire stock of publicly available human-generated text sometime between 2026 and 2032. The agentic AI systems now being deployed across enterprises, built on models such as Google's Gemma 4 and Anthropic's Claude with tool use, depend on the pre-training quality that only human-written text has reliably delivered.

The 30% threshold identified by the Meta-led study is not a hard physical limit. It is the empirical boundary where returns from synthetic data begin to flatten under current generation techniques. Rephrasing preserves the distributional properties of human text closely enough that models treat it as slightly noisy real data. Pure generation introduces systematic biases that compound across scale. The difference is not subtle. Textbook-style synthetic data produced notably higher loss on downstream domain evaluations, particularly in knowledge-intensive tasks where factual grounding matters.

What this means for open model development

For the growing ecosystem of open-weight models, synthetic data management is now a core infrastructure concern. The ability to train competitive models depends on accessing or generating enough high-quality data to keep pace with proprietary labs that have exclusive licensing agreements with publishers, social platforms, and content archives. Knowing that synthetic data can safely constitute up to 30% of the pre-training mix, and that accumulation rather than replacement avoids collapse, gives open model developers a workable framework. But it also confirms that human-generated data remains the irreplaceable foundation. No amount of model-on-model synthesis can substitute for the diversity, specificity, and unpredictability of text written by people with direct experience of the world.

About the Author

Guilherme A.


Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn