Chengshuo Dai

Hitting the Data Wall: Scaling Laws and the Rise of Synthetic Data

LLM Fundamentals, Supervised Fine-tuning

The trajectory of Large Language Model (LLM) development has been largely defined by empirical scaling laws. These laws, first formalized by Kaplan et al. at OpenAI in 2020 and revised by DeepMind's Chinchilla paper (Hoffmann et al., 2022), describe how a model's loss falls predictably as a power-law function of three variables: the amount of compute used for training, the number of parameters in the model, and the size of the training dataset. The Chinchilla findings established a crucial rule of thumb: for compute-optimal training, the number of training tokens should scale in proportion to the number of model parameters, at roughly 20 tokens per parameter.
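
To make the 20-tokens-per-parameter rule concrete, the sketch below splits a training compute budget into a parameter count and a token count. It relies on one assumption that is not stated above: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D). The function name and constants are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20.0        # rule-of-thumb ratio quoted in the text
FLOPS_PER_PARAM_TOKEN = 6.0    # ASSUMPTION: training compute C ~= 6 * N * D


def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under D = 20*N and C = 6*N*D."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N^2, so N = sqrt(C / 120).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e23, 5.88e23):
        n, d = compute_optimal_allocation(budget)
        print(f"C = {budget:.2e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Under these assumptions, a budget of 5.88e23 FLOPs works out to about 70 billion parameters and 1.4 trillion tokens, which is the configuration Chinchilla itself was trained with.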

This insight fundamentally shifted the industry's approach. Instead of simply building ever-larger models (like the 175-billion parameter GPT-3, which was trained on a relatively small dataset of 300 billion tokens), developers began training smaller models on vastly larger datasets. For example, Meta's Llama 3 8B model was trained on more than 15 trillion tokens, pushing far beyond the Chinchilla-optimal point to prioritize inference efficiency over training-compute efficiency. However, this insatiable appetite for data has led the AI industry to a critical juncture: the impending "Data Wall."
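
The gap between these two training recipes is easier to see as a tokens-per-parameter ratio. The short calculation below uses only the figures quoted above (with 15 trillion taken as a round number for Llama 3 8B) and compares each ratio to the 20:1 rule of thumb. The helper name is made up for this example.

```python
CHINCHILLA_RATIO = 20.0  # tokens per parameter, the rule of thumb from the text


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return n_tokens / n_params


# (parameters, training tokens) as quoted in the paragraph above
models = {
    "GPT-3 175B": (175e9, 300e9),
    "Llama 3 8B": (8e9, 15e12),
}

for name, (n_params, n_tokens) in models.items():
    ratio = tokens_per_parameter(n_params, n_tokens)
    print(f"{name}: {ratio:,.1f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.2f}x the 20:1 rule)")
```

GPT-3 comes out at about 1.7 tokens per parameter, well under the rule, while Llama 3 8B sits at 1,875, almost 94 times over it.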

Researchers estimate that the total stock of high-quality, human-generated text on the internet (books, scientific papers, Wikipedia articles, and curated web data) is finite. At the current rate of consumption, leading AI labs are projected to exhaust the supply of high-quality public text within the next few years. Once this limit is reached, simply adding compute and parameters will yield diminishing returns, because the scaling laws assume the training dataset keeps growing along with the model.

To circumvent this Data Wall, the industry has aggressively pivoted towards Synthetic Data generation. Synthetic data refers to text, code, or mathematical reasoning generated not by humans, but by advanced, pre-existing LLMs (often referred to as "Teacher" models). The premise is that a highly capable model like GPT-4 can generate vast quantities of high-quality training data to train smaller, more efficient "Student" models, or even to bootstrap the next generation of frontier models.

One of the earliest and most influential techniques in this domain is Self-Instruct. The framework begins with a small seed set of human-written instructions and responses (175 tasks in the original paper). A Teacher LLM is then prompted with a few examples sampled from the pool to generate new, diverse instructions, and subsequently generates the corresponding responses. Candidates that are low quality or too similar to existing instructions are filtered out, the survivors are added back to the pool, and the process repeats until the dataset is large enough for instruction tuning. This technique was instrumental in the creation of early open-source models like Alpaca, which fine-tuned LLaMA 7B on about 52,000 examples generated by OpenAI's text-davinci-003 and demonstrated that a small model could achieve impressive conversational abilities when trained on data from a larger, proprietary model.
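
The loop described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation. The two teacher callables are hypothetical placeholders for whatever LLM API is used, and `difflib.SequenceMatcher` stands in for the ROUGE-L overlap filter of the original Self-Instruct paper.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Optional


def is_novel(candidate: str, pool: list[str], max_similarity: float = 0.7) -> bool:
    """Reject a candidate that is too similar to any instruction already in the pool.

    ASSUMPTION: character-level SequenceMatcher ratio is a stand-in for ROUGE-L.
    """
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < max_similarity
        for existing in pool
    )


def self_instruct(
    seed_tasks: list[dict],
    propose_instruction: Callable[[list[str]], str],  # hypothetical teacher call
    answer_instruction: Callable[[str], str],         # hypothetical teacher call
    rounds: int,
    num_examples: int = 3,
    rng: Optional[random.Random] = None,
) -> list[dict]:
    """Grow a task pool from seed tasks; return only the newly generated tasks."""
    rng = rng or random.Random(0)
    pool = list(seed_tasks)
    for _ in range(rounds):
        instructions = [task["instruction"] for task in pool]
        # Show the teacher a few in-context examples sampled from the current pool.
        examples = rng.sample(instructions, k=min(num_examples, len(instructions)))
        candidate = propose_instruction(examples).strip()
        # Filter step: skip empty or near-duplicate instructions.
        if not candidate or not is_novel(candidate, instructions):
            continue
        pool.append({"instruction": candidate, "response": answer_instruction(candidate)})
    return pool[len(seed_tasks):]


if __name__ == "__main__":
    seeds = [
        {"instruction": "Summarize the following paragraph.", "response": "..."},
        {"instruction": "List three uses for a paperclip.", "response": "..."},
    ]
    # Canned outputs play the role of the teacher so the sketch runs offline.
    canned = iter([
        "Translate this sentence into French.",
        "Summarize the following paragraph.",  # duplicate of a seed, gets filtered
        "Write a haiku about autumn.",
    ])
    new_tasks = self_instruct(
        seeds,
        propose_instruction=lambda examples: next(canned),
        answer_instruction=lambda instruction: f"<teacher response to: {instruction}>",
        rounds=3,
    )
    for task in new_tasks:
        print(task["instruction"])
```

In a real pipeline the two callables would wrap prompts sent to the teacher model, and the filter would usually include quality heuristics (length limits, banned keywords) in addition to the similarity check.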

Building upon Self-Instruct, researchers developed more sophisticated methods like Evol-Instruct. While Self-Instruct generates diverse prompts, it often struggles to generate highly complex or challenging reasoning tasks. Evol-Instruct addresses this by taking an existing instruction and systematically "evolving" it to increase its difficulty. The Teacher LLM is prompted to apply various mutation strategies, such as adding constraints, deepening the required reasoning steps, or complicating the input format. For example, a simple prompt like "Write a Python function to sort a list" might be evolved into "Write a highly optimized Python function to sort a list of dictionaries by a specific key, handling potential KeyError exceptions and ensuring O(n log n) time complexity."
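
As a rough sketch of the evolution step, the code below wraps an instruction in a mutation prompt and asks a teacher to rewrite it, repeating for a few generations. The prompt wording paraphrases the strategies named above and is not taken from the WizardLM paper. The `teacher` callable is a hypothetical placeholder for an LLM call, and the unchanged-output check is only a crude stand-in for filtering out failed evolutions.

```python
import random
from typing import Callable, Optional

# Illustrative paraphrases of the mutation strategies mentioned in the text.
MUTATIONS = {
    "add_constraints": "Add one more constraint or requirement to the instruction.",
    "deepen_reasoning": "Rewrite the instruction so it explicitly requires more reasoning steps.",
    "complicate_input": "Rewrite the instruction so its input has a more complex format.",
}


def build_evolution_prompt(instruction: str, strategy: str) -> str:
    """Wrap an instruction in a prompt asking the teacher for a harder version."""
    return (
        "Rewrite the instruction below into a harder version.\n"
        f"{MUTATIONS[strategy]}\n"
        "Keep it answerable and reply with the new instruction only.\n\n"
        f"Instruction: {instruction}"
    )


def evolve(
    instruction: str,
    teacher: Callable[[str], str],  # hypothetical LLM call: prompt in, text out
    generations: int = 3,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Return the lineage [original, evolved once, evolved twice, ...]."""
    rng = rng or random.Random(0)
    lineage = [instruction]
    for _ in range(generations):
        strategy = rng.choice(sorted(MUTATIONS))
        candidate = teacher(build_evolution_prompt(lineage[-1], strategy)).strip()
        # Stop if the teacher returned nothing or failed to change the instruction.
        if not candidate or candidate == lineage[-1]:
            break
        lineage.append(candidate)
    return lineage


if __name__ == "__main__":
    # Toy teacher: echoes the instruction with an extra requirement appended.
    def toy_teacher(prompt: str) -> str:
        original = prompt.split("\n\nInstruction: ", 1)[1]
        return original + " Handle empty input gracefully."

    for step, text in enumerate(evolve("Write a Python function to sort a list.", toy_teacher, 2)):
        print(step, text)
```

Every instruction in the lineage can then be sent to the teacher for a response, so the final dataset spans a range of difficulty rather than only the hardest variants.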

The resulting dataset of evolved instructions and their corresponding responses (generated by the Teacher model) provides a much richer and more challenging training signal for the Student model. This approach was famously used to train the WizardLM series of models, which reported strong results among open-source models on complex instruction-following evaluations. As the supply of human data dwindles, the continued refinement of synthetic data generation techniques, with attention to diversity, accuracy, and complexity, is likely to be a primary driver in sustaining the progress that LLM scaling laws predict.

References:

  1. Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning - https://arxiv.org/abs/2211.04325
  2. WizardLM: Empowering Large Language Models to Follow Complex Instructions (Evol-Instruct) - https://arxiv.org/abs/2304.12244