Decoding LLM Scaling Laws: Compute, Data, and Parameter Optimization
The development of Large Language Models is heavily guided by empirical scaling laws, which describe how model loss falls as a power law in compute, dataset size, and parameter count. Understanding these laws is critical for allocating a large, fixed compute budget efficiently during pre-training.
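To make the power-law shape concrete, here is a minimal sketch of a loss curve of the form L(x) = a * x^(-alpha), where x stands for parameters, tokens, or compute. The function name and both constants are hypothetical, chosen only for illustration; they are not fitted values from either paper.

```python
def power_law_loss(scale: float, coefficient: float, exponent: float) -> float:
    """Loss predicted by the power law L(x) = coefficient * x ** (-exponent)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return coefficient * scale ** (-exponent)


# Hypothetical constants for illustration only; they are not fitted values.
COEFFICIENT = 10.0
EXPONENT = 0.08

for n_params in (1e8, 1e9, 1e10):
    loss = power_law_loss(n_params, COEFFICIENT, EXPONENT)
    print(f"{n_params:.0e} parameters -> predicted loss {loss:.3f}")

# Under a power law, every 10x increase in scale multiplies the loss by the
# same factor, 10 ** (-EXPONENT), regardless of the starting point.
step_factor = (
    power_law_loss(1e9, COEFFICIENT, EXPONENT)
    / power_law_loss(1e8, COEFFICIENT, EXPONENT)
)
print(f"loss multiplier per 10x scale: {step_factor:.3f}")
```

The constant multiplier per decade is what makes these curves appear as straight lines on a log-log plot, and it is why returns diminish in absolute terms as models grow.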
The initial breakthrough in this area was published by Kaplan et al. at OpenAI in 2020. Their research suggested that model performance scales predictably with the number of parameters and the amount of compute, and depends only weakly on architectural details such as depth versus width. A key takeaway from the Kaplan scaling laws was the recommendation to scale model size faster than dataset size as compute grows. Consequently, many early models, such as GPT-3 (175B parameters), were trained on datasets (around 300 billion tokens) that were small relative to their parameter counts.
However, this paradigm was overturned by DeepMind's Chinchilla paper (Hoffmann et al.) in 2022. The researchers found that the large models of the time were significantly under-trained for their size. By systematically varying both model size and training tokens under a fixed compute budget, they established the "Chinchilla Optimal" scaling laws. Their findings showed that, as the compute budget grows, model size and training tokens should be scaled in equal proportion: doubling the parameter count calls for doubling the number of training tokens. A widely used rule of thumb drawn from these results is approximately 20 training tokens for every parameter in the model.
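The 20-tokens-per-parameter rule can be turned into a rough budget calculator. The sketch below is illustrative only: it assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 * N * D), which is not stated in the text above, and it treats the ratio of 20 as a rule of thumb rather than an exact constant. The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training compute C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rule of thumb quoted in the text


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) with tokens = ratio * params."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget equal to training 70B parameters on 1.4T tokens under C ≈ 6 * N * D.
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_allocation(budget)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")

# Quadrupling compute doubles both quantities: equal-proportion scaling.
big_params, big_tokens = compute_optimal_allocation(4 * budget)
print(f"{big_params / params:.2f}x parameters, {big_tokens / tokens:.2f}x tokens")
```

Because both quantities grow with the square root of compute under these assumptions, neither parameters nor data is favored as the budget increases, in contrast to the earlier recommendation to grow model size faster.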
This finding meant that a smaller model trained on far more data could outperform a much larger model trained on less data, while also being significantly cheaper to deploy for inference. For example, Chinchilla (70B parameters) outperformed Gopher (280B parameters) on a comparable compute budget because it was trained on 1.4 trillion tokens, compared to Gopher's 300 billion. Modern open-weights models such as LLaMA push this idea further: they are often trained well beyond the Chinchilla-optimal point (e.g., LLaMA-3 8B trained on 15 trillion tokens), deliberately trading training compute efficiency for a smaller model that is cheaper to serve at inference time.
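The gap between these training regimes is easiest to see as a tokens-per-parameter ratio. The following sketch uses only the approximate parameter and token counts quoted in this article; the dictionary and function names are illustrative.

```python
# Approximate (parameters, training tokens) as quoted in this article.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-3 8B": (8e9, 15e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Number of training tokens seen per model parameter."""
    if params <= 0:
        raise ValueError("params must be positive")
    return tokens / params


ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name:<12} {ratio:8.1f} tokens per parameter")
```

With these figures, GPT-3 and Gopher sit below 2 tokens per parameter, Chinchilla lands at 20, and LLaMA-3 8B is close to 1,900, which shows how far inference-oriented training has moved past the compute-optimal ratio.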
References:
- DeepMind Blog: An empirical analysis of compute-optimal large language model training - https://deepmind.google/discover/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training/
- OpenAI Research: Scaling Laws for Neural Language Models - https://arxiv.org/abs/2001.08361