Scaling laws are empirical observations — first formalized by OpenAI in 2020 (Kaplan et al.) and refined by DeepMind in 2022 (Chinchilla, Hoffmann et al.) — that show LLM loss falls predictably as you increase three things: parameter count, training tokens, and total compute. The relationship is a power law: doubling compute typically reduces loss by a known fraction.
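To make the shape of the curve concrete, here is a minimal sketch in Python of the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. The constants are approximately the published fits from Hoffmann et al.; treat them as illustrative, since the exact values depend on the dataset and tokenizer.

```python
# Chinchilla-style parametric scaling law: predicted pre-training loss
# as a function of parameters N and training tokens D (Hoffmann et al., 2022).
# Constants are roughly the paper's published fits; illustrative, not universal.
E = 1.69                 # irreducible loss of the data distribution
A, alpha = 406.4, 0.34   # parameter-count term
B, beta = 410.7, 0.28    # training-token term

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla's own setup)
print(predicted_loss(70e9, 1.4e12))
```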
They matter because scaling laws turned LLM development from craft to recipe. Once you know the curve, you can plan: "to halve the loss, we need 10× more compute and 5× more data." That's how labs justify $100M+ training runs in advance — they're not guessing, they're extrapolating a measured curve. Most progress in capabilities since GPT-2 has come from scaling, not architecture changes.
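That kind of planning is just algebra on a fitted curve. A hedged sketch, assuming loss over the compute range of interest follows L(C) = a·C^(−b); the constants a and b here are invented for illustration, and in practice come from fitting many small-scale training runs:

```python
# Extrapolating a fitted compute-to-loss power law L(C) = a * C**(-b).
# The constants a and b are placeholders for illustration only; real values
# are measured by fitting losses from a sweep of smaller training runs.
a, b = 50.0, 0.05

def loss_at(compute_flops: float) -> float:
    """Loss the fitted curve predicts at a given compute budget."""
    return a * compute_flops**(-b)

def compute_for(target_loss: float) -> float:
    """Invert the power law to find the compute budget for a target loss."""
    return (a / target_loss) ** (1 / b)

current = 1e23                               # FLOPs of the last training run
print(loss_at(current))                      # loss the curve predicts today
print(compute_for(loss_at(current) * 0.9))   # compute needed for a 10% lower loss
```

Once a and b are measured at small scale, the budget for a target loss is a closed-form calculation rather than a guess.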
A concrete example: GPT-3 (175B params) outperformed GPT-2 (1.5B) not because of architectural innovation but because it was 100× larger, trained on more data, and ran more compute. The scaling-laws prediction said this would help; it did. Chinchilla later showed that Kaplan's original law miscalibrated the data-vs-parameters trade-off, allocating too much of the budget to parameters and too little to data; Llama and most modern open-source models train on substantially more data per parameter than GPT-3 did.
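A hedged back-of-the-envelope version of that trade-off: Chinchilla's rule of thumb is roughly 20 training tokens per parameter at compute-optimal scale, combined with the standard approximation that training compute is about 6·N·D FLOPs. The sketch below applies that rule to a GPT-3-sized compute budget; the exact ratio shifts with the fitted constants.

```python
# Rough Chinchilla-style allocation: ~20 tokens per parameter,
# with training compute approximated as C ≈ 6 * N * D FLOPs.
# The 20:1 ratio is a rule of thumb, not an exact constant.
TOKENS_PER_PARAM = 20

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget into (parameters, tokens) under the
    20-tokens-per-parameter heuristic and C = 6 * N * D."""
    n_params = (compute_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# GPT-3 used roughly 3.14e23 FLOPs (175B params, ~300B tokens).
n, d = chinchilla_optimal(3.14e23)
print(f"{n/1e9:.0f}B params, {d/1e12:.2f}T tokens")
```

On GPT-3's compute budget the heuristic suggests roughly 50B parameters trained on about 1T tokens, which is the direction Llama-style models moved in.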
The debate: do scaling laws keep going? Some researchers argue we're hitting diminishing returns and need new ideas; others see the curves continuing. Test-time scaling (o1, DeepSeek R1) and Mixture-of-Experts opened new dimensions. Related: emergent abilities, frontier model, Chinchilla, compute.