
Technique

Knowledge distillation

Training a smaller "student" model to match the outputs of a larger "teacher" model, producing a cheaper model that retains much of the teacher's quality.

Knowledge distillation trains a small model by having it learn to imitate a larger one. Instead of training the student on raw labeled data, you feed it the teacher's predictions, often the full probability distribution over tokens rather than just the top answer. That richer signal lets the student absorb nuanced behavior that is hard to learn from labels alone; a minimal loss sketch appears at the end of this entry.

It matters because frontier models are too expensive to deploy at scale. A 70B model can be impractical for production at high QPS, but a distilled 7B that captures roughly 90% of its quality is shippable. Most "flash" or "mini" model variants (GPT-4o-mini, Claude Haiku, Gemini Flash) are partly the result of distillation from their larger siblings.

A concrete example: Stanford's Alpaca was an early demonstration. The team fine-tuned a 7B Llama on outputs from text-davinci-003, a much larger model, and got surprisingly capable instruction following from a small model on a tiny budget. The Chinese open-source community has used the same pattern aggressively, fine-tuning small models on outputs from GPT-4 or Claude.

A legal note: most commercial APIs prohibit using their outputs to train competing models. The technique is well established academically; the deployment realities depend on which provider's data you use.

Related: fine-tuning, teacher-student, model compression, SFT.
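
As a rough sketch of the soft-label idea, here is a minimal PyTorch-style distillation loss. The function name, the temperature, and the alpha weighting are illustrative assumptions, not details taken from any particular provider's pipeline; it follows the classic recipe of blending a KL term against the teacher's softened distribution with ordinary cross-entropy on the labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-label KL term (student imitates the teacher's full
    distribution) and ordinary cross-entropy on the ground-truth labels."""
    # Soften both distributions with a temperature > 1 so the teacher's
    # relative preferences among near-miss tokens carry more weight.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between student and teacher; the T^2 factor keeps the
    # gradient scale comparable to the hard-label term.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the original labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce
```

In a token-level language-model setup, the logits would be flattened to (batch * sequence, vocab) and the teacher run in inference mode, so only the student receives gradients.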

Last updated: 2026-04-29
