What is Knowledge Distillation for LLMs?
LLMs are remarkably good at reasoning through hard problems, but they are huge and demand a lot of computing power.
That makes them hard to run on small devices, although I kind of love experimenting on my Raspberry Pi anyway.
Knowledge distillation helps by transferring knowledge from big models into much smaller ones, so the small models can still perform well while using far less compute.
DeepSeek, for example, has released distilled versions of its R1 model built on top of smaller Qwen and Llama models.
How Does Knowledge Distillation Work?
Knowledge distillation is a process where a small model (called the "Student") learns from a larger, well-trained model (called the "Teacher").
Instead of learning directly from data, the student model learns from the outputs of the teacher model.
These outputs include "soft" probabilities and reasoning patterns.
Learning this way typically produces a stronger, more efficient small model than training it from scratch on the raw data alone.
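To make those "soft" probabilities concrete, here is a minimal PyTorch sketch of the classic distillation loss in the style of Hinton et al.: the student is trained on a blend of ordinary cross-entropy and a KL-divergence term that pulls its softened output distribution toward the teacher's. The temperature and alpha values are illustrative, not tuned, and the toy batch stands in for what would be per-token logits in an LLM.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target loss that
    pushes the student's distribution toward the teacher's."""
    # Soften both distributions with the temperature, then measure how
    # far the student is from the teacher (KL divergence).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd_loss = kd_loss * (temperature ** 2)  # standard scaling factor

    # Ordinary cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # alpha controls how much weight the teacher's soft targets get.
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```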
How Does Distillation Improve Reasoning in Small Models?
Learning from Teacher’s Reasoning Patterns:
The student model mimics the way the teacher solves problems.
This helps it develop better reasoning skills, such as breaking down problems and identifying key information.
Better Generalization:
Large models are trained on vast amounts of data, making them good at handling different situations.
When a small model learns from a large one, it inherits some of this ability, making it more reliable.
Higher Efficiency:
Distilled models are much smaller and require less computing power.
This makes them suitable for mobile phones, smart devices, and other low-power systems like the Raspberry Pi.
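To give a feel for how lightweight these distilled models are, here is a rough sketch of loading one of the publicly released DeepSeek-R1 distillations with the Hugging Face transformers library. Treat it as illustrative: on something like a Raspberry Pi you would realistically want a quantized build (for example a GGUF file run through llama.cpp) rather than full-precision weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the distilled R1 checkpoints (~1.5B parameters); swap in
# whatever small model you actually have on hand.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```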
Recent Advances in Distillation for Reasoning
Chain-of-Thought (CoT) Distillation:
Large models explain their reasoning step by step. Teaching these explanations to smaller models improves their ability to reason logically.
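Here is a minimal sketch of the data-generation side of CoT distillation: prompt a teacher to "think step by step" and keep the full reasoning trace as a fine-tuning target for the student. The tiny gpt2 checkpoint is only a stand-in teacher so the snippet runs anywhere, and the prompt format is my own placeholder, not a prescribed recipe.

```python
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")  # stand-in for a large teacher

questions = [
    "If a pencil costs 3 dollars, how much do 7 pencils cost?",
    "A car drives 120 km in 2 hours. What is its average speed?",
]

train_examples = []
for q in questions:
    prompt = f"Question: {q}\nLet's think step by step."
    # Keep the whole generated trace, not just the final answer, so the
    # student later learns to reproduce the reasoning itself.
    rationale = teacher(prompt, max_new_tokens=128)[0]["generated_text"]
    train_examples.append({"prompt": q, "target": rationale})

print(train_examples[0])
```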
Socratic CoT:
This method breaks down complex problems into smaller parts. Different small models are trained to handle different steps, making the overall reasoning process more effective.
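A toy sketch of that control flow is below, with plain Python functions standing in for the decomposer and the answerer; in an actual Socratic CoT setup each of these roles would be its own small distilled model.

```python
def decompose(problem: str) -> list[str]:
    # Placeholder decomposer: a trained sub-model would produce these.
    return [
        "How far does the train travel?",
        "How long does the trip take in hours?",
        "What is distance divided by time?",
    ]

def answer(sub_question: str, context: str) -> str:
    # Placeholder answerer: another small model would handle each step.
    return f"[answer to: {sub_question}]"

def socratic_solve(problem: str) -> str:
    # Each sub-answer is appended to the context so later steps can use it.
    context = problem
    for sub_q in decompose(problem):
        context += "\n" + answer(sub_q, context)
    return context

print(socratic_solve("A train travels 90 km in 1.5 hours. What is its speed?"))
```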
Reinforcement Learning with Distillation:
The student model gets feedback from the teacher on how well it is reasoning. This helps refine its decision-making and improve accuracy.
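A heavily simplified sketch of what that feedback can look like: here the reward is just agreement with the teacher's final answer, whereas a real setup would score the whole reasoning trace (often with the teacher itself or a reward model) inside a PPO- or GRPO-style training loop. The "Answer:" format is an assumption made for illustration.

```python
def extract_answer(text: str) -> str:
    # Assumes the final answer follows "Answer:"; purely illustrative.
    return text.rsplit("Answer:", 1)[-1].strip()

def teacher_reward(student_output: str, teacher_output: str) -> float:
    # Reward the student when it lands on the same answer as the teacher.
    same = extract_answer(student_output) == extract_answer(teacher_output)
    return 1.0 if same else 0.0

student_trace = "Step 1: 7 * 3 = 21. Answer: 21"
teacher_trace = "7 pencils at 3 dollars each is 21 dollars. Answer: 21"
print(teacher_reward(student_trace, teacher_trace))  # -> 1.0
```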
Knowledge distillation is a powerful way to make small models smarter and more efficient. By learning from large models, small models can reason well while using fewer resources.