Knowledge distillation is a model compression technique where a small 'student' model is trained to mimic the outputs, intermediate representations, or reasoning patterns of a much larger 'teacher' model. The result is a compact model that captures most of the teacher's capability while being dramatically cheaper to run. In 2026, distillation is the primary technique behind the small language model revolution — Phi-3, Llama 3.2, and DeepSeek's most efficient models are all heavily distilled from larger teachers.
The core idea: soft labels vs. hard labels
Standard supervised training gives a model hard labels: 'this is a cat, label=1.' Distillation uses soft labels — the full probability distribution the teacher assigns to every possible output. These soft distributions contain far more information than a single label: they reveal the teacher's uncertainty, secondary predictions, and the relationships between classes.
The Hinton et al. (2015) distillation loss is a weighted sum of (1) standard cross-entropy against the ground-truth labels and (2) the KL divergence between the teacher's and student's output distributions, both computed at temperature T. A temperature T > 1 softens the distributions, amplifying the information carried in the non-maximum logits; the weight α balances the two objectives.
| Training signal | Information content | Example |
|---|---|---|
| Hard label (standard training) | Binary — right or wrong | "cat" = 1, everything else = 0 |
| Teacher soft label (distillation) | Rich — reveals relationships | "cat" = 0.92, "lynx" = 0.05, "dog" = 0.02, "car" = 0.001 |
| Intermediate features | Richer — match internal representations layer by layer | Student layer 4 output ≈ Teacher layer 12 output (feature distillation) |
| Reasoning traces | Richest — match step-by-step thinking | Student generates same chain-of-thought steps as teacher (chain-of-thought distillation) |
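The Hinton loss described above can be sketched in a few lines of pure Python. This is a minimal, framework-free illustration over a single example; real pipelines compute the same quantities over batches of tensors in PyTorch or JAX:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index, T=4.0, alpha=0.7):
    """Hinton-style loss: alpha * T^2 * KL(teacher || student) at temperature T,
    plus (1 - alpha) * cross-entropy against the hard label at T = 1."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) over the softened distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    # Standard cross-entropy with the one-hot ground truth (T = 1)
    ce = -math.log(softmax(student_logits)[true_index])
    # T^2 rescales the soft-target gradient to match the hard-target magnitude
    return alpha * T**2 * kl + (1 - alpha) * ce
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is a useful sanity check.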
Types of distillation used in modern LLMs
| Type | What is matched | How it works | Used in |
|---|---|---|---|
| Response distillation (black-box) | Final outputs only | Generate teacher outputs; train student on those outputs as supervised data | DeepSeek-R1 distilled models; most SLMs fine-tuned on GPT-4 outputs |
| Logit distillation (white-box) | Full output probability distributions | Requires access to teacher logits (not just text); uses KL divergence loss | Internal lab distillation pipelines; not possible with closed APIs |
| Feature distillation | Intermediate hidden states | Add auxiliary loss: student layer i output ≈ teacher layer j output | TinyBERT; DistilBERT; efficient vision models |
| Chain-of-thought distillation | Reasoning traces / thinking steps | Fine-tune student on teacher's step-by-step reasoning, not just final answers | Key to DeepSeek-R1 distillation; creates small reasoning models |
| Speculative decoding | Functional: student drafts tokens, teacher verifies | Not classic distillation: a small student drafts several tokens cheaply and the large model verifies them in parallel, accepting or rejecting each | GPT-4 inference optimization; Llama production serving |
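Feature distillation from the table can be illustrated as a toy auxiliary loss in pure Python. Everything here is a made-up sketch: the dimensions, the fixed hidden states, and the random stand-in for the learned projection; in a real pipeline the projection is a trainable layer and this loss is added to the student's training objective:

```python
import random

def matvec(W, v):
    """Multiply an (out_dim x in_dim) matrix by an in_dim vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
STUDENT_DIM, TEACHER_DIM = 4, 8  # toy sizes; real hidden dims are in the hundreds or thousands

# Projection lifting student hidden states into the teacher's space
# (random here as a stand-in; trained jointly with the student in practice)
W_proj = [[random.gauss(0.0, 0.1) for _ in range(STUDENT_DIM)]
          for _ in range(TEACHER_DIM)]

student_hidden = [0.3, -0.1, 0.8, 0.2]   # student layer i activation (invented)
teacher_hidden = [0.1] * TEACHER_DIM      # teacher layer j activation (invented)

# Auxiliary loss: push the projected student features toward the teacher's
feature_loss = mse(matvec(W_proj, student_hidden), teacher_hidden)
```

The projection is needed whenever student and teacher hidden sizes differ, which is the common case.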
Response distillation pipeline: the most practical form is to generate teacher outputs, then fine-tune a small student model on them
```python
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import anthropic

# ── Step 1: Generate teacher outputs ─────────────────────────────────────
client = anthropic.Anthropic()

def get_teacher_response(prompt: str) -> str:
    """Get a high-quality response from Claude Sonnet (the teacher)."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Your domain-specific prompts (e.g. customer support, medical Q&A, legal)
student_prompts = [
    "What are the side effects of ibuprofen?",
    "Explain the difference between a debit card and a credit card.",
    # ... thousands more domain prompts
]

distillation_data = []
for prompt in student_prompts:
    response = get_teacher_response(prompt)
    # Format with the student's chat template (Phi-3 ends each turn with <|end|>)
    distillation_data.append({
        "text": f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>"
    })

# ── Step 2: Fine-tune a small student model on teacher outputs ────────────
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
dataset = Dataset.from_list(distillation_data)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # newer trl versions take this via SFTConfig instead
    args=TrainingArguments(
        output_dir="./phi3-distilled-domain",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        # Use LoRA via peft_config for efficiency — see LoRA article
    ),
)
trainer.train()
# Result: a 3.8B model with domain expertise from a 100B+ teacher.
```

Distillation vs. other compression techniques
| Technique | How it works | Compression ratio | Quality loss | Best for |
|---|---|---|---|---|
| Distillation | Train new smaller model to mimic the teacher | 10–100× parameter reduction | Low — the student is specifically trained to be accurate | When you can afford the training compute; best overall quality-per-size |
| Quantization | Reduce parameter bit-width (FP32 → INT8 or INT4) | 4–8× memory reduction; same architecture | Minimal with careful calibration | Deployment on existing models; no retraining needed; fastest to apply |
| Pruning | Remove individual weights or entire layers below a threshold | 2–10× parameter reduction | Moderate — requires fine-tuning after pruning to recover quality | Structured pruning of specific attention heads or FFN layers |
| Architecture search (NAS) | Automatically find the most efficient architecture for a target | Varies widely | Low — model is designed to be efficient from scratch | Large-scale production; resource-intensive to run |
In practice: combine them
Phi-3 Mini was distilled from a large teacher model and then quantized to INT4 for edge deployment. This stack — distillation for quality compression, then quantization for memory compression — is the standard recipe for deploying capable models on consumer hardware in 2026.
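The quantization half of that stack can be illustrated with a minimal symmetric INT4 round-trip. This is a conceptual sketch only; production schemes such as GPTQ and AWQ use per-group scales and calibration data rather than one scale per weight list:

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map each float to an integer in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7  # 7 is the largest positive INT4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT4 codes."""
    return [qi * scale for qi in q]

weights = [0.52, -1.30, 0.03, 0.88]       # invented example weights
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
# Each restored weight lies within scale/2 of the original (rounding error)
```

Storing 4-bit codes plus one scale instead of 32-bit floats is where the roughly 8× memory reduction in the table comes from.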
Practice questions
- What is 'dark knowledge' and why does it improve student model training? (Answer: Dark knowledge: the probability distribution a teacher model assigns to all classes (soft labels). Example: for an image of a cat, the teacher might output [cat: 0.7, tiger: 0.2, dog: 0.1]. The non-cat probabilities encode the teacher's knowledge about similarity relationships — tiger is more cat-like than dog. Training on hard labels [cat: 1, tiger: 0, dog: 0] loses this similarity structure. Soft labels carry much more information per example, enabling the student to learn better representations with less data.)
- What is temperature scaling in knowledge distillation and why is it important? (Answer: Temperature T in soft targets: p_i = exp(z_i/T) / Σ exp(z_j/T). High T flattens the distribution — makes the soft labels even softer, exposing more dark knowledge (tiny probabilities become more visible). Low T sharpens — approaches hard labels. Standard distillation uses T=3–5 for soft targets, T=1 for hard targets. The final loss combines soft loss (at T) + hard loss (at T=1): L = α × T² × KL(teacher_soft, student_soft) + (1-α) × CE(student_logits, hard_labels). T² rescales the soft gradient to match hard gradient magnitude.)
- What is the difference between offline distillation, online distillation, and self-distillation? (Answer: Offline: train teacher fully first, then train student on teacher outputs — classic approach. Teacher is fixed throughout student training. Online (mutual learning): teacher and student train simultaneously, sharing knowledge with each other. No pretrained teacher needed — multiple students teach each other. Self-distillation: a model distils knowledge to itself — deeper layers teach shallower layers, or later training epochs teach earlier epochs. Born-Again Networks: retrain same architecture using soft labels from a trained copy, consistently outperforming the original.)
- What is feature-based distillation vs response-based distillation? (Answer: Response-based: match only final outputs (logits/probabilities). Simplest but loses intermediate representation information. Feature-based (FitNets, CRD): match intermediate layer activations — student's hidden states should match teacher's hidden states at corresponding depths. Requires projection layers if student and teacher have different hidden dimensions. Feature-based distillation transfers more structural knowledge but is more complex to implement. Combined approaches (match both outputs and features) typically perform best.)
- Why is knowledge distillation particularly effective for BERT compression? (Answer: BERT is heavily over-parameterised for most downstream tasks — much of its capacity is not needed. DistilBERT retains 97% of BERT's GLUE performance with 40% fewer parameters and 60% faster inference by distilling from BERT. BERT's multi-head attention naturally produces soft, information-rich targets. Task-specific distillation (after fine-tuning) is more effective than general distillation. Further: TinyBERT distils both intermediate representations and attention matrices — achieving 97% performance with 7.5× compression.)
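As a worked check of the temperature question above, softening a hypothetical set of cat/tiger/dog logits shows how T > 1 exposes the dark knowledge in the non-maximum classes (the logit values are invented for illustration):

```python
import math

def soften(logits, T):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]  # hypothetical teacher logits for cat, tiger, dog
for T in (1, 4):
    # At T=1 the top class dominates (~0.98); at T=4 the mass spreads
    # (~0.60 / 0.22 / 0.17), making the similarity structure visible
    print(T, [round(p, 3) for p in soften(logits, T)])
```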