Knowledge distillation is a model compression technique where a small 'student' model is trained to mimic the outputs, intermediate representations, or reasoning patterns of a much larger 'teacher' model. The result is a compact model that captures most of the teacher's capability while being dramatically cheaper to run. In 2026, distillation is the primary technique behind the small language model revolution — Phi-3, Llama 3.2, and DeepSeek's most efficient models are all heavily distilled from larger teachers.
The core idea: soft labels vs. hard labels
Standard supervised training gives a model hard labels: 'this is a cat, label=1.' Distillation uses soft labels — the full probability distribution the teacher assigns to every possible output. These soft distributions contain far more information than a single label: they reveal the teacher's uncertainty, secondary predictions, and the relationships between classes.
The Hinton et al. (2015) distillation loss is a weighted sum of (1) standard cross-entropy against the ground-truth labels and (2) the KL divergence between the teacher's and student's output distributions, both computed at temperature T. A temperature T > 1 softens the distributions, amplifying the information carried in the non-maximum logits; the weight α balances the two objectives.
| Training signal | Information content | Example |
|---|---|---|
| Hard label (standard training) | Binary — right or wrong | "cat" = 1, everything else = 0 |
| Teacher soft label (distillation) | Rich — reveals relationships | "cat" = 0.92, "lynx" = 0.05, "dog" = 0.02, "car" = 0.001 |
| Intermediate features | Richer — match internal representations layer by layer | Student layer 4 output ≈ Teacher layer 12 output (feature distillation) |
| Reasoning traces | Richest — match step-by-step thinking | Student generates same chain-of-thought steps as teacher (chain-of-thought distillation) |
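The Hinton loss described above can be sketched in a few lines of pure Python. This is a minimal, framework-free illustration over a single example; real pipelines compute the same quantities over batches of tensors in PyTorch or JAX:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index, T=4.0, alpha=0.7):
    """Hinton-style loss: alpha * T^2 * KL(teacher || student) at temperature T,
    plus (1 - alpha) * cross-entropy against the hard label at T = 1."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) over the softened distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    # Standard cross-entropy with the one-hot ground truth (T = 1)
    ce = -math.log(softmax(student_logits)[true_index])
    # T^2 rescales the soft-target gradient to match the hard-target magnitude
    return alpha * T**2 * kl + (1 - alpha) * ce
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is a useful sanity check.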
Types of distillation used in modern LLMs
| Type | What is matched | How it works | Used in |
|---|---|---|---|
| Response distillation (black-box) | Final outputs only | Generate teacher outputs; train student on those outputs as supervised data | DeepSeek-R1 distilled models; most SLMs fine-tuned on GPT-4 outputs |
| Logit distillation (white-box) | Full output probability distributions | Requires access to teacher logits (not just text); uses KL divergence loss | Internal lab distillation pipelines; not possible with closed APIs |
| Feature distillation | Intermediate hidden states | Add auxiliary loss: student layer i output ≈ teacher layer j output | TinyBERT; DistilBERT; efficient vision models |
| Chain-of-thought distillation | Reasoning traces / thinking steps | Fine-tune student on teacher's step-by-step reasoning, not just final answers | Key to DeepSeek-R1 distillation; creates small reasoning models |
| Speculative decoding | Functional: student drafts tokens, teacher verifies | Not classic distillation: a small student drafts several tokens cheaply and the large model verifies them in parallel, accepting or rejecting each | GPT-4 inference optimization; Llama production serving |
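Feature distillation from the table can be illustrated as a toy auxiliary loss in pure Python. Everything here is a made-up sketch: the dimensions, the fixed hidden states, and the random stand-in for the learned projection; in a real pipeline the projection is a trainable layer and this loss is added to the student's training objective:

```python
import random

def matvec(W, v):
    """Multiply an (out_dim x in_dim) matrix by an in_dim vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
STUDENT_DIM, TEACHER_DIM = 4, 8  # toy sizes; real hidden dims are in the hundreds or thousands

# Projection lifting student hidden states into the teacher's space
# (random here as a stand-in; trained jointly with the student in practice)
W_proj = [[random.gauss(0.0, 0.1) for _ in range(STUDENT_DIM)]
          for _ in range(TEACHER_DIM)]

student_hidden = [0.3, -0.1, 0.8, 0.2]   # student layer i activation (invented)
teacher_hidden = [0.1] * TEACHER_DIM      # teacher layer j activation (invented)

# Auxiliary loss: push the projected student features toward the teacher's
feature_loss = mse(matvec(W_proj, student_hidden), teacher_hidden)
```

The projection is needed whenever student and teacher hidden sizes differ, which is the common case.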
Response distillation pipeline: the most practical form is to generate teacher outputs, then fine-tune a small student model on them
```python
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import anthropic

# ── Step 1: Generate teacher outputs ─────────────────────────────────────
client = anthropic.Anthropic()

def get_teacher_response(prompt: str) -> str:
    """Get a high-quality response from Claude Sonnet (the teacher)."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Your domain-specific prompts (e.g. customer support, medical Q&A, legal)
student_prompts = [
    "What are the side effects of ibuprofen?",
    "Explain the difference between a debit card and a credit card.",
    # ... thousands more domain prompts
]

distillation_data = []
for prompt in student_prompts:
    response = get_teacher_response(prompt)
    # Format with the student's chat template (Phi-3 ends each turn with <|end|>)
    distillation_data.append({
        "text": f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>"
    })

# ── Step 2: Fine-tune a small student model on teacher outputs ────────────
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
dataset = Dataset.from_list(distillation_data)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # newer trl versions take this via SFTConfig instead
    args=TrainingArguments(
        output_dir="./phi3-distilled-domain",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        # Use LoRA via peft_config for efficiency — see LoRA article
    ),
)
trainer.train()
# Result: a 3.8B model with domain expertise from a 100B+ teacher.
```

Distillation vs. other compression techniques
| Technique | How it works | Compression ratio | Quality loss | Best for |
|---|---|---|---|---|
| Distillation | Train new smaller model to mimic the teacher | 10–100× parameter reduction | Low — the student is specifically trained to be accurate | When you can afford the training compute; best overall quality-per-size |
| Quantization | Reduce parameter bit-width (FP32 → INT8 or INT4) | 4–8× memory reduction; same architecture | Minimal with careful calibration | Deployment on existing models; no retraining needed; fastest to apply |
| Pruning | Remove individual weights or entire layers below a threshold | 2–10× parameter reduction | Moderate — requires fine-tuning after pruning to recover quality | Structured pruning of specific attention heads or FFN layers |
| Architecture search (NAS) | Automatically find the most efficient architecture for a target | Varies widely | Low — model is designed to be efficient from scratch | Large-scale production; resource-intensive to run |
In practice: combine them
Phi-3 Mini was distilled from a large teacher model and then quantized to INT4 for edge deployment. This stack — distillation for quality compression, then quantization for memory compression — is the standard recipe for deploying capable models on consumer hardware in 2026.
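The quantization half of that stack can be illustrated with a minimal symmetric INT4 round-trip. This is a conceptual sketch only; production schemes such as GPTQ and AWQ use per-group scales and calibration data rather than one scale per weight list:

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map each float to an integer in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7  # 7 is the largest positive INT4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT4 codes."""
    return [qi * scale for qi in q]

weights = [0.52, -1.30, 0.03, 0.88]       # invented example weights
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
# Each restored weight lies within scale/2 of the original (rounding error)
```

Storing 4-bit codes plus one scale instead of 32-bit floats is where the roughly 8× memory reduction in the table comes from.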
Practice questions
- What is 'dark knowledge' and why does it improve student model training? (Answer: Dark knowledge: the probability distribution a teacher model assigns to all classes (soft labels). Example: for an image of a cat, the teacher might output [cat: 0.7, tiger: 0.2, dog: 0.1]. The non-cat probabilities encode the teacher's knowledge about similarity relationships — tiger is more cat-like than dog. Training on hard labels [cat: 1, tiger: 0, dog: 0] loses this similarity structure. Soft labels carry much more information per example, enabling the student to learn better representations with less data.)
- What is temperature scaling in knowledge distillation and why is it important? (Answer: Temperature T in soft targets: p_i = exp(z_i/T) / Σ exp(z_j/T). High T flattens the distribution — makes the soft labels even softer, exposing more dark knowledge (tiny probabilities become more visible). Low T sharpens — approaches hard labels. Standard distillation uses T=3–5 for soft targets, T=1 for hard targets. The final loss combines soft loss (at T) + hard loss (at T=1): L = α × T² × KL(teacher_soft, student_soft) + (1-α) × CE(student_logits, hard_labels). T² rescales the soft gradient to match hard gradient magnitude.)
- What is the difference between offline distillation, online distillation, and self-distillation? (Answer: Offline: train teacher fully first, then train student on teacher outputs — classic approach. Teacher is fixed throughout student training. Online (mutual learning): teacher and student train simultaneously, sharing knowledge with each other. No pretrained teacher needed — multiple students teach each other. Self-distillation: a model distils knowledge to itself — deeper layers teach shallower layers, or later training epochs teach earlier epochs. Born-Again Networks: retrain same architecture using soft labels from a trained copy, consistently outperforming the original.)
- What is feature-based distillation vs response-based distillation? (Answer: Response-based: match only final outputs (logits/probabilities). Simplest but loses intermediate representation information. Feature-based (FitNets, CRD): match intermediate layer activations — student's hidden states should match teacher's hidden states at corresponding depths. Requires projection layers if student and teacher have different hidden dimensions. Feature-based distillation transfers more structural knowledge but is more complex to implement. Combined approaches (match both outputs and features) typically perform best.)
- Why is knowledge distillation particularly effective for BERT compression? (Answer: BERT is heavily over-parameterised for most downstream tasks — much of its capacity is not needed. DistilBERT retains 97% of BERT's GLUE performance with 40% fewer parameters and 60% faster inference by distilling from BERT. BERT's multi-head attention naturally produces soft, information-rich targets. Task-specific distillation (after fine-tuning) is more effective than general distillation. Further: TinyBERT distils both intermediate representations and attention matrices — achieving 97% performance with 7.5× compression.)
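As a worked check of the temperature question above, softening a hypothetical set of cat/tiger/dog logits shows how T > 1 exposes the dark knowledge in the non-maximum classes (the logit values are invented for illustration):

```python
import math

def soften(logits, T):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]  # hypothetical teacher logits for cat, tiger, dog
for T in (1, 4):
    # At T=1 the top class dominates (~0.98); at T=4 the mass spreads
    # (~0.60 / 0.22 / 0.17), making the similarity structure visible
    print(T, [round(p, 3) for p in soften(logits, T)])
```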