Transfer learning is the technique of taking a model trained on one task or dataset and reusing it — or adapting it — for a different but related task. Instead of training from scratch, you reuse representations learned on a large, data-rich source task and transfer that knowledge to problems with limited data. Transfer learning is the foundation of modern NLP and computer vision.
Why transfer learning works
Neural networks learn features in a hierarchy — lower layers capture generic, universally useful patterns; upper layers capture task-specific abstractions. This hierarchy means lower layers almost always transfer:
| Model type | Early layers learn | Middle layers learn | Late layers learn |
|---|---|---|---|
| CNN (vision) | Gabor-like edges, colour blobs | Textures, corners, curves | Object parts, semantic concepts (face, wheel) |
| LLM (text) | Token co-occurrence, positional patterns | Syntax, grammar, POS structure | Semantics, world knowledge, reasoning |
| Speech model | Mel-frequency features, phoneme boundaries | Phonemes, prosody | Word identity, speaker style |
The "frozen layers" principle
A ResNet-50 pretrained on ImageNet's 1.2M photos has learned visual features useful for medical imaging, satellite imagery, and artwork — domains with completely different content. Because these early features are universal, freezing them and only training a new head is often enough for high performance on small datasets.
Feature extraction vs fine-tuning
| Strategy | What trains | When to use | Risk |
|---|---|---|---|
| Feature extraction (frozen) | New head only | Very small dataset (<1K examples), source ≈ target domain | Underfit if domains differ significantly |
| Partial fine-tuning | Last 1–2 blocks + head | Medium dataset, moderate domain shift | Mild catastrophic forgetting |
| Full fine-tuning | All layers with small LR | Large dataset (10K+), significant domain shift | Catastrophic forgetting without regularization |
| LoRA fine-tuning | Low-rank adapters only (0.1–1% params) | LLMs where GPU memory is a constraint | Slightly lower ceiling than full FT |
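The LoRA row above can be sketched from scratch: freeze a pretrained linear layer and add a trainable low-rank update. This is a minimal illustration of the idea, not the `peft` library's implementation; the `r` and `alpha` values are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable / total:.1%} of parameters train")  # roughly 2% for one layer; under 1% model-wide
```

Because `B` starts at zero, the adapted layer is initially identical to the pretrained one, so training starts from the pretrained model's behaviour rather than from noise.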
Choosing the right strategy based on dataset size
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# ── Strategy A: Feature extraction (< 500 labelled examples) ──────────────
for param in model.bert.parameters():
    param.requires_grad = False  # freeze ALL BERT weights
# Only model.classifier trains: fast, no backprop through the large encoder

# ── Strategy B: Partial fine-tuning (500–5K examples) ─────────────────────
for param in model.bert.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True  # unfreeze last 2 transformer blocks
# Classifier + last 2 blocks train

# ── Strategy C: Full fine-tuning (5K+ examples) ───────────────────────────
for param in model.parameters():
    param.requires_grad = True
# Use discriminative learning rates: lower LR for early layers
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.layer[:6].parameters(), "lr": 2e-5},
    {"params": model.bert.encoder.layer[6:].parameters(), "lr": 3e-5},
    {"params": model.classifier.parameters(), "lr": 5e-5},
])
```

Pretrain-then-fine-tune in NLP
Transfer learning transformed NLP in 2018 and remains the dominant paradigm. The timeline shows how each step built on the last:
| Year | Model | Contribution | Impact |
|---|---|---|---|
| 2018 | ELMo | Context-dependent word embeddings from a bidirectional LSTM language model | First strong contextual representations — replaced GloVe |
| 2018 | ULMFiT | LM pretraining + discriminative fine-tuning + gradual unfreezing | Proved pretrain→fine-tune works for NLP tasks |
| 2018 | BERT | Masked LM + bidirectional Transformer pretraining | SOTA on 11 NLP benchmarks with tiny fine-tuning data |
| 2020 | GPT-3 | 175B params, few-shot in-context learning without weight updates | Showed scale alone produces task generalization |
| 2022+ | ChatGPT / LLaMA | RLHF alignment on top of pretrained foundation models | Fine-tuned assistants outperform task-specific models |
Domain adaptation
When source (pretraining) and target domains differ significantly, standard fine-tuning leaves performance on the table. Domain adaptation bridges the gap:
| Technique | What it does | Data needed | Best for |
|---|---|---|---|
| Domain-adaptive pretraining (DAPT) | Continue MLM/LM pretraining on unlabelled domain text before fine-tuning | Large unlabelled domain corpus | Biomedical, legal, code — domains with distinct vocabulary |
| Task-adaptive pretraining (TAPT) | Continue pretraining on unlabelled task data specifically | Unlabelled examples of your task | When task data is plentiful but labels are scarce |
| LoRA domain adapters | Train low-rank adapters on domain text | Any size domain corpus | LLMs where full pretraining is too expensive |
| Mixture of domain data | Include domain data in final fine-tuning mix | Domain + general data | Prevents forgetting general capabilities while adapting |
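DAPT from the table above amounts to resuming the masked-LM objective on unlabelled domain text. A minimal sketch with Hugging Face `transformers` (a tiny random-weight config keeps it offline; in practice you would load the pretrained checkpoint with `BertForMaskedLM.from_pretrained("bert-base-uncased")` and tokenize real domain text):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny config so the sketch runs without downloading weights.
config = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertForMaskedLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One MLM step on a batch of (unlabelled) domain token ids: mask ~15% of
# positions and predict the originals. Real code would use the domain
# tokenizer and DataCollatorForLanguageModeling to build these tensors.
input_ids = torch.randint(0, config.vocab_size, (4, 32))
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15
input_ids[mask] = 103                  # [MASK] token id in BERT's vocab
labels[~mask] = -100                   # only masked positions contribute to the loss

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
```

Repeating this loop over a domain corpus before supervised fine-tuning is all DAPT is; the fine-tuning step afterwards is unchanged.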
Domain pretraining evidence
PubMedBERT (pretrained entirely on 14M biomedical abstracts, never on web text) outperforms BioBERT (general BERT continued-pretrained on biomedical text) on 6 of 7 biomedical NLP benchmarks — strong evidence that from-scratch domain pretraining can beat adaptation when the domain corpus is large enough. For most practical cases, though, DAPT (continuing pretraining on domain text) captures most of the benefit at a small fraction of the compute.
Zero-shot and few-shot transfer
The most powerful manifestation of transfer learning is performing entirely new tasks without any task-specific training at all:
| Transfer type | Training needed | How it works | Performance vs fine-tuning |
|---|---|---|---|
| Zero-shot | None — inference only | Model infers task from description in prompt | 70–85% of fine-tuned for frontier models |
| Few-shot (ICL) | None — inference only | 2–10 examples in context window at inference time | 80–90% of fine-tuned — scales with model size |
| Few-shot fine-tuning | 10–100 labelled examples, brief fine-tune | Weight updates from tiny labelled set | 90–95% of full fine-tuning quality |
| Full fine-tuning | 1K–1M labelled examples | Standard gradient descent on task data | 100% baseline |
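Few-shot in-context learning from the table above involves no training code at all: the "transfer" is just assembling labelled demonstrations into the prompt. A minimal sketch (the `build_few_shot_prompt` helper and the sentiment examples are illustrative, not any library's API):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot classification prompt: task description,
    k labelled demonstrations, then the unlabelled query."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved every minute of it.", "positive"),
     ("Dull and far too long.", "negative")],
    "A surprisingly sharp script.",
)
print(prompt)  # send this string to the model; its completion is the label
```

The model's weights never change; the demonstrations condition its next-token predictions, which is why few-shot quality scales with model size rather than with gradient steps.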
When zero-shot is enough
For frontier models (GPT-4, Claude 3.5, Gemini 1.5 Pro), zero-shot is competitive for classification, summarization, extraction, translation, and code generation. Fine-tuning still wins for highly specialized domains (medical, legal), strict output-format requirements, latency-sensitive applications (a smaller fine-tuned model can beat a large zero-shot model on speed and cost), and tasks that require the model to know proprietary information.
Practice questions
- What is the feature extraction approach vs fine-tuning approach in transfer learning? (Answer: Feature extraction: freeze the pretrained model entirely, use it only as a feature extractor (extract embeddings), train a small classifier head on top. Fast, no GPU needed for backprop through large model, prevents forgetting. Best when: small dataset, similar domain to pretraining. Fine-tuning: continue training some or all layers of the pretrained model on the new task. Slower but higher accuracy. Best when: large enough dataset to avoid overfitting, different domain from pretraining. Gradual unfreezing (start with head, then unfreeze layers from top) is a common compromise.)
- What is negative transfer and when does it occur? (Answer: Negative transfer: pretraining on source domain hurts performance on target domain compared to training from scratch. Occurs when source and target domains are fundamentally incompatible — the features learned on source data are actively misleading for the target. Example: pretraining on natural images then fine-tuning on X-ray pathology — image features (textures, colours) from natural images don't correspond to radiological features. Mitigation: use pretrained models only as initialisation, aggressive fine-tuning, or train from scratch if domains are truly incompatible.)
- What is domain adaptation and how does it differ from fine-tuning? (Answer: Fine-tuning: you have labelled data in the target domain — standard supervised training. Domain adaptation: target domain has little or no labels, but you have unlabelled target domain data. Techniques: adversarial domain adaptation (train a domain discriminator that can't distinguish source from target features — forces domain-invariant representations), self-training on target data (pseudo-labels from confident predictions), CORAL (align feature covariances across domains). Relevant when: collecting target domain labels is expensive (medical imaging, legal text).)
- How does CLIP enable zero-shot classification without any fine-tuning? (Answer: CLIP trains image and text encoders jointly on 400M image-text pairs so image embeddings and text embeddings are in the same space. Zero-shot classification: for each class, create a text prompt ('a photo of a dog') and compute its embedding. For a query image, compute its embedding and find the text prompt with highest cosine similarity. No task-specific training needed — class names ARE the classifier. The shared embedding space is the transfer: text descriptions of novel classes immediately work for image classification.)
- What is the transformer's contribution to making transfer learning universal across modalities? (Answer: Transformers with self-attention operate on sequences of tokens regardless of modality — the same architecture processes text tokens, image patches, audio frames, or video chunks. This universality enables: training one model on multiple modalities (Gemini, GPT-4o), pretraining on one modality and fine-tuning on another (CLIP image encoder fine-tuned for medical images), and transfer across domains without architecture changes. Pre-transformer, CNNs could not be directly transferred to sequence data; RNNs could not handle images. Transformers made cross-modal transfer natural.)
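The CLIP mechanism from the practice questions reduces to cosine similarity in a shared embedding space. A numpy sketch (random vectors stand in for real encoder outputs; with actual CLIP you would embed the image and the prompts "a photo of a dog", etc. with its two encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """CLIP-style zero-shot: pick the class whose text-prompt embedding
    has the highest cosine similarity with the image embedding."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = normalize(prompt_embs) @ normalize(image_emb)
    return class_names[int(np.argmax(sims))]

rng = np.random.default_rng(0)
classes = ["dog", "cat", "car"]
prompt_embs = rng.normal(size=(3, 512))                  # stand-ins for text embeddings
image_emb = prompt_embs[0] + 0.1 * rng.normal(size=512)  # "looks like" the dog prompt
print(zero_shot_classify(image_emb, prompt_embs, classes))  # → dog
```

Nothing here is trained: the class names are the classifier, so adding a new class is just adding a new prompt row.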