Instruction tuning (also called instruction fine-tuning or IFT) is a supervised fine-tuning stage where a pretrained language model is trained on a dataset of (instruction, response) pairs — teaching it to follow natural language directives rather than simply completing text. Raw pretrained models (like the base GPT or LLaMA weights) predict the next token in any context; instruction-tuned models are trained to produce helpful, accurate responses to specific requests. Instruction tuning is the step that transforms a pretrained base model into a usable assistant.
The pretraining → instruction tuning pipeline
| Stage | Training data | Objective | Output |
|---|---|---|---|
| Pretraining | Trillions of tokens from web, books, code | Predict next token (self-supervised) | Base model: powerful text predictor, unusable as assistant |
| Supervised fine-tuning (SFT / instruction tuning) | 10K–1M (instruction, response) pairs | Cross-entropy on target responses | Instruction-following model: follows instructions but may be sycophantic |
| RLHF / DPO alignment | Human preference pairs (chosen vs rejected responses) | Reward maximisation or preference optimisation | Aligned assistant: helpful, honest, avoids harm |
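The SFT objective in the table above — cross-entropy on target responses — is typically implemented by masking the prompt tokens out of the loss, so the model is only penalised for its predictions on the response. A minimal PyTorch sketch for a single (prompt, response) sequence; the toy tensor shapes and the `prompt_len` argument are illustrative, not part of any library API:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on the response tokens only.

    logits: (seq_len, vocab_size) model outputs for one example
    input_ids: (seq_len,) token ids for prompt + response
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift by one: the token at position t is predicted from logits at t-1
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Mask the prompt so only response tokens contribute to the loss
    shift_labels[: prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy example: 4 prompt tokens, 3 response tokens, vocabulary of 10
logits = torch.randn(7, 10)
input_ids = torch.randint(0, 10, (7,))
loss = sft_loss(logits, input_ids, prompt_len=4)
```

Framework trainers (including TRL's `SFTTrainer`) handle this masking and shifting internally; the sketch only makes the objective explicit.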
The instruction tuning dataset is the key variable. Early instruction tuning used manually written datasets (InstructGPT's 13,000 human-written examples). In 2026, state-of-the-art instruction datasets are AI-generated: a strong teacher model (GPT-5.4 or Claude Sonnet 4.6) produces responses to diverse instructions; these are filtered for quality and used as training data for the student model. This distillation (or 'self-instruct') approach allows generating millions of high-quality (instruction, response) pairs at scale.
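The generate-then-filter loop described above can be sketched in a few lines. Here `teacher_generate` and `judge_score` are hypothetical stand-ins for calls to a teacher model and an LLM-as-judge quality filter; the scoring heuristic is purely illustrative:

```python
def teacher_generate(instruction: str) -> str:
    # Placeholder: in practice, call the teacher model's API here.
    return f"Response to: {instruction}"

def judge_score(instruction: str, response: str) -> float:
    # Placeholder: in practice, ask a judge model for a 0-10 quality rating.
    return min(10.0, len(response) / 10)

def build_sft_dataset(instructions, min_score=7.0):
    """Generate responses with the teacher, keep only high-scoring pairs."""
    dataset = []
    for instruction in instructions:
        response = teacher_generate(instruction)
        if judge_score(instruction, response) >= min_score:
            dataset.append({"prompt": instruction, "completion": response})
    return dataset
```

Real pipelines add deduplication, topic balancing, and decontamination against evaluation benchmarks, but the core structure — generate, judge, keep the top slice — is the same.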
Instruction tuning a small model with HuggingFace TRL (Transformer Reinforcement Learning)

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

# Load the base model (not instruction-tuned)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Load an instruction dataset; each example holds a "messages" list of chat turns
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# SFTTrainer applies the chat template and formats instruction pairs automatically
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-3.2-1b-instruct",
        max_seq_length=2048,
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        # Packing concatenates short examples into full-length sequences
        packing=True,
    ),
)
trainer.train()
trainer.save_model("./llama-3.2-1b-instruct")
```

Key instruction tuning datasets in 2026
| Dataset | Size | Source | Licence |
|---|---|---|---|
| OpenHermes 2.5 | 1M examples | GPT-4-generated across diverse tasks | CC-BY-4.0 |
| UltraChat 200K | 200K multi-turn | GPT-3.5-Turbo synthesised conversations | CC-BY-4.0 |
| Dolly 15K (Databricks) | 15K examples | Human-written by Databricks employees | CC-BY-SA-3.0 |
| ShareGPT (GPT-4) | ~90K conversations | User-shared ChatGPT conversations | Various |
| Flan collection | 15M+ examples | Tasks from academic NLP benchmarks | Apache 2.0 |
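These datasets store conversations in different schemas (UltraChat, for instance, uses a "messages" list of role/content dicts). Before training, each conversation must be rendered into a single text sequence via a chat template — in practice via `tokenizer.apply_chat_template`. A minimal hand-rolled sketch, assuming a ChatML-style template (one of several conventions in use):

```python
def format_chatml(messages):
    """Render a messages list into ChatML-style training text.

    messages: list of {"role": ..., "content": ...} dicts, as found in
    datasets like UltraChat 200K.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts) + "\n"

example = [
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Supervised fine-tuning on instruction-response pairs."},
]
text = format_chatml(example)
```

The special tokens delimiting each turn are what let the trained model distinguish instructions from responses at inference time, so the same template must be used for training and deployment.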
Quality beats quantity
Instruction tuning research consistently shows that dataset quality matters more than size. A 10,000-example dataset of carefully written, diverse, and correct instruction-response pairs can outperform a 1,000,000-example dataset with noisy or low-quality responses. The LIMA paper (Zhou et al., 2023) demonstrated that 1,000 carefully curated examples could produce surprisingly competitive instruction-following behaviour — establishing the 'less is more' principle for SFT data curation.
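In practice, 'less is more' curation starts with cheap mechanical filters before any human or LLM review. A sketch of such a first pass — the thresholds and heuristics here are illustrative, not LIMA's actual procedure:

```python
def curate(pairs, min_response_chars=50, max_pairs=1000):
    """First-pass curation sketch: dedupe prompts, drop trivial
    responses, and keep a small subset. Thresholds are illustrative."""
    seen_prompts = set()
    curated = []
    # Prefer longer, more detailed responses first
    for pair in sorted(pairs, key=lambda p: len(p["completion"]), reverse=True):
        prompt = pair["prompt"].strip().lower()
        if prompt in seen_prompts:
            continue  # exact-duplicate prompt
        if len(pair["completion"]) < min_response_chars:
            continue  # trivial or empty response
        seen_prompts.add(prompt)
        curated.append(pair)
        if len(curated) == max_pairs:
            break
    return curated
```

Serious pipelines go further — near-duplicate detection, diversity sampling across topics, and LLM-as-judge scoring — but even crude filters like these remove a surprising fraction of low-value examples.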
Practice questions
- What distinguishes instruction tuning from standard supervised fine-tuning (SFT)? (Answer: Standard SFT: train on (input, output) pairs for ONE specific task. The model learns that specific task but cannot generalise to new instructions. Instruction tuning: train on thousands of diverse tasks described in natural language instructions. The model learns to follow novel instructions — generalising the instruction-following capability itself. InstructGPT: trained on diverse user-submitted prompts. FLAN: trained on 62 NLP datasets grouped into task clusters. Result: can follow instructions for tasks never seen during fine-tuning.)
- What is multitask instruction tuning (FLAN) and how does it differ from single-task fine-tuning? (Answer: FLAN (Fine-tuned Language Net, Google 2021): fine-tune a model on 62+ diverse NLP tasks simultaneously, each described with multiple natural language instruction templates. The model learns a meta-skill: interpreting and following instructions for any task. Single-task fine-tuning: the model only improves on that specific task. FLAN achieves zero-shot generalisation to unseen tasks — standard fine-tuning cannot.)
- What is the instruction tuning data quality problem and how do modern approaches address it? (Answer: Early instruction tuning used human-written (instruction, response) pairs — slow, expensive, limited diversity. Modern approach: LLM-generated synthetic data. GPT-4 generates high-quality responses to diverse instructions; quality filters (another LLM as judge) select the best 5–10%. Alpaca used 52K GPT-generated pairs. OpenHermes used 1M filtered pairs. The key insight: small amounts of very high-quality instruction data beat large amounts of mediocre data.)
- Why does instruction tuning require human feedback (RLHF) for the best results rather than instruction data alone? (Answer: Instruction tuning trains the model to produce responses that match the training distribution. But humans often want properties hard to capture in demonstration data: genuine helpfulness (not sycophancy), accurate uncertainty, appropriate refusals, honest disagreement. RLHF trains the model on what humans actually prefer — not what demonstration data looks like. InstructGPT showed RLHF significantly improved human ratings even when SFT demonstrations were high-quality.)
- What is the difference between task-specific instruction tuning and general instruction tuning? (Answer: Task-specific: fine-tune on (instruction, output) pairs for one domain (e.g., SQL generation, medical QA). Produces an expert specialist. General instruction tuning: diverse instructions spanning many tasks and domains. Produces a generalist instruction follower. Modern products use both: start with general instruction tuning (FLAN/InstructGPT-style), then optionally domain-specific fine-tuning for specialised deployments. General instruction tuning first prevents catastrophic forgetting when doing task-specific tuning.)