
Fine-tuning

How AI models are specialised for specific tasks.


Definition

Fine-tuning is the process of taking a pretrained model and continuing to train it on a smaller, task-specific dataset. It adjusts the model's parameters to improve performance on a specific task, domain, or style — building on the general knowledge already learned during pretraining rather than starting from scratch.

Pre-training vs fine-tuning

| Stage | Data | Cost | Goal | Who does it |
|---|---|---|---|---|
| Pretraining | Trillions of tokens of internet text, code, books | $1M–$100M+ in compute | Learn general world knowledge + language | AI labs (OpenAI, Meta, Anthropic, Google) |
| Instruction fine-tuning (SFT) | Thousands–millions of (instruction, response) pairs | $10–$10,000 on cloud GPUs | Teach model to follow instructions helpfully | Labs + companies building on top of base models |
| Alignment fine-tuning (RLHF/DPO) | Human or AI preference pairs | $1,000–$100,000 | Make model safe, helpful, honest | Primarily AI labs |
| Domain fine-tuning | Domain-specific documents + Q&A | $50–$5,000 | Specialize model for a vertical (medical, legal, code) | Companies, researchers, developers |

The LIMA insight

The LIMA paper (2023) demonstrated that just 1,000 carefully curated, high-quality instruction examples produced a model competitive with models trained on 52,000 pairs. Quality matters far more than quantity in SFT data — a finding that reshaped how fine-tuning datasets are built.

Instruction fine-tuning (SFT)

Supervised Fine-Tuning on instruction-following data transforms a base LLM (which just predicts the next token) into an assistant that follows instructions. The data format is simply (instruction, response) pairs:

SFT data format and training with HuggingFace TRL

from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# SFT training data: (instruction, response) pairs
data = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
            {"role": "assistant", "content": "The article discusses..."}
        ]
    },
    # ... thousands more high-quality examples
]
dataset = Dataset.from_list(data)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # pass the tokenizer explicitly
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch = 16
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        bf16=True,                        # bfloat16 — faster, same quality as fp16
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
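Internally, the trainer serialises each messages list into a single training string via the tokenizer's chat template. A toy rendering in that spirit (the delimiters below are hypothetical; Llama 3's real template uses special tokens such as <|start_header_id|>):

```python
def render_chat(messages):
    """Flatten (role, content) messages into one training string.

    Hypothetical template for illustration only; real chat templates
    use model-specific special tokens, not these made-up markers.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|end|>"

example = [
    {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
    {"role": "assistant", "content": "The article discusses..."},
]
print(render_chat(example))
```

During SFT the loss is typically computed only on the assistant tokens, so the model learns to produce responses rather than to echo instructions.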

What makes good SFT data

(1) Diversity — cover the full range of tasks the model should handle. (2) Quality over quantity — one human-expert response beats 100 AI-generated ones. (3) Correct format — responses should model ideal assistant behaviour (helpful, clear, appropriately concise). (4) No contamination — test set benchmarks must not appear in training data.
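Point (4) is the easiest to automate. A minimal contamination check based on word n-gram overlap might look like this (the n=8 window and 0.5 threshold are illustrative starting points, not standards):

```python
def ngrams(text, n=8):
    """Set of word n-grams for a string."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_example, test_set, n=8, threshold=0.5):
    """Flag a training example whose n-gram overlap with any
    benchmark item exceeds the threshold. Sketch only: real
    decontamination pipelines also normalise punctuation and
    check fuzzy matches."""
    train_grams = ngrams(train_example, n)
    if not train_grams:
        return False
    return any(
        len(train_grams & ngrams(item, n)) / len(train_grams) > threshold
        for item in test_set
    )

test_set = ["What is the capital of France? The capital of France is Paris."]
leaked = "What is the capital of France? The capital of France is Paris."
clean = "Explain how photosynthesis converts light energy into chemical energy in plants."
```

Here is_contaminated(leaked, test_set) is True and is_contaminated(clean, test_set) is False, so the leaked example would be dropped before training.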

Catastrophic forgetting

Fine-tuning improves task performance but can erase general capabilities — the model 'forgets' what it knew during pretraining. This happens because gradient descent for the new task increases loss on the original data distribution:

| Mitigation | How it works | Cost | Effectiveness |
|---|---|---|---|
| LoRA / PEFT | Only update small adapters; ~99% of weights frozen | Low | ⭐⭐⭐⭐ — best default choice |
| Replay | Mix original pretraining data into fine-tuning batches | Medium (need original data) | ⭐⭐⭐⭐ |
| EWC (Elastic Weight Consolidation) | Penalize changes to weights important for old tasks (via Fisher info) | Medium | ⭐⭐⭐ |
| Low learning rate | Fine-tune with LR 10–100× smaller than pretraining | None | ⭐⭐ |
| Short fine-tuning | Stop early, before forgetting accumulates | None | ⭐⭐ |
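The replay row is simple to implement: build each epoch so that some share of examples comes from the original distribution. A sketch (the 25% default is an illustrative assumption; ratios of roughly 10–30% are common starting points):

```python
import random

def mix_replay(task_data, pretrain_data, replay_frac=0.25, seed=0):
    """Build one fine-tuning epoch in which a replay_frac share of
    examples is sampled from the original (pretraining-style) data,
    the rest from the new task. Illustrative sketch, not a recipe."""
    rng = random.Random(seed)
    # Number of replay examples so they form replay_frac of the mix
    n_replay = round(len(task_data) * replay_frac / (1 - replay_frac))
    replay = [rng.choice(pretrain_data) for _ in range(n_replay)]
    mixed = list(task_data) + replay
    rng.shuffle(mixed)
    return mixed

task = [("sql", i) for i in range(30)]        # narrow fine-tuning set
general = [("web", i) for i in range(1000)]   # original-distribution pool
epoch = mix_replay(task, general)             # 30 task + 10 replay examples
```

Because the replay examples keep exerting gradient pressure toward the original distribution, the model retains general capabilities while learning the new task.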

The alignment tax

RLHF and SFT can reduce raw capability on knowledge benchmarks (MMLU, HumanEval) even as they improve helpfulness and safety. This is the "alignment tax" — models become more pleasant to talk to but may perform worse on raw capability evals. Careful data curation and LoRA help minimise it.

Full fine-tuning vs PEFT

| Method | Params updated | GPU memory (7B model) | Relative quality | Use case |
|---|---|---|---|---|
| Full fine-tuning | 100% (7B) | ~80GB (FP16 + optimizer states) | 100% | When you have an A100/H100 cluster and a large dataset |
| LoRA (r=8) | ~0.1% (7M) | ~16GB | 95–98% | Standard choice — great quality/cost ratio |
| QLoRA (4-bit + LoRA) | ~0.1% | ~6GB | 92–95% | Consumer GPU or limited VRAM — democratizes fine-tuning |
| Prefix tuning | ~0.1% (soft tokens) | ~16GB | 85–90% | Rarely used — underperforms LoRA |
| Prompt tuning | <0.01% (prompt tokens) | ~14GB | 80–85% | Only competitive at very large model scale (>10B) |
| Adapter layers | ~0.5–3% | ~17GB | 93–96% | Works well but adds inference latency (can't be merged) |
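LoRA's quality/cost ratio comes from its low-rank parameterisation: the frozen weight W gets an additive update (alpha/r) * B @ A, where A and B are tiny. A minimal NumPy sketch (illustrative shapes, not tied to any particular library):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Linear layer with a LoRA adapter: y = x @ (W + (alpha/r) * B @ A).T
    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r)
    are trained. B is zero-initialised, so at step 0 the adapted
    layer exactly reproduces the pretrained layer."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 4096, 4096, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, random init
B = np.zeros((d_out, r))                     # trainable, zero init

x = rng.normal(size=(1, d_in))
baseline = x @ W.T                           # output of the frozen layer

full_params = d_in * d_out        # 16,777,216 in the full matrix
lora_params = r * (d_in + d_out)  # 65,536 in the adapter (~0.4%)
```

With B zero-initialised, lora_forward(x, W, A, B) matches the frozen layer exactly at the start of training, which is why LoRA fine-tuning begins from the pretrained behaviour rather than disrupting it.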

Domain-specific fine-tuning: real examples

| Domain | Model | Training data | Key result |
|---|---|---|---|
| Code | DeepSeek-Coder, Codestral | The Stack (2.8TB code), GitHub | Outperform GPT-3.5 on HumanEval despite being smaller |
| Medicine | Med-PaLM 2 (Google) | Medical texts, USMLE Q&A, clinical notes | Expert-level performance on USMLE (passing score in all categories) |
| Law | Harvey AI (GPT-4 based) | Legal documents, case law, contracts | Used by Am Law 100 law firms for contract review |
| Finance | BloombergGPT | 363B tokens of financial text + general web | Outperforms general LLMs on financial NLP benchmarks |
| Math | DeepSeek-Math, Mammoth | MATH dataset + synthetic chain-of-thought | SOTA on MATH benchmark — rivals proprietary models |
| Science | Galactica (Meta) | 48M scientific papers, references, code | Excels at scientific question answering and formula generation |

The key insight

Targeted fine-tuning on high-quality domain data often outperforms much larger general models on domain-specific tasks. A 7B model fine-tuned on 50K medical Q&A examples can outperform GPT-4 on medical licensing exams — at 1/100th the inference cost.

Practice questions

  1. What is the difference between full fine-tuning, LoRA, and prompt tuning in terms of memory requirements? (Answer: Full fine-tuning updates ALL weights and needs model weights (14GB for 7B in BF16) + gradients (14GB) + Adam optimizer states (two FP32 moments, ~56GB) ≈ 84GB for a 7B model. LoRA (r=16) freezes all weights and adds small A, B matrices: frozen weights (14GB) + LoRA gradients (~200MB) + LoRA optimizer states (~400MB) ≈ 15GB. Prompt tuning adds soft prompt tokens and trains only their embeddings: inference-level memory plus ~1MB of prompt parameters. LoRA is the practical sweet spot for single-GPU fine-tuning.)
  2. What is the learning rate recommendation for LoRA fine-tuning vs full fine-tuning? (Answer: Full fine-tuning: 1e-5 to 5e-5. Small LR needed because all weights (pretrained knowledge) are updated — too large destroys pretraining. LoRA: 1e-4 to 3e-4. Larger LR is appropriate because only the small A,B matrices (which start near zero) are updated — the frozen pretrained weights remain unchanged. LoRA matrices need to learn from scratch, requiring higher LR to converge in reasonable training time.)
  3. What is catastrophic forgetting and how does fine-tuning mitigate it? (Answer: Catastrophic forgetting: weight updates computed on new data overwrite the weights that encoded old capabilities, so the model loses previously learned skills. Full fine-tuning on a narrow dataset (e.g., SQL generation only) can severely degrade general language capabilities. Mitigations: (1) Small learning rate: minimal updates to pretrained knowledge. (2) LoRA: frozen pretrained weights cannot forget. (3) Regularisation toward original weights (EWC). (4) Replay: include some general training examples in the fine-tuning mix. (5) Short training duration (typically 1–3 epochs).)
  4. What is task-specific fine-tuning vs instruction fine-tuning, and which produces a more flexible model? (Answer: Task-specific: fine-tune on (input, output) pairs for ONE task (SQL generation, sentiment classification). Expert in that task but cannot generalise to new instructions. Instruction fine-tuning: train on thousands of diverse tasks expressed as natural language instructions. The model learns to follow novel instructions even for tasks not seen during training. Instruction-tuned models are more flexible to deploy (a single model handles many tasks) but may underperform narrow specialists on their specific domain.)
  5. Why do fine-tuned models sometimes perform worse than base models on the target task? (Answer: Common causes: (1) Too few fine-tuning examples (<100): overfitting to the small dataset. (2) Too many epochs: continued training on limited data causes memorisation. (3) LR too high: destructive updates to pretrained knowledge. (4) Training data format mismatch: the model expects a specific prompt format that doesn't match the fine-tuning examples. (5) Evaluation on wrong distribution: the fine-tuned model excels on the training distribution but test examples are formatted differently. Always validate with a held-out test set.)
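The memory accounting in question 1 can be written as a small estimator. This is a simplified model, assuming BF16 weights and gradients (2 bytes/param each) and two FP32 Adam moments (8 bytes per trainable param), and ignoring activations and framework overhead:

```python
def finetune_memory_gb(n_params, method="full", trainable_frac=0.001):
    """Rough GPU-memory estimate in GB. Assumptions: BF16 weights and
    gradients (2 bytes/param each), two FP32 Adam moments (8 bytes per
    trainable param); activations and overhead are ignored. For LoRA,
    trainable_frac defaults to ~0.1% of parameters. Illustration only,
    not a capacity planner."""
    weights = 2 * n_params
    trainable = n_params if method == "full" else int(n_params * trainable_frac)
    grads = 2 * trainable
    optimizer = 8 * trainable
    return (weights + grads + optimizer) / 1e9

print(finetune_memory_gb(7e9, "full"))  # ~84 GB: full fine-tuning of 7B
print(finetune_memory_gb(7e9, "lora"))  # ~14 GB: frozen weights + tiny adapter
```

The estimator makes the gap concrete: the optimizer states dominate full fine-tuning, and freezing the base weights removes almost all of that cost.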

On LumiChats

LumiChats' open-source models (LumiCoder-7B, LumiStudy-3B, LumiReason-13B) are all fine-tuned from base open-source models using LoRA on task-specific datasets.

