Modern LLM fine-tuning uses a streamlined stack: Unsloth (2-4× faster training, 60% less VRAM via custom CUDA kernels), LoRA/QLoRA (train roughly 0.1-2% of parameters), and Hugging Face TRL (GRPO, PPO, and SFT trainers). A 7B model that previously required 4× A100s (80GB each) can now be fine-tuned on a single 16GB consumer GPU. This democratisation means anyone can customise a state-of-the-art model for their specific domain: legal documents, medical Q&A, custom personas, or specialised code generation.
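The LoRA idea fits in one equation: instead of updating the full weight matrix W, train a small low-rank pair A and B so the effective weight is W + (α/r)·B·A. A minimal numpy sketch of the forward pass (dimensions here are illustrative, not a real model's):

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 8            # toy dimensions for illustration
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init -> delta starts at 0

x = rng.standard_normal(d_in)
h = W @ x + (alpha / r) * (B @ (A @ x))         # LoRA forward pass

# Because B starts at zero, the adapter changes nothing at initialisation:
assert np.allclose(h, W @ x)

# Trainable fraction: two r-sized matrices vs the full d_out x d_in matrix
trainable = A.size + B.size                     # r * (d_in + d_out) = 1024
total = W.size                                  # 4096
print(f"trainable fraction: {trainable / total:.1%}")  # large here; ~0.1-2% at real model scale
```

At real model widths (thousands of dimensions) the same construction yields the tiny trainable fractions quoted above.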
The complete fine-tuning stack in 2025
Complete supervised fine-tuning with Unsloth (SFT pattern)
# ═══════════════════════════════════════════════════════════
# Complete SFT (Supervised Fine-Tuning) workflow with Unsloth
# Based on patterns from the Granite/Qwen/Llama notebooks
# Runs on free Google Colab T4 (16GB VRAM)
# ═══════════════════════════════════════════════════════════
# Step 0: Install
# pip install unsloth trl transformers datasets bitsandbytes
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset
import torch
# ── Step 1: Load model with automatic optimisations ──
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",  # or Qwen3, Granite4, etc.
    max_seq_length = 2048,  # Max tokens per example
    dtype = None,           # Auto-detect: BF16 on Ampere+, FP16 on older
    load_in_4bit = True,    # 4-bit quantisation: 7B uses ~4GB VRAM instead of 14GB
)
# ── Step 2: Add LoRA adapters ──
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,              # LoRA rank. 8-64 common. Higher = more params = more capacity
    lora_alpha = 16,     # Scaling factor = lora_alpha/r. Often set equal to r.
    lora_dropout = 0.0,  # 0 works well for LoRA (unlike vanilla dropout)
    bias = "none",       # Recommended: no bias in LoRA layers
    target_modules = [   # Which attention/MLP matrices to add LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention matrices
        "gate_proj", "up_proj", "down_proj",     # MLP matrices
    ],
    use_gradient_checkpointing = "unsloth",  # Saves VRAM at cost of slight speed
)
# Show trainable parameter count
model.print_trainable_parameters()
# trainable params: 27,262,976 / 1,235,814,400 = 2.21% (for 1B model)
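The count follows a simple rule: each adapted weight of shape (d_out, d_in) contributes r·(d_in + d_out) parameters (its A and B matrices). A quick sketch with illustrative Llama-3.2-1B shapes (hidden size 2048, GQA key/value dim 512, MLP 8192; the exact total depends on your model's config and which modules you target):

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

r = 16
# Shapes assumed from Llama-3.2-1B's config; verify against your model
print(lora_params(2048, 2048, r))   # q_proj / o_proj: 65,536 each
print(lora_params(512, 2048, r))    # k_proj / v_proj (GQA): 40,960 each
print(lora_params(8192, 2048, r))   # gate_proj / up_proj / down_proj: 163,840 each
```

Doubling r doubles every adapter's size, which is why rank is the main capacity-vs-memory knob.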
# ── Step 3: Prepare dataset with chat template ──
# Use standard chat format that matches the model's instruction template
train_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful SQL expert."},
            {"role": "user", "content": "How do I get the top 10 customers by revenue?"},
            {"role": "assistant", "content": (
                "Use ORDER BY with LIMIT:\n"
                "SELECT customer_id, SUM(amount) AS revenue\n"
                "FROM orders\n"
                "GROUP BY customer_id\n"
                "ORDER BY revenue DESC\n"
                "LIMIT 10;"
            )},
        ]
    },
    # Add thousands more examples...
]
dataset = Dataset.from_list(train_data)
# Apply chat template to format as model expects
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}
dataset = dataset.map(format_chat)
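apply_chat_template concatenates the messages into one training string wrapped in the model's special tokens. As a toy stand-in for what such a template does (the markers below are made up for illustration; real Llama/Qwen templates use different tokens, so always rely on the tokenizer rather than hand-rolled strings):

```python
def toy_chat_template(messages, add_generation_prompt=False):
    # Simplified illustration of a chat template; NOT a real model's format
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to complete
        parts.append("<|assistant|>\n")
    return "\n".join(parts)

text = toy_chat_template([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
print(text)
```

This is why `add_generation_prompt=False` is correct for training data (complete turns) while inference uses `True` (open assistant turn).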
# ── Step 4: Configure and run SFT trainer ──
sft_config = SFTConfig(
    output_dir = "./sft_output",
    dataset_text_field = "text",
    max_seq_length = 2048,
    num_train_epochs = 3,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,  # Effective batch = 2 × 4 = 8
    warmup_steps = 10,
    learning_rate = 2e-4,             # Higher LR for LoRA vs full FT
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    bf16 = True,                      # BF16 on Ampere+; on older GPUs (e.g. Colab T4) set fp16=True, bf16=False
    fp16 = False,
    logging_steps = 10,
    save_steps = 100,
    report_to = "none",               # "tensorboard" or "wandb" for tracking
)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = sft_config,
)
# ── Step 5: Train ──
trainer.train()
# ── Step 6: Save and use the fine-tuned model ──
# Save LoRA adapters only (tiny: ~50MB for 7B model)
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")
# For GGUF/Ollama deployment (quantise for local inference)
model.save_pretrained_gguf("./gguf_model", tokenizer,
    quantization_method = "q4_k_m")  # 4-bit quantisation for CPU inference
# ── Step 7: Inference with the fine-tuned model ──
FastLanguageModel.for_inference(model) # Enable 2× faster inference mode
inputs = tokenizer([
    tokenizer.apply_chat_template([
        {"role": "user", "content": "How do I join two tables in SQL?"}
    ], tokenize=False, add_generation_prompt=True)
], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Vision fine-tuning with Unsloth (from Qwen/PaddleOCR notebooks)
Vision-language model fine-tuning pattern
# Based on Qwen3_5__2B__Vision.ipynb and Qwen3_5__4B__Vision.ipynb notebooks
# Fine-tuning a vision-language model (VLM) on image+text tasks
from unsloth import FastVisionModel # Vision-specific fast loading
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "unsloth/Qwen2-VL-2B-Instruct",  # 2B vision-language model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
# Choose which components to fine-tune; freezing the vision encoder is faster and uses less VRAM
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,     # Train the vision encoder too (set False to keep it frozen)
    finetune_language_layers = True,   # Train the language decoder
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16, lora_alpha = 16,
)
# Vision training dataset format
vision_data = [
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "path/to/invoice.jpg"},
                    {"type": "text", "text": "Extract all line items and totals from this invoice."},
                ],
            },
            {
                "role": "assistant",
                "content": (
                    "Line Items:\n"
                    "1. Product A: $50.00\n"
                    "2. Service B: $120.00\n"
                    "Total: $170.00"
                ),
            },
        ]
    }
]
# Use cases: document OCR, chart understanding, medical imaging, receipt extraction

| Model | Params | VRAM (4-bit) | Task | Notebook |
|---|---|---|---|---|
| Llama-3.2-1B-Instruct-FP8 | 1B | ~2GB | Reasoning (GRPO) | Llama_FP8_GRPO.ipynb |
| Qwen2-VL-2B-Instruct | 2B | ~2GB | Vision+Text | Qwen3_5__2B__Vision.ipynb |
| Qwen2-VL-4B-Instruct | 4B | ~4GB | Vision+Text | Qwen3_5__4B__Vision.ipynb |
| Granite-4.0-2B | 2B | ~2GB | Code+Reasoning | Granite4_0.ipynb |
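The VRAM column follows a rough rule of thumb: weight memory ≈ parameters × bytes per weight, plus overhead for activations, the KV cache, and the CUDA context. A hedged estimator (real usage varies with sequence length and batch size):

```python
def est_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes); excludes activations and overhead."""
    return n_params * bits_per_weight / 8 / 2**30

for n, name in [(1e9, "1B"), (2e9, "2B"), (7e9, "7B")]:
    print(f"{name}: {est_weight_gb(n, 16):.1f} GB bf16 vs {est_weight_gb(n, 4):.1f} GB 4-bit")
```

The 4× saving from 16-bit to 4-bit weights is what moves a 7B model from multi-GPU territory onto a single consumer card.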
Practice questions
- A 7B model with load_in_4bit=True uses how much VRAM? (Answer: ~4-5GB. Standard BF16 = 2 bytes × 7B = 14GB. 4-bit = 0.5 bytes × 7B = 3.5GB + overhead ≈ 4-5GB. LoRA adapters add ~200MB. Total fits in a 6-8GB consumer GPU (RTX 3060, 3070).)
- What does use_gradient_checkpointing="unsloth" do? (Answer: Instead of storing all intermediate activations in VRAM during the forward pass (needed for backprop), gradient checkpointing recomputes them during the backward pass. Trades compute for memory: ~30-40% more computation but ~60% less VRAM. "unsloth" mode is Unsloth's optimised implementation that saves more memory with less compute overhead.)
- Why is learning_rate=2e-4 for LoRA fine-tuning higher than 2e-5 for full fine-tuning? (Answer: LoRA only updates 0.1-2% of parameters. The small LoRA matrices (A and B) start at zero and need a larger learning rate to learn meaningful representations quickly. Full fine-tuning updates all parameters from a good starting point, requiring small LR to avoid catastrophic forgetting.)
- What is the difference between saving LoRA adapters vs saving the full merged model? (Answer: LoRA adapters: ~50-200MB (just the small A and B matrices). Load: requires base model + adapter. Merged model: full model with adapters mathematically merged back into W. Load: just one model file. Use adapters for: flexibility (swap adapters), storage efficiency. Use merged for: simple deployment, sharing.)
- save_pretrained_gguf with quantization_method="q4_k_m" — what does this produce? (Answer: GGUF format with Q4_K_M quantization (~4.5 bits per weight on average). Compatible with llama.cpp and Ollama for local CPU/GPU inference. A 7B model becomes ~4-5GB. Q4_K_M uses "K-quant" which preserves more precision for important weights. Good balance of size vs quality for local deployment.)
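The "~50MB adapter" figure in the merged-vs-adapter answer checks out arithmetically: the adapter checkpoint stores only the A and B matrices, typically in 16-bit precision. A quick sketch using the trainable-parameter count printed earlier:

```python
def adapter_mb(trainable_params: int, bytes_per_param: int = 2) -> float:
    # LoRA checkpoint size: just the A and B matrices, usually saved in fp16/bf16
    return trainable_params * bytes_per_param / 2**20

print(f"{adapter_mb(27_262_976):.0f} MB")  # the 1B example's 27M trainable params -> ~52 MB
```

Larger models or higher ranks push this toward the ~200MB end of the range, still orders of magnitude smaller than a merged checkpoint.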
On LumiChats
The fine-tuning pattern described here — Unsloth + LoRA + TRL — is used by thousands of researchers and developers to create custom versions of LLMs. LumiChats Study Mode and domain-specific features are built on the same fine-tuning paradigm applied at production scale.