Modern LLM fine-tuning uses a streamlined stack: Unsloth (2-4× faster training, 60% less VRAM via custom CUDA kernels), LoRA/QLoRA (train roughly 0.1-2% of parameters), and Hugging Face TRL (GRPO, PPO, and SFT trainers). A 7B model that previously required 4× A100s (80GB each) can now be fine-tuned on a single 16GB consumer GPU. This democratisation means anyone can customise a state-of-the-art model for their specific domain: legal documents, medical Q&A, custom personas, or specialised code generation.
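The LoRA idea fits in one equation: instead of updating the full weight matrix W, train a small low-rank pair A and B so the effective weight is W + (α/r)·B·A. A minimal numpy sketch of the forward pass (dimensions here are illustrative, not a real model's):

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 8            # toy dimensions for illustration
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init -> delta starts at 0

x = rng.standard_normal(d_in)
h = W @ x + (alpha / r) * (B @ (A @ x))         # LoRA forward pass

# Because B starts at zero, the adapter changes nothing at initialisation:
assert np.allclose(h, W @ x)

# Trainable fraction: two r-sized matrices vs the full d_out x d_in matrix
trainable = A.size + B.size                     # r * (d_in + d_out) = 1024
total = W.size                                  # 4096
print(f"trainable fraction: {trainable / total:.1%}")  # large here; ~0.1-2% at real model scale
```

At real model widths (thousands of dimensions) the same construction yields the tiny trainable fractions quoted above.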
The complete fine-tuning stack in 2025
Complete supervised fine-tuning with Unsloth (SFT pattern)
# ═══════════════════════════════════════════════════════════
# Complete SFT (Supervised Fine-Tuning) workflow with Unsloth
# Based on patterns from the Granite/Qwen/Llama notebooks
# Runs on free Google Colab T4 (16GB VRAM)
# ═══════════════════════════════════════════════════════════
# Step 0: Install
# pip install unsloth trl transformers datasets bitsandbytes
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset
import torch
# ── Step 1: Load model with automatic optimisations ──
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",  # or Qwen3, Granite4, etc.
    max_seq_length = 2048,  # Max tokens per example
    dtype = None,           # Auto-detect: BF16 on Ampere+, FP16 on older
    load_in_4bit = True,    # 4-bit quantisation: 7B uses ~4GB VRAM instead of 14GB
)
# ── Step 2: Add LoRA adapters ──
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,              # LoRA rank. 8-64 common. Higher = more params = more capacity
    lora_alpha = 16,     # Scaling factor = lora_alpha/r. Often set equal to r.
    lora_dropout = 0.0,  # 0 works well for LoRA (unlike vanilla dropout)
    bias = "none",       # Recommended: no bias in LoRA layers
    target_modules = [   # Which attention/MLP matrices to add LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention matrices
        "gate_proj", "up_proj", "down_proj",     # MLP matrices
    ],
    use_gradient_checkpointing = "unsloth",  # Saves VRAM at cost of slight speed
)
# Show trainable parameter count
model.print_trainable_parameters()
# trainable params: 27,262,976 / 1,235,814,400 = 2.21% (for 1B model)
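The count follows a simple rule: each adapted weight of shape (d_out, d_in) contributes r·(d_in + d_out) parameters (its A and B matrices). A quick sketch with illustrative Llama-3.2-1B shapes (hidden size 2048, GQA key/value dim 512, MLP 8192; the exact total depends on your model's config and which modules you target):

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

r = 16
# Shapes assumed from Llama-3.2-1B's config; verify against your model
print(lora_params(2048, 2048, r))   # q_proj / o_proj: 65,536 each
print(lora_params(512, 2048, r))    # k_proj / v_proj (GQA): 40,960 each
print(lora_params(8192, 2048, r))   # gate_proj / up_proj / down_proj: 163,840 each
```

Doubling r doubles every adapter's size, which is why rank is the main capacity-vs-memory knob.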
# ── Step 3: Prepare dataset with chat template ──
# Use standard chat format that matches the model's instruction template
train_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful SQL expert."},
            {"role": "user", "content": "How do I get the top 10 customers by revenue?"},
            {"role": "assistant", "content": (
                "Use ORDER BY with LIMIT:\n"
                "SELECT customer_id, SUM(amount) AS revenue\n"
                "FROM orders\n"
                "GROUP BY customer_id\n"
                "ORDER BY revenue DESC\n"
                "LIMIT 10;"
            )},
        ]
    },
    # Add thousands more examples...
]
dataset = Dataset.from_list(train_data)
# Apply chat template to format as model expects
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}
dataset = dataset.map(format_chat)
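apply_chat_template concatenates the messages into one training string wrapped in the model's special tokens. As a toy stand-in for what such a template does (the markers below are made up for illustration; real Llama/Qwen templates use different tokens, so always rely on the tokenizer rather than hand-rolled strings):

```python
def toy_chat_template(messages, add_generation_prompt=False):
    # Simplified illustration of a chat template; NOT a real model's format
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to complete
        parts.append("<|assistant|>\n")
    return "\n".join(parts)

text = toy_chat_template([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
print(text)
```

This is why `add_generation_prompt=False` is correct for training data (complete turns) while inference uses `True` (open assistant turn).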
# ── Step 4: Configure and run SFT trainer ──
sft_config = SFTConfig(
    output_dir = "./sft_output",
    dataset_text_field = "text",
    max_seq_length = 2048,
    num_train_epochs = 3,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,  # Effective batch = 2 × 4 = 8
    warmup_steps = 10,
    learning_rate = 2e-4,             # Higher LR for LoRA vs full FT
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    bf16 = True,                      # BF16 on Ampere+; on older GPUs (e.g. Colab T4) set fp16=True, bf16=False
    fp16 = False,
    logging_steps = 10,
    save_steps = 100,
    report_to = "none",               # "tensorboard" or "wandb" for tracking
)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = sft_config,
)
# ── Step 5: Train ──
trainer.train()
# ── Step 6: Save and use the fine-tuned model ──
# Save LoRA adapters only (tiny: ~50MB for 7B model)
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")
# For GGUF/Ollama deployment (quantise for local inference)
model.save_pretrained_gguf("./gguf_model", tokenizer,
    quantization_method = "q4_k_m")  # 4-bit quantisation for CPU inference
# ── Step 7: Inference with the fine-tuned model ──
FastLanguageModel.for_inference(model) # Enable 2× faster inference mode
inputs = tokenizer([
    tokenizer.apply_chat_template([
        {"role": "user", "content": "How do I join two tables in SQL?"}
    ], tokenize=False, add_generation_prompt=True)
], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Vision fine-tuning with Unsloth (from Qwen/PaddleOCR notebooks)
Vision-language model fine-tuning pattern
# Based on Qwen3_5__2B__Vision.ipynb and Qwen3_5__4B__Vision.ipynb notebooks
# Fine-tuning a vision-language model (VLM) on image+text tasks
from unsloth import FastVisionModel # Vision-specific fast loading
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "unsloth/Qwen2-VL-2B-Instruct",  # 2B vision-language model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
# Choose which components to fine-tune; freezing the vision encoder is faster and uses less VRAM
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,     # Train the vision encoder too (set False to keep it frozen)
    finetune_language_layers = True,   # Train the language decoder
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16, lora_alpha = 16,
)
# Vision training dataset format
vision_data = [
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "path/to/invoice.jpg"},
                    {"type": "text", "text": "Extract all line items and totals from this invoice."},
                ],
            },
            {
                "role": "assistant",
                "content": (
                    "Line Items:\n"
                    "1. Product A: $50.00\n"
                    "2. Service B: $120.00\n"
                    "Total: $170.00"
                ),
            },
        ]
    }
]
# Use cases: document OCR, chart understanding, medical imaging, receipt extraction

| Model | Params | VRAM (4-bit) | Task | Notebook |
|---|---|---|---|---|
| Llama-3.2-1B-Instruct-FP8 | 1B | ~2GB | Reasoning (GRPO) | Llama_FP8_GRPO.ipynb |
| Qwen2-VL-2B-Instruct | 2B | ~2GB | Vision+Text | Qwen3_5__2B__Vision.ipynb |
| Qwen2-VL-4B-Instruct | 4B | ~4GB | Vision+Text | Qwen3_5__4B__Vision.ipynb |
| Granite-4.0-2B | 2B | ~2GB | Code+Reasoning | Granite4_0.ipynb |
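The VRAM column follows a rough rule of thumb: weight memory ≈ parameters × bytes per weight, plus overhead for activations, the KV cache, and the CUDA context. A hedged estimator (real usage varies with sequence length and batch size):

```python
def est_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes); excludes activations and overhead."""
    return n_params * bits_per_weight / 8 / 2**30

for n, name in [(1e9, "1B"), (2e9, "2B"), (7e9, "7B")]:
    print(f"{name}: {est_weight_gb(n, 16):.1f} GB bf16 vs {est_weight_gb(n, 4):.1f} GB 4-bit")
```

The 4× saving from 16-bit to 4-bit weights is what moves a 7B model from multi-GPU territory onto a single consumer card.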
Practice questions
- A 7B model with load_in_4bit=True uses how much VRAM? (Answer: ~4-5GB. Standard BF16 = 2 bytes × 7B = 14GB. 4-bit = 0.5 bytes × 7B = 3.5GB + overhead ≈ 4-5GB. LoRA adapters add ~200MB. Total fits in a 6-8GB consumer GPU (RTX 3060, 3070).)
- What does use_gradient_checkpointing="unsloth" do? (Answer: Instead of storing all intermediate activations in VRAM during the forward pass (needed for backprop), gradient checkpointing recomputes them during the backward pass. Trades compute for memory: ~30-40% more computation but ~60% less VRAM. "unsloth" mode is Unsloth's optimised implementation that saves more memory with less compute overhead.)
- Why is learning_rate=2e-4 for LoRA fine-tuning higher than 2e-5 for full fine-tuning? (Answer: LoRA only updates 0.1-2% of parameters. The small LoRA matrices (A and B) start at zero and need a larger learning rate to learn meaningful representations quickly. Full fine-tuning updates all parameters from a good starting point, requiring small LR to avoid catastrophic forgetting.)
- What is the difference between saving LoRA adapters vs saving the full merged model? (Answer: LoRA adapters: ~50-200MB (just the small A and B matrices). Load: requires base model + adapter. Merged model: full model with adapters mathematically merged back into W. Load: just one model file. Use adapters for: flexibility (swap adapters), storage efficiency. Use merged for: simple deployment, sharing.)
- save_pretrained_gguf with quantization_method="q4_k_m" — what does this produce? (Answer: GGUF format with Q4_K_M quantization (~4.5 bits per weight on average). Compatible with llama.cpp and Ollama for local CPU/GPU inference. A 7B model becomes ~4-5GB. Q4_K_M uses "K-quant" which preserves more precision for important weights. Good balance of size vs quality for local deployment.)
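The "~50MB adapter" figure in the merged-vs-adapter answer checks out arithmetically: the adapter checkpoint stores only the A and B matrices, typically in 16-bit precision. A quick sketch using the trainable-parameter count printed earlier:

```python
def adapter_mb(trainable_params: int, bytes_per_param: int = 2) -> float:
    # LoRA checkpoint size: just the A and B matrices, usually saved in fp16/bf16
    return trainable_params * bytes_per_param / 2**20

print(f"{adapter_mb(27_262_976):.0f} MB")  # the 1B example's 27M trainable params -> ~52 MB
```

Larger models or higher ranks push this toward the ~200MB end of the range, still orders of magnitude smaller than a merged checkpoint.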
On LumiChats
The fine-tuning pattern described here — Unsloth + LoRA + TRL — is used by thousands of researchers and developers to create custom versions of LLMs. LumiChats Study Mode and domain-specific features are built on the same fine-tuning paradigm applied at production scale.