
LoRA & QLoRA

Fine-tuning billion-parameter models on a single GPU.


Definition

LoRA (Low-Rank Adaptation, Hu et al., 2021) is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen pretrained model layers. Instead of updating all billions of parameters, LoRA updates only 0.01-1% of parameters — enabling high-quality fine-tuning at a fraction of the compute and memory cost. QLoRA extends this with 4-bit quantization.

The LoRA mechanism

For each weight matrix W ∈ ℝ^{d×k} in the model, LoRA inserts two small trainable matrices B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k}, where r ≪ min(d, k). Only A and B are trained; W stays frozen. The effective weight becomes:

W' = W + (α/r)·BA

where α (lora_alpha in most libraries) is a scaling hyperparameter.

B is initialised to zero so the adapter output is zero at the start of training — the model begins as the original pretrained model. A is randomly initialised. After training, BA can be merged into W with zero inference overhead.
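A minimal NumPy sketch (toy dimensions, illustrative names) checks both properties: the adapter is a no-op at initialisation, and merging BA into W reproduces the two-matmul adapter path exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16      # toy dimensions; r << min(d, k)
scale = alpha / r                   # LoRA scaling factor

W = rng.normal(size=(d, k))         # frozen pretrained weight
A = 0.01 * rng.normal(size=(r, k))  # A: small random init
B = np.zeros((d, r))                # B: zero init, so adapter starts as a no-op
x = rng.normal(size=(k,))

# At initialisation the adapted forward pass equals the pretrained one
assert np.allclose(W @ x + scale * (B @ (A @ x)), W @ x)

# After training B is non-zero; the two-matmul adapter path and the
# merged single-matmul path give identical outputs
B = rng.normal(size=(d, r))
y_adapter = W @ x + scale * (B @ (A @ x))
W_merged = W + scale * (B @ A)      # one-time merge: zero inference overhead
y_merged = W_merged @ x
assert np.allclose(y_adapter, y_merged)
print("adapter and merged paths match")
```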

LoRA with PEFT library — fine-tuning LLaMA 3 8B

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                         # rank — higher = more capacity, more params
    lora_alpha=32,                # scaling: effective update = (alpha/r) * BA = 2x
    target_modules=[              # apply LoRA to these projection matrices
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",        # feedforward (MLP)
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52

# After training, merge adapter into base weights (zero inference overhead)
merged_model = model.merge_and_unload()

Why low-rank works: the weight change hypothesis

The core intuition: the weight changes needed for task adaptation occupy a low-dimensional subspace of the full parameter space. Two pieces of evidence:

| Evidence | Finding | Implication |
|---|---|---|
| Intrinsic dimensionality (Aghajanyan et al., 2020) | Many NLP tasks can be learned by optimising only ~200 parameters in a transformed space | Task adaptation is inherently low-dimensional |
| LoRA paper (Hu et al., 2021) | ΔW matrices during full fine-tuning have very low stable rank (singular value spectrum dominated by top-r values) | LoRA directly captures the low-rank structure — not much information is lost |
| Scaling rank r | Quality plateaus at r=8–16 for most tasks; larger r rarely helps | The intrinsic rank of task adaptation is typically ≤ 16 |
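The low-rank structure is easy to reproduce on synthetic data: a ΔW constructed as a rank-8 "task" signal plus dense noise concentrates almost all of its spectral energy in the top 8 singular values (the matrix sizes and noise level here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Synthetic fine-tuning update: a rank-8 signal plus dense noise
signal = rng.normal(size=(d, 8)) @ rng.normal(size=(8, d))
delta_w = signal + 0.5 * rng.normal(size=(d, d))

s = np.linalg.svd(delta_w, compute_uv=False)
top8 = (s[:8] ** 2).sum() / (s ** 2).sum()
print(f"spectral energy in top-8 singular values: {top8:.1%}")  # well above 90%
```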

Practical implication

LoRA with r=16 on a 7B model trains ~40M parameters (~0.5%) but achieves 95–98% of full fine-tuning quality. The adapter file is ~150MB — trivial to distribute, store, and switch between. One base model can host dozens of task-specific LoRA adapters, switched at runtime.
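The parameter arithmetic is straightforward to check: each adapted linear layer of shape out × in adds r·(in + out) trainable parameters. A rough count for a LLaMA-3-8B-style decoder (hidden 4096, MLP 14336, grouped-query KV dim 1024, 32 layers; these architecture figures are approximate):

```python
def lora_params(r, layers=32, hidden=4096, mlp=14336, kv=1024):
    """Approximate LoRA trainable params for a LLaMA-3-8B-style decoder.
    Each adapted linear of shape out x in adds r * (in + out) parameters."""
    per_layer = (
        r * (hidden + hidden) * 2   # q_proj, o_proj
        + r * (hidden + kv) * 2     # k_proj, v_proj (grouped-query attention)
        + r * (hidden + mlp) * 3    # gate_proj, up_proj, down_proj
    )
    return per_layer * layers

for r in (8, 16):
    print(f"r={r}: ~{lora_params(r) / 1e6:.1f}M trainable parameters")
# r=8 -> ~21.0M; r=16 -> ~41.9M
```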

LoRA hyperparameters and best practices

| Hyperparameter | Typical values | Effect | Guidance |
|---|---|---|---|
| Rank r | 4, 8, 16, 64 | Capacity of adapter — higher rank = more parameters | Start with r=8; increase to 16–64 only for complex tasks |
| Alpha (α) | Same as r, or 2×r | Scaling: update magnitude = (α/r) × BA | α=r gives scale=1×; α=2r gives 2× — often better in practice |
| Target modules | All attention + FFN projections | Which matrices get adapters | Apply to all linear layers for best quality |
| Dropout | 0.0–0.1 | Regularisation on adapter matrices | Use 0.05 for small datasets; 0.0 for large datasets |
| Learning rate | 1e-4 to 5e-4 | Step size for adapter updates only | Higher LR than full FT is fine — adapters start from zero |

rsLoRA for high ranks

Standard LoRA scales by α/r — as r increases, the update magnitude decreases (instability at high r). rsLoRA (rank-stabilised LoRA) scales by α/√r instead, enabling stable training at r=128 or higher. Useful when a task genuinely needs higher capacity.
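The difference between the two scaling rules is easy to see numerically (α fixed at 16 for illustration): the standard α/r factor shrinks quickly as r grows, while α/√r decays much more slowly.

```python
import math

alpha = 16
for r in (8, 32, 128):
    lora = alpha / r                 # standard LoRA: shrinks as r grows
    rslora = alpha / math.sqrt(r)    # rsLoRA: decays much more slowly
    print(f"r={r:3d}  alpha/r={lora:.3f}  alpha/sqrt(r)={rslora:.3f}")
# At r=128 the standard scale is 0.125 (tiny updates) vs 1.414 for rsLoRA
```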

QLoRA: fine-tuning 70B models on one GPU

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization, making it possible to fine-tune very large models on a single GPU:

| Innovation | What it does | Memory saving |
|---|---|---|
| NF4 (Normal Float 4-bit) | Data type optimised for normally-distributed weights — preserves more precision at the distribution centre | 4× smaller than FP16 weights |
| Double quantization | Quantises the quantisation constants themselves | Extra ~0.4 bits/parameter saved |
| Paged optimisers | Move optimiser states (Adam moments) to CPU RAM during peak GPU usage | Prevents OOM on memory spikes |
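The core mechanics can be sketched with a simplified blockwise absmax quantiser. Note this sketch uses uniform 4-bit levels, not the actual NF4 codebook (NF4 spaces its 16 levels by normal-distribution quantiles, spending more precision near zero); block size and helper names are illustrative:

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax 4-bit quantiser (uniform levels for simplicity;
    real NF4 uses normal-quantile-spaced levels)."""
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)  # one float constant per block
    q = np.clip(np.round(w / absmax * 7), -8, 7).astype(np.int8)  # 16 levels
    return q, absmax

def dequantize_4bit(q, absmax):
    # QLoRA dequantises blocks like this on the fly during forward/backward
    return (q.astype(np.float32) / 7) * absmax

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)       # weights roughly ~ N(0, 1)
q, absmax = quantize_4bit(w)
w_hat = dequantize_4bit(q, absmax).reshape(-1)
print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```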

QLoRA: fine-tune LLaMA 3 70B on a single 48GB GPU

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# 4-bit NF4 quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,     # double quantisation
)

# Load 70B model quantised to 4-bit (~35GB instead of 140GB)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters in BF16 on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Peak GPU memory: ~40GB — fits on a single 48GB GPU (e.g. RTX A6000 / L40S),
# or sharded across several smaller GPUs via device_map

LoRA vs full fine-tuning vs alternatives

| Method | Params updated | GPU for 7B | Quality vs full FT | Inference overhead |
|---|---|---|---|---|
| Full fine-tuning | 100% | ~80GB (FP16 + Adam) | 100% | None |
| LoRA (r=8) | ~0.5% | ~16GB | 95–98% | None (merge after training) |
| QLoRA (4-bit + LoRA) | ~0.5% | ~6GB | 92–96% | None (merge after training) |
| Adapter layers | ~1–3% | ~17GB | 93–96% | ⚠️ Small latency — can't be merged |
| Prefix tuning | <0.1% | ~14GB | 80–88% | None, but consumes context tokens |
| Prompt tuning | <0.01% | ~14GB | 70–85% | None, but only works at large scale |
| DoRA (Weight-Decomposed LoRA) | ~0.5% | ~16GB | 97–99% | None (merge after training) |

DoRA in 2025

DoRA (Liu et al., 2024) decomposes each weight matrix into a magnitude and a direction component, applying LoRA only to the direction. With essentially the same parameter count, it consistently outperforms plain LoRA by 1–3% on most tasks, and is increasingly recommended as a default in PEFT workflows where slightly better quality is worth the marginal implementation complexity.
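A NumPy sketch of the decomposition (toy dimensions; a simplified rendering of the idea, not the reference implementation). Per-column norms serve as the magnitude vector, the LoRA update touches only the normalised direction, and renormalisation keeps magnitudes under the control of m alone:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 8
W = rng.normal(size=(d, k))                     # pretrained weight

# Decompose: per-column magnitude (trainable vector) and unit direction
m = np.linalg.norm(W, axis=0, keepdims=True)    # shape (1, k)
V = W / m                                       # unit-norm columns
assert np.allclose(m * V, W)                    # decomposition is exact

# Adapt only the direction with a LoRA update, then renormalise
B = 0.01 * rng.normal(size=(d, r))
A = 0.01 * rng.normal(size=(r, k))
V_new = V + B @ A
W_new = m * (V_new / np.linalg.norm(V_new, axis=0, keepdims=True))

# Column magnitudes are set by m alone, not by the LoRA update
assert np.allclose(np.linalg.norm(W_new, axis=0), m.ravel())
print("DoRA-style decomposition check passed")
```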

Practice questions

  1. What is the mathematical justification for LoRA's low-rank assumption? (Answer: The hypothesis: weight updates ΔW = W_finetuned - W_pretrained have low intrinsic rank. Empirical evidence (Aghajanyan et al. 2020): fine-tuned models can be represented in low-dimensional intrinsic subspaces. The pretrained model already captures most of the relevant structure; task-specific adaptation requires a low-rank modification. LoRA decomposes ΔW = BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r << min(d,k). Hypothesis confirmed: r=4–16 achieves near full-rank fine-tuning quality on most NLP tasks.)
  2. How does LoRA merge adapters at inference time for zero overhead? (Answer: LoRA adds ΔW = BA to the frozen W₀. At inference: output = (W₀ + BA)x = W₀x + BAx. Pre-merging: must compute both W₀x and BAx, then add — two matrix multiplications. Post-merging: compute W_merged = W₀ + BA (one-time operation), then output = W_merged × x — single matrix multiplication, identical speed to original model. Merging is a pre-deployment step: load LoRA weights, add to base weights, save. The merged model is indistinguishable from full fine-tuning at serving time.)
  3. What is the effect of LoRA rank r on model capacity and training stability? (Answer: Low r (1–4): very few trainable parameters (~100K for 7B model). Fast, memory-efficient, less risk of overfitting. May underfit complex tasks. Medium r (8–32): standard range for most fine-tuning. Balances capacity and efficiency. High r (64–256): approaches full fine-tuning capacity. More parameters but still much cheaper than full FT. Stability: very high r can cause instability if learning rate is not reduced proportionally. Practical guideline: start with r=16, tune if underfitting/overfitting observed.)
  4. What is the difference between applying LoRA to attention weights only vs all linear layers? (Answer: Attention-only LoRA (q_proj, v_proj, k_proj, o_proj): targets the self-attention mechanism — where most task-specific information integration happens. Fewer parameters, faster. Original LoRA paper used q, v projections only. All-linear LoRA (attention + MLP + embeddings): more capacity to adapt. Usually better accuracy on complex tasks requiring deep factual changes. Memory cost: ~4× more LoRA parameters. For instruction following and style: attention-only sufficient. For domain knowledge adaptation (medical, legal): all-linear LoRA recommended.)
  5. What is QLoRA and how does it enable 65B model fine-tuning on a single GPU? (Answer: QLoRA (Dettmers et al. 2023): (1) Load base model in 4-bit NF4 (Normal Float 4) quantisation — 65B model: ~35GB instead of ~130GB. (2) Add LoRA adapters in BF16. (3) Train only LoRA adapters (base model frozen and quantised). (4) During forward/backward pass: dequantise base weights to BF16 on the fly for computation, then discard. Memory: 35GB (quantised base) + 4GB (LoRA + optimizer states) ≈ 39GB — fits on a single 40GB A100. Quantisation adds <1% performance loss vs FP16 LoRA on most tasks.)

