
LoRA & QLoRA

Fine-tuning billion-parameter models on a single GPU.


Definition

LoRA (Low-Rank Adaptation, Hu et al., 2021) is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen pretrained model layers. Instead of updating all billions of parameters, LoRA updates only 0.01-1% of parameters — enabling high-quality fine-tuning at a fraction of the compute and memory cost. QLoRA extends this with 4-bit quantization.

The LoRA mechanism

For each weight matrix W ∈ ℝ^{d×k} in the model, LoRA inserts two small trainable matrices B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k}, where r ≪ min(d, k). Only A and B are trained; W stays frozen. The effective weight becomes:

W' = W + (α/r)·BA

where α (lora_alpha in most libraries) is a scaling hyperparameter.

B is initialised to zero so the adapter output is zero at the start of training — the model begins as the original pretrained model. A is randomly initialised. After training, BA can be merged into W with zero inference overhead.
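A minimal NumPy sketch (toy dimensions, illustrative names) checks both properties: the adapter is a no-op at initialisation, and merging BA into W reproduces the two-matmul adapter path exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16      # toy dimensions; r << min(d, k)
scale = alpha / r                   # LoRA scaling factor

W = rng.normal(size=(d, k))         # frozen pretrained weight
A = 0.01 * rng.normal(size=(r, k))  # A: small random init
B = np.zeros((d, r))                # B: zero init, so adapter starts as a no-op
x = rng.normal(size=(k,))

# At initialisation the adapted forward pass equals the pretrained one
assert np.allclose(W @ x + scale * (B @ (A @ x)), W @ x)

# After training B is non-zero; the two-matmul adapter path and the
# merged single-matmul path give identical outputs
B = rng.normal(size=(d, r))
y_adapter = W @ x + scale * (B @ (A @ x))
W_merged = W + scale * (B @ A)      # one-time merge: zero inference overhead
y_merged = W_merged @ x
assert np.allclose(y_adapter, y_merged)
print("adapter and merged paths match")
```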

LoRA with PEFT library — fine-tuning LLaMA 3 8B

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                         # rank — higher = more capacity, more params
    lora_alpha=32,                # scaling: effective update = (alpha/r) * BA = 2x
    target_modules=[              # apply LoRA to these projection matrices
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",        # feedforward (MLP)
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52

# After training, merge adapter into base weights (zero inference overhead)
merged_model = model.merge_and_unload()

Why low-rank works: the weight change hypothesis

The core intuition: the weight changes needed for task adaptation occupy a low-dimensional subspace of the full parameter space. Two pieces of evidence:

| Evidence | Finding | Implication |
|---|---|---|
| Intrinsic dimensionality (Aghajanyan et al., 2020) | Many NLP tasks can be learned by optimising only ~200 parameters in a transformed space | Task adaptation is inherently low-dimensional |
| LoRA paper (Hu et al., 2021) | ΔW matrices during full fine-tuning have very low stable rank (singular value spectrum dominated by top-r values) | LoRA directly captures the low-rank structure — not much information is lost |
| Scaling rank r | Quality plateaus at r=8–16 for most tasks; larger r rarely helps | The intrinsic rank of task adaptation is typically ≤ 16 |
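The low-rank structure is easy to reproduce on synthetic data: a ΔW constructed as a rank-8 "task" signal plus dense noise concentrates almost all of its spectral energy in the top 8 singular values (the matrix sizes and noise level here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Synthetic fine-tuning update: a rank-8 signal plus dense noise
signal = rng.normal(size=(d, 8)) @ rng.normal(size=(8, d))
delta_w = signal + 0.5 * rng.normal(size=(d, d))

s = np.linalg.svd(delta_w, compute_uv=False)
top8 = (s[:8] ** 2).sum() / (s ** 2).sum()
print(f"spectral energy in top-8 singular values: {top8:.1%}")  # well above 90%
```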

Practical implication

LoRA with r=16 on a 7B model trains ~40M parameters (~0.5%) but achieves 95–98% of full fine-tuning quality. The adapter file is ~150MB — trivial to distribute, store, and switch between. One base model can host dozens of task-specific LoRA adapters, switched at runtime.
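The parameter arithmetic is straightforward to check: each adapted linear layer of shape out × in adds r·(in + out) trainable parameters. A rough count for a LLaMA-3-8B-style decoder (hidden 4096, MLP 14336, grouped-query KV dim 1024, 32 layers; these architecture figures are approximate):

```python
def lora_params(r, layers=32, hidden=4096, mlp=14336, kv=1024):
    """Approximate LoRA trainable params for a LLaMA-3-8B-style decoder.
    Each adapted linear of shape out x in adds r * (in + out) parameters."""
    per_layer = (
        r * (hidden + hidden) * 2   # q_proj, o_proj
        + r * (hidden + kv) * 2     # k_proj, v_proj (grouped-query attention)
        + r * (hidden + mlp) * 3    # gate_proj, up_proj, down_proj
    )
    return per_layer * layers

for r in (8, 16):
    print(f"r={r}: ~{lora_params(r) / 1e6:.1f}M trainable parameters")
# r=8 -> ~21.0M; r=16 -> ~41.9M
```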

LoRA hyperparameters and best practices

| Hyperparameter | Typical values | Effect | Guidance |
|---|---|---|---|
| Rank r | 4, 8, 16, 64 | Capacity of adapter — higher rank = more parameters | Start with r=8; increase to 16–64 only for complex tasks |
| Alpha (α) | Same as r, or 2×r | Scaling: update magnitude = (α/r) × BA | α=r gives scale=1×; α=2r gives 2× — often better in practice |
| Target modules | All attention + FFN projections | Which matrices get adapters | Apply to all linear layers for best quality |
| Dropout | 0.0–0.1 | Regularisation on adapter matrices | Use 0.05 for small datasets; 0.0 for large datasets |
| Learning rate | 1e-4 to 5e-4 | Step size for adapter updates only | Higher LR than full FT is fine — adapters start from zero |

rsLoRA for high ranks

Standard LoRA scales by α/r — as r increases, the update magnitude decreases (instability at high r). rsLoRA (rank-stabilised LoRA) scales by α/√r instead, enabling stable training at r=128 or higher. Useful when a task genuinely needs higher capacity.
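The difference between the two scaling rules is easy to see numerically (α fixed at 16 for illustration): the standard α/r factor shrinks quickly as r grows, while α/√r decays much more slowly.

```python
import math

alpha = 16
for r in (8, 32, 128):
    lora = alpha / r                 # standard LoRA: shrinks as r grows
    rslora = alpha / math.sqrt(r)    # rsLoRA: decays much more slowly
    print(f"r={r:3d}  alpha/r={lora:.3f}  alpha/sqrt(r)={rslora:.3f}")
# At r=128 the standard scale is 0.125 (tiny updates) vs 1.414 for rsLoRA
```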

QLoRA: fine-tuning 70B models on one GPU

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization, making it possible to fine-tune very large models on a single GPU:

| Innovation | What it does | Memory saving |
|---|---|---|
| NF4 (Normal Float 4-bit) | Data type optimised for normally-distributed weights — preserves more precision at the distribution centre | 4× smaller than FP16 weights |
| Double quantization | Quantises the quantisation constants themselves | Extra ~0.4 bits/parameter saved |
| Paged optimisers | Move optimiser states (Adam moments) to CPU RAM during peak GPU usage | Prevents OOM on memory spikes |
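The core mechanics can be sketched with a simplified blockwise absmax quantiser. Note this sketch uses uniform 4-bit levels, not the actual NF4 codebook (NF4 spaces its 16 levels by normal-distribution quantiles, spending more precision near zero); block size and helper names are illustrative:

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax 4-bit quantiser (uniform levels for simplicity;
    real NF4 uses normal-quantile-spaced levels)."""
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)  # one float constant per block
    q = np.clip(np.round(w / absmax * 7), -8, 7).astype(np.int8)  # 16 levels
    return q, absmax

def dequantize_4bit(q, absmax):
    # QLoRA dequantises blocks like this on the fly during forward/backward
    return (q.astype(np.float32) / 7) * absmax

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)       # weights roughly ~ N(0, 1)
q, absmax = quantize_4bit(w)
w_hat = dequantize_4bit(q, absmax).reshape(-1)
print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```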

QLoRA: fine-tune LLaMA 3 70B on a single 48GB GPU

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# 4-bit NF4 quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,     # double quantisation
)

# Load 70B model quantised to 4-bit (~35GB instead of 140GB)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters in BF16 on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Peak GPU memory: ~40GB — fits on a single 48GB GPU (e.g. RTX A6000 / L40S),
# or sharded across several smaller GPUs via device_map

LoRA vs full fine-tuning vs alternatives

| Method | Params updated | GPU for 7B | Quality vs full FT | Inference overhead |
|---|---|---|---|---|
| Full fine-tuning | 100% | ~80GB (FP16 + Adam) | 100% | None |
| LoRA (r=8) | ~0.5% | ~16GB | 95–98% | None (merge after training) |
| QLoRA (4-bit + LoRA) | ~0.5% | ~6GB | 92–96% | None (merge after training) |
| Adapter layers | ~1–3% | ~17GB | 93–96% | ⚠️ Small latency — can't be merged |
| Prefix tuning | <0.1% | ~14GB | 80–88% | None, but consumes context tokens |
| Prompt tuning | <0.01% | ~14GB | 70–85% | None, but only works at large scale |
| DoRA (Weight-Decomposed LoRA) | ~0.5% | ~16GB | 97–99% | None (merge after training) |

DoRA in 2025

DoRA (Liu et al., 2024) decomposes each weight matrix into a magnitude and a direction component, applying LoRA only to the direction. With essentially the same parameter count, it consistently outperforms plain LoRA by 1–3% on most tasks, and is increasingly recommended as a default in PEFT workflows where slightly better quality is worth the marginal implementation complexity.
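A NumPy sketch of the decomposition (toy dimensions; a simplified rendering of the idea, not the reference implementation). Per-column norms serve as the magnitude vector, the LoRA update touches only the normalised direction, and renormalisation keeps magnitudes under the control of m alone:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 8
W = rng.normal(size=(d, k))                     # pretrained weight

# Decompose: per-column magnitude (trainable vector) and unit direction
m = np.linalg.norm(W, axis=0, keepdims=True)    # shape (1, k)
V = W / m                                       # unit-norm columns
assert np.allclose(m * V, W)                    # decomposition is exact

# Adapt only the direction with a LoRA update, then renormalise
B = 0.01 * rng.normal(size=(d, r))
A = 0.01 * rng.normal(size=(r, k))
V_new = V + B @ A
W_new = m * (V_new / np.linalg.norm(V_new, axis=0, keepdims=True))

# Column magnitudes are set by m alone, not by the LoRA update
assert np.allclose(np.linalg.norm(W_new, axis=0), m.ravel())
print("DoRA-style decomposition check passed")
```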

Practice questions

  1. What is the mathematical justification for LoRA's low-rank assumption? (Answer: The hypothesis: weight updates ΔW = W_finetuned - W_pretrained have low intrinsic rank. Empirical evidence (Aghajanyan et al. 2020): fine-tuned models can be represented in low-dimensional intrinsic subspaces. The pretrained model already captures most of the relevant structure; task-specific adaptation requires a low-rank modification. LoRA decomposes ΔW = BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r << min(d,k). Hypothesis confirmed: r=4–16 achieves near full-rank fine-tuning quality on most NLP tasks.)
  2. How does LoRA merge adapters at inference time for zero overhead? (Answer: LoRA adds ΔW = BA to the frozen W₀. At inference: output = (W₀ + BA)x = W₀x + BAx. Pre-merging: must compute both W₀x and BAx, then add — two matrix multiplications. Post-merging: compute W_merged = W₀ + BA (one-time operation), then output = W_merged × x — single matrix multiplication, identical speed to original model. Merging is a pre-deployment step: load LoRA weights, add to base weights, save. The merged model is indistinguishable from full fine-tuning at serving time.)
  3. What is the effect of LoRA rank r on model capacity and training stability? (Answer: Low r (1–4): very few trainable parameters (~100K for 7B model). Fast, memory-efficient, less risk of overfitting. May underfit complex tasks. Medium r (8–32): standard range for most fine-tuning. Balances capacity and efficiency. High r (64–256): approaches full fine-tuning capacity. More parameters but still much cheaper than full FT. Stability: very high r can cause instability if learning rate is not reduced proportionally. Practical guideline: start with r=16, tune if underfitting/overfitting observed.)
  4. What is the difference between applying LoRA to attention weights only vs all linear layers? (Answer: Attention-only LoRA (q_proj, v_proj, k_proj, o_proj): targets the self-attention mechanism — where most task-specific information integration happens. Fewer parameters, faster. Original LoRA paper used q, v projections only. All-linear LoRA (attention + MLP + embeddings): more capacity to adapt. Usually better accuracy on complex tasks requiring deep factual changes. Memory cost: ~4× more LoRA parameters. For instruction following and style: attention-only sufficient. For domain knowledge adaptation (medical, legal): all-linear LoRA recommended.)
  5. What is QLoRA and how does it enable 65B model fine-tuning on a single GPU? (Answer: QLoRA (Dettmers et al. 2023): (1) Load base model in 4-bit NF4 (Normal Float 4) quantisation — 65B model: ~35GB instead of ~130GB. (2) Add LoRA adapters in BF16. (3) Train only LoRA adapters (base model frozen and quantised). (4) During forward/backward pass: dequantise base weights to BF16 on the fly for computation, then discard. Memory: 35GB (quantised base) + 4GB (LoRA + optimizer states) ≈ 39GB — fits on a single 40GB A100. Quantisation adds <1% performance loss vs FP16 LoRA on most tasks.)

