Quantization reduces the numerical precision of model weights and activations — from 32-bit or 16-bit floating-point to 8-bit or 4-bit integers. This slashes memory requirements (a 70B model at FP16 needs ~140GB; at INT4, ~35GB) while preserving most performance, making powerful models deployable on consumer hardware.
Floating-point precision explained
Every model parameter is stored as a number. The precision format determines how many bytes each number uses — and therefore the total memory footprint:
| Format | Bits | Bytes/param | 7B model size | 70B model size | Typical use |
|---|---|---|---|---|---|
| FP32 (float32) | 32 | 4 | 28 GB | 280 GB | Pretraining, gradient computation |
| BF16 (bfloat16) | 16 | 2 | 14 GB | 140 GB | Training + inference (A100/H100) |
| FP16 (float16) | 16 | 2 | 14 GB | 140 GB | Inference on older GPUs (V100) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Quantised inference — near-lossless |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Quantised inference — standard for local LLMs |
| INT2 / INT3 | 2–3 | 0.25–0.375 | ~1.8–2.6 GB | ~17.5–26 GB | Extreme compression — noticeable quality loss |
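Every size in the table is the same one-line multiplication: parameter count × bytes per parameter. A minimal sketch (function and dictionary names are my own, not from any library):

```python
# Raw weight storage per precision format, in bytes per parameter.
BYTES_PER_PARAM = {
    "FP32": 4.0, "BF16": 2.0, "FP16": 2.0,
    "INT8": 1.0, "INT4": 0.5, "INT2": 0.25,
}

def model_size_gb(n_params: float, fmt: str) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), excluding KV cache and activations."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

print(model_size_gb(7e9, "FP16"))   # 14.0 — matches the 7B row
print(model_size_gb(70e9, "INT4"))  # 35.0 — matches the 70B row
```

Real deployments need a few extra GB on top of this for the KV cache and activation buffers.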
Why BF16 over FP16?
BF16 and FP16 both use 16 bits but allocate them differently. FP16: 1 sign + 5 exponent + 10 mantissa bits. BF16: 1 sign + 8 exponent + 7 mantissa bits (the same exponent range as FP32). BF16 can represent much larger and much smaller magnitudes without overflow — critical during training, when gradient magnitudes vary widely. Modern AI GPUs (A100, H100, RTX 4090) have native BF16 tensor cores.
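The overflow difference is easy to demonstrate with only the standard library: FP16's largest finite value is 65504, while BF16 (simulated here by truncating an FP32 encoding to its top 16 bits) represents far larger magnitudes, just more coarsely. A toy sketch — `to_bf16` and `fits_fp16` are illustrative helpers, not a real BF16 implementation:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))
    return y

def fits_fp16(x: float) -> bool:
    """struct's 'e' format is IEEE half precision; packing past 65504 overflows."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

print(fits_fp16(65504.0))  # True — FP16's largest finite value
print(fits_fp16(70000.0))  # False — overflows FP16
print(to_bf16(70000.0))    # finite, just coarsely rounded (7 mantissa bits)
```

The same trade applies to gradients near zero: BF16 keeps FP32's exponent range at the cost of precision, which training tolerates far better than overflow.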
Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)
| Method | Description | Quality | Cost | Best for |
|---|---|---|---|---|
| Naive PTQ (absmax) | Scale weights linearly to INT8 range | Good for INT8, poor for INT4 | Negligible | Quick INT8 deployment |
| GPTQ | Layer-by-layer quantisation using second-order (Hessian) information to minimise per-layer error | Excellent — near FP16 quality at INT4 | Hours on 1 GPU | Offline GPU inference (vLLM, AutoGPTQ) |
| AWQ (Activation-aware) | Identifies important weights via activation magnitude, protects them from quantisation | Better than GPTQ at INT4 | Hours on 1 GPU | Production GPU inference — state of the art |
| GGUF / llama.cpp | CPU-friendly quantisation with mixed precision per tensor group | Good — especially Q4_K_M | Minutes | Local CPU/Apple Silicon inference |
| QAT (Quantization-Aware Training) | Simulate quantisation noise during training — model adapts | Best quality at any bit width | Full retraining budget | When maximum quality at low bit width is required |
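The simplest entry in the table, absmax PTQ, fits in a few lines: scale so the largest absolute weight maps to 127, round, and store the scale for dequantisation. A pure-Python toy (list-based for clarity, not a tensor kernel):

```python
def absmax_quantize(weights):
    """Naive symmetric INT8 quantisation: largest |w| maps to +/-127."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]  # integer codes in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.1, -0.4, 0.2, 1.0]
q, s = absmax_quantize(w)
w_hat = dequantize(q, s)
# One large weight (here 1.0) stretches the scale and coarsens every other
# weight — exactly the outlier problem GPTQ and AWQ are designed around.
```

At INT8 (255 levels) the rounding error is tiny; shrink to INT4's 16 levels and the same scheme degrades sharply, which is why the sophisticated methods below exist.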
AWQ vs GPTQ in 2025
AWQ consistently outperforms GPTQ at the same bit width — the key insight is that not all weights are equally important. AWQ identifies the ~1% of weights with the highest activation magnitudes and preserves their precision. For production GPU serving, AWQ INT4 is the current best practice.
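The core trick can be caricatured in a few lines: rank channels by activation magnitude, then scale the salient ~1% of weight channels up before quantising (and the matching activations down by the inverse), which leaves the product unchanged but shrinks the salient weights' relative rounding error. This is a toy illustration with invented names, not the published AWQ algorithm:

```python
def awq_style_scales(act_magnitudes, top_frac=0.01, boost=2.0):
    """Assign a larger pre-quantisation scale to the highest-activation channels.
    Scaling weight channel c up by s and its input activation down by 1/s keeps
    w_c * x_c identical while reducing w_c's relative quantisation error."""
    k = max(1, int(len(act_magnitudes) * top_frac))
    salient = sorted(range(len(act_magnitudes)),
                     key=lambda c: act_magnitudes[c], reverse=True)[:k]
    return [boost if c in salient else 1.0 for c in range(len(act_magnitudes))]

acts = [0.1, 9.0, 0.2, 0.3]      # channel 1 sees much larger inputs
print(awq_style_scales(acts))    # [1.0, 2.0, 1.0, 1.0]
```

The real method searches for the per-channel scaling factor on calibration data rather than using a fixed boost, but the principle — protect weights that multiply large activations — is the same.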
GGUF and llama.cpp: running LLMs locally
GGUF (GPT-Generated Unified Format) is the quantisation format used by llama.cpp — a pure C++ LLM inference library that runs on CPU, Apple Silicon, and consumer GPUs with no CUDA required:
Running a quantised LLM locally with llama.cpp
```shell
# Install llama.cpp (macOS with Metal GPU acceleration)
brew install llama.cpp

# Download a GGUF model (Llama 3.1 8B Q4_K_M = 4.9GB)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference (uses Apple Silicon GPU via Metal; --gpu-layers 99 offloads all layers)
llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -n 512 -p "Explain quantum computing in 3 sentences" --gpu-layers 99

# GGUF quantisation levels (Q4_K_M recommended):
# Q2_K:   ~2.6GB, significant quality loss
# Q4_K_M: ~4.9GB, excellent quality — best size/quality balance
# Q5_K_M: ~5.7GB, near-lossless
# Q8_0:   ~8.5GB, essentially identical to FP16
# F16:    ~15GB, full precision
```
Running 70B on a MacBook Pro
A MacBook Pro M3 Max with 128GB unified memory can run LLaMA 3 70B Q4_K_M (~40GB) entirely in memory at ~8–12 tokens/second. Apple Silicon's unified memory architecture (no separate VRAM) makes it uniquely capable for large quantised models — memory bandwidth (400 GB/s on the M3 Max, ~800 GB/s on the M3 Ultra) is the only bottleneck.
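That throughput figure falls out of the bandwidth bound: autoregressive decoding streams every weight through the memory bus once per generated token, so tokens/s is capped at roughly bandwidth ÷ model size. A back-of-envelope check (ignoring KV-cache traffic and compute, so it is an upper bound):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound for memory-bandwidth-bound decoding:
    each token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb

# M3 Max (400 GB/s) running a ~40GB Q4_K_M 70B model:
print(max_tokens_per_sec(400, 40))  # 10.0 — consistent with the observed ~8-12 tok/s
```

The same formula explains why quantisation speeds up decoding even without integer math: halving the bytes per weight halves the data each token must stream.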
Mixed precision and hardware support
| Hardware | Native INT8 ops | Native INT4 ops | Unified memory | Best for |
|---|---|---|---|---|
| NVIDIA A100 80GB | ✅ 624 TOPS | ✅ INT4 tensor cores | ❌ | Large model training + serving |
| NVIDIA H100 80GB | ✅ 1979 TOPS INT8 | ✅ FP8 native | ❌ | Frontier model training |
| NVIDIA RTX 4090 24GB | ✅ 1457 TOPS INT8 | ⚠️ via software | ❌ | Consumer fine-tuning + inference |
| Apple M3 Max 128GB | ✅ (ANE) | ✅ (ANE) | ✅ 400GB/s | Local large model inference |
| Apple M3 Ultra | ✅ | ✅ | ✅ ~800GB/s | Best local inference available (2025) |
Mixed precision inference pattern
Production LLM inference uses mixed precision: weights stored in INT4 on disk/VRAM, loaded and dequantised to BF16 for actual matrix multiplications (accumulation in BF16/FP32 preserves numerical stability), then results cast back. This pattern (store in INT4, compute in BF16) achieves 90–95% of the memory reduction with near-FP16 accuracy.
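The storage half of this pattern can be sketched with plain integers: two 4-bit codes packed per byte, unpacked and dequantised to floats just before the multiply. A toy illustration (symmetric scale with a fixed zero point of 8; real kernels fuse the unpack into the matmul and use per-group scales):

```python
def pack_int4(codes):
    """Pack pairs of 4-bit codes (0..15) into single bytes: lo nibble first."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        lo, hi = codes[i], codes[i + 1]
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_dequant(packed, scale, zero_point=8):
    """Unpack to 4-bit codes, then dequantise to floats for the matmul."""
    codes = []
    for b in packed:
        codes += [b & 0x0F, b >> 4]
    return [(c - zero_point) * scale for c in codes]

packed = pack_int4([8, 12, 4, 15])        # 2 bytes instead of 16 (4 x FP32)
print(unpack_dequant(packed, scale=0.5))  # [0.0, 2.0, -2.0, 3.5]
```

The memory win comes entirely from the packed representation at rest; the arithmetic still happens in floating point, which is why accuracy stays near FP16.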
Quantization's impact on model quality
| Bit width | Quality vs FP16 (large models 70B+) | Quality vs FP16 (small models 3–7B) | Recommended? |
|---|---|---|---|
| FP16 / BF16 | 100% (baseline) | 100% (baseline) | ✅ If VRAM allows |
| INT8 (GPTQ/AWQ) | ~99.5% — essentially lossless | ~98% — minimal degradation | ✅ Yes — free performance at half the memory |
| INT4 (GPTQ/AWQ) | ~96–98% — barely noticeable | ~92–95% — slightly noticeable on hard tasks | ✅ Yes — standard for local inference |
| INT4 (GGUF Q4_K_M) | ~96% — comparable to AWQ | ~91–94% | ✅ Yes — best for CPU/Apple Silicon |
| INT3 / INT2 | ~85–90% — noticeable regression | ~75–85% — significant degradation | ⚠️ Only when size is critical |
The 70B INT4 vs 13B FP16 principle
A 70B parameter model at INT4 (~35GB) fits on the same GPU as a 13B model at FP16 (~26GB). The 70B INT4 almost always outperforms the 13B FP16 — larger models tolerate lower precision much better than smaller ones. This makes quantisation the standard approach for maximising capability-per-dollar in deployment.
Practice questions
- What is the difference between post-training quantisation (PTQ) and quantisation-aware training (QAT)? (Answer: PTQ: quantise a trained model's weights without any retraining — fast (minutes), no training data required. Quality: INT8 PTQ achieves ~1% accuracy drop; INT4 PTQ achieves ~3-5% drop. QAT: simulate quantisation during training (fake quantise and dequantise weights in the forward pass, train with full precision gradients). The model adapts to quantisation noise. Quality: QAT matches full precision accuracy at INT8; INT4 QAT achieves ~1% drop. Required for very low bit (INT4 and below) without significant accuracy loss.)
- What is GPTQ and why is it important for LLM quantisation? (Answer: GPTQ (Frantar et al. 2022): layer-by-layer post-training quantisation using second-order weight updates (Hessian-based). For each layer, it iteratively quantises weights and compensates for quantisation error by updating the remaining unquantised weights in that layer — using the inverse Hessian of the loss. Achieves INT4 quantisation of 175B GPT-3 in 4 GPU-hours with <1% perplexity increase. Critical for making large LLMs practically deployable: GPT-J, LLaMA, and Falcon all have GPTQ-quantised versions serving millions of users.)
- What is AWQ (Activation-Aware Weight Quantisation) and how does it improve on GPTQ? (Answer: AWQ (Lin et al. 2023): identifies salient weights (those corresponding to large activation channels) and protects them by scaling before quantisation. Observation: 1% of weights are crucial — they correspond to channels with very large input activations. GPTQ quantises all weights equally. AWQ scales crucial weight channels by an activation-dependent factor before quantisation, effectively giving them higher precision. Result: AWQ achieves better perplexity than GPTQ at same bit width, especially at very low precision (INT3). AWQ is the preferred quantisation method in llama.cpp and many deployment frameworks.)
- What is the difference between weight-only quantisation and weight-activation quantisation? (Answer: Weight-only (W4A16): quantise model weights to INT4/INT8; keep activations and computations in FP16. Reduces model size (memory bandwidth) but not compute FLOPs. Dequantise weights to FP16 before matrix multiply. Memory-bandwidth-bound operations benefit; compute-bound operations do not. Weight-activation (W8A8): quantise both weights AND activations to INT8. Enables INT8 matrix multiply (much faster on A100/H100 Tensor Cores). Requires careful per-token activation quantisation — LLM.int8() and SmoothQuant handle this.)
- A 7B model at FP16 requires 14GB VRAM. What quantisation enables running it on a 6GB GPU? (Answer: INT4 weight-only quantisation (GPTQ/AWQ/4-bit NF4): each weight is 4 bits instead of 16 bits — 4× size reduction. 14GB / 4 = 3.5GB model size + ~1.5GB for KV cache and activations ≈ 5GB total. Fits in a 6GB GPU. bitsandbytes library (QLoRA): uses NF4 (Normal Float 4-bit), which better preserves values near zero (where most weights cluster after training). Practical: llama.cpp's llama-quantize tool with the Q4_K_M type, or HuggingFace load_in_4bit=True in BitsAndBytesConfig.)
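The NF4 intuition in the last answer — place the 16 available 4-bit levels at quantiles of a normal distribution, so levels cluster where trained weights cluster (near zero) — can be illustrated with the standard library. This is a toy sketch of the idea, not bitsandbytes' exact NF4 codebook:

```python
from statistics import NormalDist

def nf4_style_levels(n=16):
    """Place n levels at evenly spaced quantiles of N(0, 1), normalised to [-1, 1]."""
    nd = NormalDist()
    # offset by 0.5/n to avoid the infinite 0th/100th quantiles
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(x) for x in raw)
    return [x / m for x in raw]

uniform = [-1 + 2 * i / 15 for i in range(16)]  # plain INT4: evenly spaced levels
normal = nf4_style_levels()

# The two central normal-quantile levels sit much closer together than the
# uniform ones (gap ~0.084 vs a constant ~0.133), so the many small weights
# get finer resolution at the cost of coarser steps in the rare tails.
print(normal[8] - normal[7], uniform[8] - uniform[7])
```

Uniform INT4 wastes half its levels on large magnitudes that almost no trained weight reaches; quantile spacing spends the bit budget where the weight mass actually is.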