Inference is the process of running a trained AI model to generate predictions or outputs, as opposed to training, which adjusts the model's weights. For LLMs, inference means autoregressive token generation: predicting one token at a time, left to right. Efficient inference is critical for cost and latency in production AI systems.
How LLM inference works
LLM inference has two distinct phases with very different computational characteristics — understanding them is key to optimizing serving cost and latency.
| Phase | What happens | Compute type | Bottleneck |
|---|---|---|---|
| Prefill | All input (prompt) tokens processed in a single parallel forward pass | Compute-bound: all tokens computed in parallel | FLOPS — add more GPUs to speed up |
| Decode | Generate output tokens one at a time, each requiring a full forward pass | Memory-bandwidth-bound: reads all KV cache per step | GPU memory bandwidth — hard to parallelize |
Autoregressive generation: at each step t, the model produces hidden state h_t from all preceding tokens, projects it to the vocabulary via the unembedding matrix W_U, and samples the next token. With a KV cache, each step computes queries, keys, and values only for the new token; attention reads the keys and values of past tokens from the cache instead of recomputing them.
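The two phases can be sketched in a few lines, using toy vectors instead of a real transformer (keys and values here are just the input vectors, and "sampling" is an argmax over dot-product scores; all names are illustrative):

```python
# Toy sketch of prefill + cached decode (NOT a real transformer).
def generate(prompt_vecs, steps):
    k_cache, v_cache = [], []

    # --- Prefill: every prompt token is processed in one pass,
    # populating the KV cache for all positions at once.
    for x in prompt_vecs:
        k_cache.append(x)
        v_cache.append(x)

    h = prompt_vecs[-1]      # hidden state at the last prompt position
    chosen = []
    for _ in range(steps):
        # --- Decode step: attention must READ the entire cache...
        scores = [sum(a * b for a, b in zip(h, k)) for k in k_cache]
        best = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(best)  # stand-in for "sample next token"
        h = v_cache[best]
        # ...but only WRITES one new K/V pair (the new token's),
        # so per-step compute is O(n) rather than O(n^2).
        k_cache.append(h)
        v_cache.append(h)
    return chosen, len(k_cache)

toks, cache_len = generate([[1.0, 0.0], [0.0, 1.0]], steps=3)
# the cache ends with (prompt length + generated length) entries
```

The read/write asymmetry in the decode loop is the whole story: each step touches the full cache but appends only one entry.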
Why decode is memory-bandwidth-bound
During decode, generating each token requires reading all model weights (~140GB for a 70B model in FP16) from GPU HBM into on-chip SRAM, while performing only a tiny amount of compute (one token's worth). An H100 offers ~3.35TB/s of HBM bandwidth but roughly 1,000 TFLOPS of dense FP16 compute. A single-token forward pass does about 2 FLOPs per parameter while reading 2 bytes per parameter, so the GPU sits memory-bound by a factor of several hundred. This is why batching decode requests dramatically improves GPU utilization: it amortizes the weight-read cost across many concurrent tokens.
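A back-of-envelope version of this arithmetic (hardware numbers are rough datasheet values, not measurements):

```python
# Why single-token decode under-utilizes an H100: compare the time to
# stream the weights against the time to do the math.
params = 70e9
bytes_per_step = params * 2               # FP16 weights read from HBM: ~140 GB
flops_per_token = 2 * params              # ~2 FLOPs per parameter per token

hbm_bw = 3.35e12                          # H100 HBM bandwidth, bytes/s
peak_flops = 1.0e15                       # ~1,000 TFLOPS dense FP16

t_memory = bytes_per_step / hbm_bw        # ~42 ms just to read the weights
t_compute = flops_per_token / peak_flops  # ~0.14 ms of actual compute

imbalance = t_memory / t_compute          # memory slower by roughly 300x
# A decode batch of about this many concurrent tokens would balance the
# two, which is why batching recovers GPU utilization.
```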
KV cache and memory management
The KV cache is the single most important memory structure in LLM inference — understanding it explains why long contexts are expensive and why batching is complex.
KV cache memory formula: 2 (K and V) × layers × KV heads × head_dim × sequence_length (T) × batch_size (B) × bytes per element. For LLaMA 3 70B (80 layers, 8 KV heads via GQA, head_dim 128): 2 × 80 × 8 × 128 × T × B × 2 bytes (FP16) ≈ 0.33MB per token per batch element. At T=8K, B=32: ~86GB just for KV cache.
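The formula translates directly to code (dimensions are LLaMA 3 70B's published config; the helper name is ours):

```python
# KV cache size per the formula above. LLaMA 3 70B: 80 layers,
# 8 KV heads (GQA), head_dim 128; FP16 is 2 bytes per element.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)   # 327,680 bytes
total = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=32)   # ~86 GB (80 GiB)
```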
| Technique | What it does | Benefit |
|---|---|---|
| KV cache | Cache past key-value pairs; reuse in each decode step | O(n²) → O(n) compute; enables long contexts |
| PagedAttention (vLLM) | Pages KV cache like OS virtual memory; shares pages across requests | 20–40% higher GPU utilization; no memory fragmentation |
| Quantized KV cache | Store KV cache in INT8 or FP8 instead of FP16 | 2–4× memory reduction with minimal quality loss |
| Sliding window attention | Only keep KV cache for the last W tokens (e.g., W=4096) | O(W) memory instead of O(n), at the cost of long-range attention |
| Multi-Query / Grouped-Query Attention (MQA/GQA) | Share K and V across query heads: one KV head (MQA) or a few KV groups (GQA) | 8–32× smaller KV cache (GQA is used in LLaMA 3, Mistral, Gemma) |
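Sliding-window eviction from the table above fits in a few lines; a bounded deque makes the O(W) bound concrete (W and the key/value strings are made up for illustration):

```python
from collections import deque

# Sketch: a sliding-window KV cache keeping only the last W positions,
# so memory stays O(W) no matter how long generation runs.
W = 4
kv = deque(maxlen=W)                   # evicts the oldest entry automatically

for pos in range(10):                  # "generate" 10 tokens
    kv.append(("k%d" % pos, "v%d" % pos))

positions = [k for k, _ in kv]         # only the last W entries survive
```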
Inference optimization techniques
| Technique | Mechanism | Speedup | Tradeoff |
|---|---|---|---|
| Speculative decoding | Draft model generates k tokens fast; main model verifies all k in one parallel pass | 2–3× | Requires matching draft model; benefit varies with acceptance rate |
| Continuous batching | Process tokens from multiple requests in same batch; replace finished sequences immediately | 5–10× throughput | Higher latency for individual requests |
| FlashAttention 2/3 | Fused attention kernel keeps Q, K, V tiles in fast SRAM; avoids HBM round-trips | 2–4× attention speed, 5–20× less attention memory | GPU-specific kernels (CUDA-first; ROCm support lags) |
| Tensor parallelism | Split attention heads or FFN dimensions across GPUs; all-reduce each layer | Linear with # GPUs | Communication overhead; needs fast interconnect (NVLink) |
| Pipeline parallelism | Different model layers on different GPUs; micro-batching hides pipeline bubbles | Near-linear throughput with # GPUs | Bubble overhead; no per-request latency gain |
| AWQ / GPTQ quantization | Quantize weights to INT4/INT8; reduce memory bandwidth bottleneck | 1.5–4× throughput | Slight quality loss; calibration required |
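Continuous batching from the table above can be sketched as a toy step-counting scheduler (no real model; requests are just remaining token counts, and the function name is ours):

```python
# Toy continuous-batching scheduler: at every decode step, finished
# sequences are swapped out for waiting requests, so slots never idle.
def serve(request_lengths, batch_size):
    waiting = list(request_lengths)    # tokens each request still needs
    running = []
    steps = 0
    while waiting or running:
        # iteration-level scheduling: fill free slots immediately
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                     # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps

# With batch_size=2, the three 10-token requests ride along in slots
# that static batching would leave idle next to the 1000-token request.
steps = serve([10, 1000, 10, 10], batch_size=2)
```

Static batching on the same workload would run the batch [10, 1000] for 1000 steps (request 1's slot idle for 990 of them) and then [10, 10] for 10 more; continuous batching finishes in 1000 steps total.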
Speculative decoding in depth
Speculative decoding works because: (1) generating draft tokens with a small model (e.g., 3B) is much faster than with the main model (70B), and (2) verifying k tokens in parallel with the main model costs no more than generating one token, since the forward pass has the same shape. With a ~80% token acceptance rate, this yields roughly 2–3× speedup while producing output identical to running the main model alone. Frontier providers are widely reported to run speculative decoding in production; Medusa (self-speculative decoding with multiple extra heads) avoids the need for a separate draft model.
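The draft/verify loop can be shown with deterministic toy "models" (both functions are invented stand-ins; `target_next` plays the expensive model, `draft_next` a cheap approximation that agrees about 80% of the time):

```python
def target_next(seq):
    return (seq[-1] * 3 + 1) % 97          # stand-in for one big-model step

def draft_next(seq):
    # correct whenever seq[-1] % 5 != 0, i.e. roughly 80% acceptance
    return (seq[-1] * 3 + 1) % 97 if seq[-1] % 5 else 0

def speculative_generate(prompt, n_new, k=4):
    seq, target_calls = list(prompt), 0
    while len(seq) < len(prompt) + n_new:
        drafts = []
        for _ in range(k):                 # draft proposes k tokens (cheap)
            drafts.append(draft_next(seq + drafts))
        # One target "pass": in a real model the k verifications below
        # share a single parallel forward pass; here we just count passes.
        target_calls += 1
        for i in range(k):
            t = target_next(seq)           # target's token at this position
            seq.append(t)                  # always keep the target's choice
            if t != drafts[i]:
                break                      # first mismatch: discard the rest
        else:
            seq.append(target_next(seq))   # all k accepted: free bonus token
    return seq[len(prompt):len(prompt) + n_new], target_calls

def greedy(prompt, n_new):                 # reference: target model alone
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]

out, calls = speculative_generate([1], n_new=20)
# out matches greedy([1], 20) exactly, in far fewer target passes
```

Because the verifier always keeps the target's token at every position, the output is bit-identical to greedy decoding with the target model; the draft only decides how many tokens each expensive pass yields.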
LLM inference infrastructure in 2025
| Framework / Service | Type | Best for | Key feature |
|---|---|---|---|
| vLLM | Open-source server | Production throughput-optimized serving | PagedAttention, continuous batching, multi-LoRA |
| Ollama | Open-source local | Local dev, single-machine serving | One-command model download + serve; GGUF support |
| llama.cpp | Open-source library | CPU inference, low-VRAM GPU, edge deployment | Quantized GGUF; CPU+GPU split; runs on MacBooks |
| TensorRT-LLM | NVIDIA framework | Maximum performance on NVIDIA GPUs | FP8, kernel fusion, speculative decoding; H100 optimized |
| SGLang | Open-source server | Structured generation, complex multi-call workflows | RadixAttention (KV cache sharing across similar prefixes) |
| Groq LPU | Cloud inference | Fastest token generation speed | Custom LPU chip: 500+ tokens/sec on 70B; not cheapest |
| Together AI / Fireworks | Managed API | Cheap open-source model inference | Per-token pricing, open-source model access |
| AWS Bedrock / Vertex AI | Enterprise managed | Enterprise compliance + multi-provider access | SLA, VPC, audit logging, fine-tune hosting |
Cost benchmark (early 2025)
GPT-4o: ~$10–15/M output tokens. Claude 3.5 Sonnet: ~$15/M. Llama 3.1 70B via Together AI: ~$0.88/M. Self-hosted Llama 3.1 70B on vLLM (4× A100 80GB): ~$0.20/M at full utilization. The 50–75× cost gap between frontier closed models and self-hosted open-source explains why companies with high token volumes increasingly fine-tune open-source models for production.
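The gap, and the throughput assumption buried in the self-hosted figure, as simple arithmetic (the $2.50/GPU-hour A100 rate is an assumption; the other figures come from the text above):

```python
# Per-million-output-token prices quoted above.
api_cost = {"gpt-4o": 10.0, "claude-3.5-sonnet": 15.0,
            "llama-3.1-70b-together": 0.88}        # $/M output tokens
self_hosted = 0.20                                  # $/M, vLLM on 4x A100

gap = {name: round(c / self_hosted) for name, c in api_cost.items()}

# Aggregate throughput the $0.20/M figure implies at full utilization:
gpu_hourly = 4 * 2.50                               # assumed cluster $/hour
implied_tps = gpu_hourly * 1e6 / (self_hosted * 3600)   # ~13,900 tok/s
```

Note the implied ~14K aggregate tokens/second only holds with heavy continuous batching; a lightly loaded cluster costs the same per hour while producing far fewer tokens.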
Latency vs throughput tradeoffs
Latency and throughput are fundamentally in tension for LLM serving — optimizing one hurts the other. Choosing the right operating point depends on your use case.
| Metric | Definition | Typical target | Critical for |
|---|---|---|---|
| TTFT (Time To First Token) | Time from request sent to first token received | <500ms for interactive | Chatbots, coding assistants — perceived responsiveness |
| TPOT (Time Per Output Token) | Average time between consecutive output tokens | <50ms (~20 tok/s) | Streaming readability — faster than human reading speed |
| End-to-end latency | Total time from request to complete response | <5s for short responses | Non-streaming batch use cases |
| Throughput (tokens/sec) | Total tokens generated per second across all requests | Maximize for batch | Document processing, offline summarization pipelines |
| Requests per second (RPS) | Rate of completed requests per second | Varies with batch size and output length | API scaling, cost efficiency |
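The first three metrics fall out of per-token receive timestamps; a sketch of how a benchmark harness might compute them (function name and example timings are ours):

```python
# TTFT / TPOT / end-to-end latency from token timestamps (seconds).
def latency_metrics(t_request, token_times):
    ttft = token_times[0] - t_request
    e2e = token_times[-1] - t_request
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)           # mean inter-token gap
    return ttft, tpot, e2e

# e.g. first token 400 ms after the request, then one token every 40 ms
times = [0.4 + 0.04 * i for i in range(50)]
ttft, tpot, e2e = latency_metrics(0.0, times)
# ttft = 0.4s (borderline), tpot = 40 ms (25 tok/s, meets the target)
```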
Streaming and perceived latency
Streaming (Server-Sent Events, SSE) returns tokens as they are generated — the user sees text appearing word-by-word rather than waiting for the full response. This dramatically improves perceived responsiveness even if total generation time is identical. A response that takes 5s to complete feels fast if you see the first tokens in 200ms. All major LLM APIs (OpenAI, Anthropic, Groq) support streaming; always use it for interactive applications.
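The perceived-latency effect is easy to demonstrate with a simulated stream (the generator stands in for reading SSE chunks from an API; the TTFT and per-token delays are made-up small values):

```python
import time

def generate_tokens(n, ttft=0.10, tpot=0.02):
    time.sleep(ttft)                  # prefill + queueing before token 1
    for i in range(n):
        if i:
            time.sleep(tpot)
        yield f"tok{i}"

start = time.monotonic()
first_seen = None
tokens = []
for tok in generate_tokens(5):
    if first_seen is None:
        first_seen = time.monotonic() - start   # the UI can render here
    tokens.append(tok)
total = time.monotonic() - start
# first_seen is a fraction of total: identical generation time, but the
# user sees output well before the response is complete.
```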
Practice questions
- What is the difference between throughput and latency in LLM inference, and why is there a fundamental trade-off? (Answer: Latency: time from request to first token (TTFT) + time to generate full response. User-facing — measures how fast responses feel. Throughput: total tokens generated per second across all concurrent users. Server-side — measures capacity. Trade-off: batching requests improves throughput (processes many tokens simultaneously) but increases latency for individual users (must wait for batch). At batch_size=1: minimum latency. At batch_size=256: maximum throughput. Production serving optimises the Pareto frontier between these, using continuous batching to approach both simultaneously.)
- What is continuous batching (iteration-level scheduling) and why did it transform LLM serving? (Answer: Traditional static batching: group N requests together, wait until ALL finish generating, then start next batch. If request 1 finishes in 10 tokens and request 2 in 1000 tokens, request 1's GPU slot sits idle for 990 token-steps. Continuous batching (Orca, vLLM): as soon as a request finishes, its slot is immediately replaced with a new request. The batch changes composition at every token generation step. Result: GPU utilisation goes from ~20% (static) to ~80%+ (continuous). vLLM pioneered this; it is now the standard in all production LLM serving systems.)
- What is TTFT (Time to First Token) and why is it more important than total generation time for user experience? (Answer: TTFT: elapsed time from request submission until the first output token is generated. Covers: network latency + prompt processing (prefill) + scheduling queue wait. User experience: TTFT determines how quickly the UI can show 'something is happening.' A response that streams from token 1 in 500ms feels faster than a response that starts in 2000ms — even if both complete in 5 seconds. This is why streaming is universal in production LLM APIs: show the first token immediately rather than waiting for completion.)
- What hardware is used for LLM inference and what determines model serving cost? (Answer: Primary hardware: NVIDIA H100 (80GB, $30K), H200 (141GB, $40K), A100 (80GB, $10K). AMD MI300X: competitive, gaining traction. Google TPUv5: used for internal Google serving. Cost drivers: (1) GPU VRAM (must hold model weights + KV cache). (2) GPU compute (tokens/sec per GPU). (3) Memory bandwidth (memory-bandwidth-bound decoding phase). Pricing: H100 SXM: $2–3/GPU-hour on cloud. Serving a 70B model: ~4 H100s needed, ~$8-12/hour, ~100 tokens/second → $0.023–0.033/1K output tokens (similar to commercial API pricing).)
- What is PagedAttention (used in vLLM) and how does it reduce memory waste in LLM serving? (Answer: Standard KV cache: pre-allocated contiguously for max_sequence_length. For 2048-token max: 2048 positions reserved even if request only generates 100 tokens → 95% waste. PagedAttention (Kwon et al. 2023): divides KV cache into fixed-size pages (blocks), allocating pages on demand like virtual memory. Non-contiguous pages are accessed via a block table. Result: near-zero internal fragmentation, memory utilisation from 20–30% to 90%+, supports 2–4× more concurrent requests on same hardware. PagedAttention is the core innovation of vLLM.)
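The PagedAttention answer above can be made concrete with a toy block allocator (class and method names are ours; real vLLM additionally shares blocks across requests with copy-on-write):

```python
# Toy paged KV-cache allocator: fixed-size blocks handed out on demand,
# tracked in a per-request block table mapping logical -> physical blocks.
BLOCK = 16                                  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # request -> block table
        self.lens = {}                       # request -> tokens stored

    def append_token(self, req):
        n = self.lens.get(req, 0)
        if n % BLOCK == 0:                   # current block full (or none yet)
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lens[req] = n + 1

    def release(self, req):                  # request finished: blocks reusable
        self.free.extend(self.tables.pop(req, []))
        self.lens.pop(req, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                          # a 20-token sequence needs only
    cache.append_token("req-A")              # ceil(20/16) = 2 blocks, not a
blocks_used = len(cache.tables["req-A"])     # max_seq_len-sized reservation
```

Contrast with contiguous pre-allocation: a 2048-token reservation for this request would pin 128 blocks up front; paging allocates 2 and returns them the moment the request finishes.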